Open LilyAnderssonLee opened 1 year ago
Kaiju database: protein sequences from genome assemblies of archaea and bacteria with assembly level "Complete Genome", as well as viral protein sequences from NCBI RefSeq.
Kraken2 database: archaea, bacteria, viral, human, UniVec_Core, protozoa, and fungi.
Inference:
For a DNA sample:
Kaiju assigned 1017 reads to Viruses, far more than Kraken2, which assigned only 33. This disparity aligns with the observations made in Kaiju's original paper. However, I suspect one of the key factors contributing to this difference is that the Kaiju database lacks human genome sequences, which typically constitute a substantial portion (up to 99%) of the total classified reads. Furthermore, many of the reads that Kaiju predicted as viral were identified as human by BLAST.
This finding is consistent with the original Kaiju paper's assertion that Kaiju can classify up to 10 times more reads in real metagenomic samples. However, in our case it also produced more false positives than Kraken2.
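To make the per-read comparison concrete, here is a minimal Python sketch that lists reads on which the two classifiers disagree (for example, reads Kaiju calls viral but Kraken2 calls human). It assumes both tools were run with their default per-read output, where the first three tab-separated columns are status (`C`/`U`), read ID, and taxon ID, and that Kraken2 was run without `--use-names`. The function names and the toy records are hypothetical, not taken from our runs.

```python
import io


def read_assignments(handle):
    """Map read ID -> taxon ID (string) for classified reads only.

    Assumption: columns 1-3 are status (C/U), read ID, taxon ID.
    """
    assignments = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "C":
            continue  # skip unclassified or malformed lines
        assignments[fields[1]] = fields[2]
    return assignments


def compare(kaiju, kraken2):
    """Split read IDs into agreement categories between the two tools."""
    both = kaiju.keys() & kraken2.keys()
    return {
        "kaiju_only": sorted(kaiju.keys() - kraken2.keys()),
        "kraken2_only": sorted(kraken2.keys() - kaiju.keys()),
        "agree": sorted(r for r in both if kaiju[r] == kraken2[r]),
        "disagree": sorted(r for r in both if kaiju[r] != kraken2[r]),
    }


# Made-up toy records: 10239 = Viruses, 9606 = Homo sapiens, 562 = E. coli.
KAIJU_EXAMPLE = "C\tread1\t10239\nC\tread2\t562\nU\tread3\t0\n"
KRAKEN2_EXAMPLE = (
    "C\tread1\t9606\t150\t9606:116\n"
    "C\tread2\t562\t150\t562:116\n"
    "U\tread3\t0\t150\t0:116\n"
    "C\tread4\t9606\t150\t9606:116\n"
)

result = compare(
    read_assignments(io.StringIO(KAIJU_EXAMPLE)),
    read_assignments(io.StringIO(KRAKEN2_EXAMPLE)),
)
print(result)
```

With real data, the `io.StringIO` objects would be replaced by open file handles on the two output files. The `disagree` list is the natural set to check with BLAST.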
Update: Databases play a very important role in classification; please see "Major data analysis errors invalidate cancer microbiome findings".
To address these discrepancies, the Kaiju DB should be rebuilt to include human, fungi, and plasmid sequences, and the comparison repeated.
Test Kaiju on validation samples using the Kraken2Seq_kaiju database, which is built on the same sequence IDs as the Kraken2 database: k2_pluspf_20231009.
Kaiju finds maximum (in-)exact matches on the protein-level using the Burrows–Wheeler transform.
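For readers unfamiliar with the technique, the following is a toy Python sketch of exact matching by backward search over a Burrows–Wheeler transform. It is not Kaiju's implementation: it builds the index naively, handles exact matches only, and skips the six-frame translation of reads. The function names and the short protein string are made up for illustration.

```python
def bwt_index(text):
    """Build a toy FM-index (C table and occurrence counts) for text."""
    text += "$"  # sentinel; sorts before all amino-acid letters
    # Naive suffix array: fine for a toy example, far too slow for real data.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[c]: number of characters in text strictly smaller than c
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return c_table, occ, len(bwt)


def longest_suffix_match(query, index):
    """Length of the longest suffix of query found exactly in the text."""
    c_table, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    matched = 0
    for ch in reversed(query):  # backward search extends the match leftwards
        if ch == "$" or ch not in c_table:
            break
        new_lo = c_table[ch] + occ[ch][lo]
        new_hi = c_table[ch] + occ[ch][hi]
        if new_lo >= new_hi:
            break  # extending by ch yields no occurrence
        lo, hi = new_lo, new_hi
        matched += 1
    return matched


index = bwt_index("MKVLAAGIV")  # hypothetical reference protein
print(longest_suffix_match("XXLAAG", index))
```

Each step narrows a suffix-array interval by one character, so the cost of a query depends on its length rather than on the size of the database.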
TO DO Compare the Kaiju classification to Kraken2 (genome-level) based on clinical case #374764.