genomic-medicine-sweden / taxprofiler

Taxonomic profiling of shotgun metagenomic data
https://nf-co.re/taxprofiler
MIT License

Test the profiler Kaiju #29

Open LilyAnderssonLee opened 1 year ago

LilyAnderssonLee commented 1 year ago

Kaiju finds maximum (in-)exact matches on the protein-level using the Burrows–Wheeler transform.

TODO: Compare the Kaiju classification with Kraken2 (genome-level) on clinical case #374764.

LilyAnderssonLee commented 1 year ago

Kaiju database: protein sequences from genome assemblies of archaea and bacteria with assembly level "Complete Genome", as well as viral protein sequences from NCBI RefSeq.

Kraken2 database: archaea, bacteria, viral, human, UniVec_Core, protozoa and fungi.

Inference:

For a DNA sample: Kaiju assigned 1017 reads to Viruses, far more than Kraken2, which assigned only 33 reads to Viruses. This disparity matches the observations in Kaiju's original paper. However, I suspect one of the key factors behind the difference is that the Kaiju database lacks human genome sequences, and human reads typically make up a substantial portion (up to 99%) of the total classified reads. Furthermore, BLAST identified many of the predicted viral reads as human.
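As a rough way to screen for this before running BLAST, the per-read outputs of the two tools can be cross-checked. The sketch below lists reads that Kraken2 assigns to human (taxon ID 9606) but that Kaiju assigns to something else. It assumes both files are in the default tab-separated per-read format (column 1 `C`/`U`, column 2 read name, column 3 taxon ID) and that Kraken2 was run without `--use-names`. The function names and the example lines are made up for illustration.

```python
HUMAN_TAXID = "9606"  # NCBI taxonomy ID for Homo sapiens


def read_assignments(lines):
    """Map read name -> taxon ID for classified reads ('C' in column 1)."""
    assignments = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "C":
            assignments[fields[1]] = fields[2].strip()
    return assignments


def kaiju_hits_on_human_reads(kaiju_lines, kraken2_lines):
    """Reads that Kraken2 calls human but Kaiju assigns to another taxon."""
    kaiju = read_assignments(kaiju_lines)
    kraken2 = read_assignments(kraken2_lines)
    return sorted(
        read
        for read, taxid in kraken2.items()
        if taxid == HUMAN_TAXID and read in kaiju and kaiju[read] != HUMAN_TAXID
    )


# Synthetic example lines (not real data from the clinical case).
kaiju_example = [
    "C\tread1\t10239\n",   # Kaiju: classified, taxon 10239 (Viruses)
    "U\tread2\t0\n",       # Kaiju: unclassified
    "C\tread3\t562\n",     # Kaiju: classified, taxon 562 (Escherichia coli)
]
kraken2_example = [
    "C\tread1\t9606\t150\t9606:116\n",  # Kraken2: human
    "C\tread2\t9606\t150\t9606:116\n",  # Kraken2: human
    "C\tread3\t562\t150\t562:116\n",    # Kraken2: E. coli
]

suspects = kaiju_hits_on_human_reads(kaiju_example, kraken2_example)
print(suspects)
```

With real outputs, pass open file handles instead of the example lists. Reads in the resulting list are only candidates for false positives; they would still need to be confirmed, for example with BLAST.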

This finding is consistent with the original Kaiju paper's assertion that Kaiju can classify up to 10 times more reads in real metagenomic data sets. However, in our case Kaiju also produced more false positives than Kraken2.
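For repeating the read-count comparison on other samples, here is a minimal sketch that pulls the Viruses count (taxon ID 10239) from each tool's summary and reports the ratio. It assumes a standard six-column Kraken2 report (percent, clade reads, direct reads, rank code, taxon ID, name) and a `kaiju2table` output with the columns `file`, `percent`, `reads`, `taxon_id`, `taxon_name`. The helper names are hypothetical, and the example lines only reuse the two counts quoted above; the other fields are placeholders.

```python
VIRUSES_TAXID = "10239"  # NCBI taxonomy ID for Viruses


def kraken2_clade_reads(report_lines, taxid=VIRUSES_TAXID):
    """Clade-level read count for a taxon in a standard Kraken2 report."""
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 6 and fields[4].strip() == taxid:
            return int(fields[1])
    return 0


def kaiju_table_reads(table_lines, taxid=VIRUSES_TAXID):
    """Read count for a taxon in a kaiju2table summary (header is skipped)."""
    for line in table_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5 and fields[3].strip() == taxid:
            return int(fields[2])
    return 0


# Placeholder lines: only the read counts (33 and 1017) come from this thread.
kraken2_report = ["  0.00\t33\t0\tD\t10239\tViruses\n"]
kaiju_table = [
    "file\tpercent\treads\ttaxon_id\ttaxon_name\n",
    "sample.kaiju.out\t0.00\t1017\t10239\tViruses\n",
]

kraken2_viral = kraken2_clade_reads(kraken2_report)
kaiju_viral = kaiju_table_reads(kaiju_table)
ratio = kaiju_viral / kraken2_viral if kraken2_viral else float("inf")
print(f"Kaiju: {kaiju_viral}, Kraken2: {kraken2_viral}, ratio: {ratio:.1f}")
```

The ratio alone says nothing about correctness, so it is best read together with a check of how many of the extra reads are false positives.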

Update: Database composition plays a very important role in classification; see "Major data analysis errors invalidate cancer microbiome findings".

To address these discrepancies, the Kaiju database should be rebuilt to include human, fungal, and plasmid sequences before the comparison is repeated.

LilyAnderssonLee commented 7 months ago

Test Kaiju on validation samples using the Kraken2Seq_kaiju database, which is built on the same sequence IDs as the Kraken2 database: k2_pluspf_20231009.