genomic-medicine-sweden / taxprofiler

Taxonomic profiling of shotgun metagenomic data
https://nf-co.re/taxprofiler
MIT License

Test Centrifuge with classification filters and compare Centrifuge to Kraken2 #28

Open LilyAnderssonLee opened 11 months ago

LilyAnderssonLee commented 11 months ago

Centrifuge uses an indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index.

Run Centrifuge and Kraken2 on the samples from clinical case #374764. In the DNA sample, we have confirmed that 9 reads were assigned to HHV7, and these reads were identified as true positives.

TO ANSWER:

1: Can we also detect HHV7 using Centrifuge, and how many reads were assigned to it?

2: Which classifier identifies more organisms?

3: Which classifier has more false positives among the assigned reads when validated with BLAST?

LilyAnderssonLee commented 11 months ago

Conclusion:

1: Kraken2 and Centrifuge assigned the same 9 reads to HHV7 and one read to HHV4.

2: Centrifuge assigned 98 reads to the Viruses category, while Kraken2 assigned 33 reads to the same category.

3: Centrifuge predicted more virus species (56) than Kraken2 (24). However, Centrifuge produced considerably more false positives, as indicated by BLAST.

4: The false positives from Centrifuge can be reduced by taking hitLength and numMatches into account in the classification. For instance, Alcelaphine herpesvirus 1 was falsely reported by Centrifuge, but the hitLength was only 23 bp, which is far too short to be reliable. I would suggest setting the minimum hitLength somewhere between 50 bp and 100 bp to reduce false positives. As for numMatches, we could skip it for now, since I am not sure how to handle species that share the same genomic regions.

5: Centrifuge assigned 9 reads to Human endogenous retrovirus K113, which were not reported by Kraken2. These 9 reads were confirmed by BLAST.
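As a rough illustration of the hitLength filtering suggested in point 4, here is a minimal Python sketch. It assumes the per-read Centrifuge classification output is a tab-separated file with a header line containing the columns readID, seqID, taxID, score, 2ndBestScore, hitLength, queryLength and numMatches. The cutoff of 50 bp is just the lower end of the range proposed above, the function names are hypothetical, and the example rows are made up for illustration (they are not data from the clinical case).

```python
import csv
import io
from collections import Counter

# Hypothetical cutoff: lower end of the 50-100 bp range suggested in point 4.
MIN_HIT_LENGTH = 50


def filter_by_hit_length(handle, min_hit_length=MIN_HIT_LENGTH):
    """Yield classification rows whose hitLength is at least min_hit_length.

    Assumes `handle` is a tab-separated Centrifuge per-read output
    with a header line that includes a `hitLength` column.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if int(row["hitLength"]) >= min_hit_length:
            yield row


def reads_per_taxon(rows):
    """Count distinct read IDs per taxID among the given rows."""
    seen = set()
    counts = Counter()
    for row in rows:
        key = (row["readID"], row["taxID"])
        if key not in seen:
            seen.add(key)
            counts[row["taxID"]] += 1
    return counts


# Made-up example rows, only to show the expected layout.
EXAMPLE = (
    "readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches\n"
    "read_1\tseq_A\t1001\t7225\t0\t100\t150\t1\n"
    "read_2\tseq_A\t1001\t2025\t0\t60\t150\t1\n"
    "read_3\tseq_B\t2002\t64\t0\t23\t150\t1\n"
    "read_4\tseq_C\t3003\t64\t64\t23\t150\t2\n"
    "read_4\tseq_D\t3004\t64\t64\t23\t150\t2\n"
)

# Keep only rows that pass the hitLength cutoff, then count reads per taxon.
kept = list(filter_by_hit_length(io.StringIO(EXAMPLE)))
counts = reads_per_taxon(kept)

for taxid, n_reads in sorted(counts.items()):
    print(f"taxID {taxid}: {n_reads} read(s) with hitLength >= {MIN_HIT_LENGTH}")
```

In this toy input, the rows with a 23 bp hit are dropped, which is the kind of hit that led to the false Alcelaphine herpesvirus 1 call. numMatches is deliberately left unfiltered here, in line with the suggestion to skip it for now.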

sofstam commented 1 month ago

During the ENNGS workshop, it was mentioned that one lab uses 5000 as a quality filter.

LilyAnderssonLee commented 1 month ago

Good to know. I think we need to test all the parts of taxprofiler that we use with simulated data. I am wondering whether we should add this to the taxprofiler validation report.

sofstam commented 1 month ago

Since we have not tested this metric, I would suggest waiting until the next version of the validation.

LilyAnderssonLee commented 1 month ago

Yes, that makes sense. We need to do this validation on simulated data, both to improve the performance of taxprofiler and to meet the requirements of IVDR, perhaps sometime in late autumn or winter, after the major release of taxprofiler for long reads.