DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

issue with a specific taxon (ID: 853) #192

Closed Aiswarya-prasad closed 3 years ago

Aiswarya-prasad commented 4 years ago

I found that an important taxon (853), which is widely reported in many studies, is missing from the Centrifuge report.

In the Centrifuge output, the reads that Kraken2 had classified as taxID 853 are classified (2176 of the 4994 classified by Kraken2) as shown below:

readID                                  seqID        taxID    score  2ndBestScore  hitLength  queryLength  numMatches
aa1dca38-5898-4d2d-8915-d5992c0abf18    cid|84030    84030    4723   0             183        468          1
af98e172-9b92-4713-82fa-91cff0219f14    cid|187327   187327   829    0             67         29854        1
c477519c-eb11-424a-b6f4-59ff07a4f27c    cid|84030    84030    2034   0             148        637          1
e7b82fea-5cf0-4bad-97e2-28f09ed19cb6    cid|1834198  1834198  1165   0             104        1011         1
23479fbe-447e-42bc-8218-6dbd30f1fc5a    cid|39485    39485    9169   2601          137        22530        1
d86cf3bf-344e-4998-901d-5d07a54a5ad1    cid|301301   301301   1733   1024          85         580          1
5ddfae2e-acd8-43b1-be9e-1c74672861ec    cid|84030    84030    2197   0             95         678          1
726dba41-0a06-46e6-9d6a-13d211924e6e    cid|39491    39491    2245   1022          97         550          1

I could not find this taxon (853) in the database by using centrifuge-inspect and grep on the output.
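For reference, a plain grep for 853 also matches longer IDs such as 8530 or 1853, so an exact match on the taxID field is a more reliable check. The sketch below is only an illustration: it assumes the name table was dumped to text (for example with the `--name-table` option of centrifuge-inspect) and that each line is laid out as taxID, a tab, then the name. The function name and the sample rows are hypothetical.

```python
def names_for_taxon(lines, tax_id):
    """Return the names listed for exactly this taxID.

    Assumes each line is 'taxID<TAB>name' (an assumption about the dump layout).
    """
    wanted = str(tax_id)
    names = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Compare the whole first field, so 8530 or 1853 do not match 853.
        if fields[0].strip() == wanted:
            names.append(fields[1].strip() if len(fields) > 1 else "")
    return names


# Made-up rows, only to show the exact-match behaviour.
sample = [
    "8530\tExample taxon A\n",
    "853\tExample taxon B\n",
    "1853\tExample taxon C\n",
]
print(names_for_taxon(sample, 853))
```

With a real dump, the same function can be called on an open file handle, e.g. `names_for_taxon(open("names.tsv"), 853)`, where `names.tsv` is a placeholder file name. An empty list would mean the taxID itself is not in the name table.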

These matches seem to have good scores and hit lengths, but they do not agree with Kraken2. Does this mean that they should be disregarded? I understand that it may not be easy to compare two tools like this, especially since different databases are involved. Still, this leaves me in a tough spot: I am unable to decide which results to go with, especially since I know that this taxon has been widely reported by many 16S rRNA-based studies (mine is nanopore shotgun data).

This also makes me worry that the same thing may be happening with other taxa.
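One way to look at the disagreement per read is to collect the reads that Kraken2 assigned to 853 and tally which taxIDs Centrifuge gave them. The following is a minimal sketch, not part of either tool. It assumes the default tab-separated Kraken2 per-read output (status, read ID, taxID, length, k-mer mapping) and the Centrifuge column layout shown above. The function names are hypothetical, and the Kraken2 lines are illustrative placeholders built from read IDs in the table.

```python
from collections import Counter


def read_kraken2_ids(lines, tax_id):
    """Collect IDs of reads that Kraken2 classified as tax_id.

    Assumes columns: status (C/U), read ID, taxID, length, k-mer mapping.
    """
    wanted = str(tax_id)
    ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "C" and fields[2].strip() == wanted:
            ids.add(fields[1])
    return ids


def tally_centrifuge(lines, read_ids):
    """Count Centrifuge taxIDs for the given reads.

    A read reported on several lines is counted once per line.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] == "readID":
            continue  # skip the header and malformed lines
        if fields[0] in read_ids:
            counts[fields[2]] += 1
    return counts


# Illustrative Kraken2 lines (the k-mer column is a placeholder).
kraken_lines = [
    "C\taa1dca38-5898-4d2d-8915-d5992c0abf18\t853\t468\t853:1\n",
    "C\tc477519c-eb11-424a-b6f4-59ff07a4f27c\t853\t637\t853:1\n",
    "U\tsome-unclassified-read\t0\t500\t0:1\n",
]
# Centrifuge rows taken from the table above.
centrifuge_lines = [
    "readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches\n",
    "aa1dca38-5898-4d2d-8915-d5992c0abf18\tcid|84030\t84030\t4723\t0\t183\t468\t1\n",
    "c477519c-eb11-424a-b6f4-59ff07a4f27c\tcid|84030\t84030\t2034\t0\t148\t637\t1\n",
    "e7b82fea-5cf0-4bad-97e2-28f09ed19cb6\tcid|1834198\t1834198\t1165\t0\t104\t1011\t1\n",
]

ids = read_kraken2_ids(kraken_lines, 853)
counts = tally_centrifuge(centrifuge_lines, ids)
print(counts.most_common())
```

Looking up the most frequent taxIDs from such a tally in the taxonomy can show whether Centrifuge is placing these reads in a related lineage or somewhere unrelated, which may help when deciding how much weight to give either result.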

This is an issue with centrifuge-1.0.3-beta.

mourisl commented 4 years ago

Which centrifuge index are you using?

Aiswarya-prasad commented 4 years ago

It's the p_compressed+h+v index from

ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz

mourisl commented 4 years ago

That index might be outdated. Can you try the newer one created by other researchers such as: https://zenodo.org/record/3732127/files/h+p+v+c.tar.gz?download=1 ?

Aiswarya-prasad commented 4 years ago

> That index might be outdated. Can you try the newer one created by other researchers such as: https://zenodo.org/record/3732127/files/h+p+v+c.tar.gz?download=1 ?

I will try this. Thank you. Where can I find more information about this index?