DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

unable to prevent least common ancestor classification #174

Closed by mmp3 5 years ago

mmp3 commented 5 years ago

I am trying to prevent least common ancestor classification altogether and instead have centrifuge report relative abundance only for individual strains/genomes ("leaf" level). I understand that in many cases this will be very slow, but in certain scenarios it is what I need.

I tried option -a to report all hits, but the report file still reports abundance at the species level for some taxa. I also tried option -k 100000, with the same result.

For example, I always get an abundance for the species Escherichia coli (taxonomy ID = 562), and then also get relative abundances for some E. coli strains (leaf). I don't want any read classifications to be promoted to the species level. I want only leaf-level labels for each read, even if that means a read gets hundreds of leaf labels.

How do I force centrifuge to record all possible mapping positions for each read, so that the relative abundances in the report are only for leaves, not for species or anything higher?

mmp3 commented 5 years ago

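Resolved. This behavior is caused by using the RefSeq-based p+h+v database. In that database many genomes are assigned the same taxonomy ID, as can be seen in the output of centrifuge-inspect --conversion-table, so hits to those genomes are reported under the shared ID (often species level) rather than under a distinct leaf-level ID.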
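For anyone who wants to check their own index for this, here is a minimal sketch in Python that finds taxonomy IDs shared by more than one sequence. It assumes the conversion table is plain text with one sequence per line, in the form sequence ID, whitespace, taxonomy ID; I have not verified that layout against every centrifuge version. The function name `shared_tax_ids` and the sample sequence IDs are made up for illustration.

```python
from collections import defaultdict


def shared_tax_ids(lines):
    """Return {tax_id: [seq_ids]} for taxonomy IDs used by more than one sequence.

    Assumes each line looks like "<sequence ID><whitespace><taxonomy ID>".
    """
    by_tax = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        seq_id, tax_id = fields[0], fields[1]
        by_tax[tax_id].append(seq_id)
    # Keep only taxonomy IDs that more than one sequence maps to
    return {tax: seqs for tax, seqs in by_tax.items() if len(seqs) > 1}


# Hypothetical sample lines, not taken from a real index
sample = ["seqA\t562", "seqB\t562", "seqC\t83333"]
for tax_id, seq_ids in shared_tax_ids(sample).items():
    print(tax_id, len(seq_ids), ",".join(seq_ids))
```

To run it on a real index, save the output of centrifuge-inspect --conversion-table to a file and pass the open file object to `shared_tax_ids` in place of `sample`. Note that one genome can contribute several sequences (a chromosome plus plasmids, for example), so a shared taxonomy ID here is a hint that genomes may be collapsed, not proof of it.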