DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

Sequences seem to be erroneously classified as Taxonomy ID 1 (root) when -k 1 option is used. #195

Open ryotag opened 3 years ago

ryotag commented 3 years ago

Hello,

When I used a very large value for -k, e.g., -k 1000, I got the following classification result for a sequence.

readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
seq1 NZ_CP009773.1 36549 43046721 43046721 6576 12848 7
seq1 NZ_CP010363.1 36549 43046721 43046721 6576 12848 7
seq1 NZ_CP006926.1 36549 43046721 43046721 6576 12848 7
seq1 NC_025131.1 36549 43046721 43046721 6576 12848 7
seq1 NZ_CP009776.1 36549 43046721 43046721 6576 12848 7
seq1 NZ_CP015501.1 36549 43046721 43046721 6576 12848 7
seq1 KP345882.1 36549 43046721 43046721 6576 12848 7

However, when I used -k 1, the sequence was not classified as taxid 36549 but as taxid 1 (root).

readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
seq1 no rank 1 43046721 0 6576 12848 1

This problem seems to occur only when there are multiple hits for taxid 36549: the taxid was assigned correctly when only one hit was reported for a sequence, as shown below.

readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
seq2 NZ_CP007734.1 36549 20857104 0 4842 6607 1

I tried to solve the problem and found that when I modified the corresponding line of the nodes.dmp file as follows,

from

36549 | 28384 | no rank | | 0 | 0 | 11 | 1 | 0 | 1 | 0 | 0 | |

to

36549 | 28384 | species | | 0 | 0 | 11 | 1 | 0 | 1 | 0 | 0 | |

I could get a correct assignment (taxid 36549) for the read even when I used -k 1.

readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
seq1 species 36549 43046721 0 6576 12848 1
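
The nodes.dmp edit above can also be scripted. This is only a minimal sketch of the workaround, not a fix for Centrifuge itself. It assumes that fields are separated by "|" and that the rank is the third field, as in the line shown above. The helper names `set_rank` and `patch_nodes` are made up for illustration.

```python
def set_rank(line, taxid, new_rank):
    """Return `line` with its rank field replaced if it belongs to `taxid`.

    Assumes a nodes.dmp-style line: tax_id | parent tax_id | rank | ...
    Whitespace around the rank field is preserved.
    """
    fields = line.split("|")
    if len(fields) < 3 or fields[0].strip() != str(taxid):
        return line  # some other taxid (or an unexpected line): leave as-is
    old = fields[2]
    stripped = old.strip()
    # Swap only the rank text so the original padding/tabs stay intact.
    fields[2] = old.replace(stripped, new_rank, 1) if stripped else new_rank
    return "|".join(fields)


def patch_nodes(src_path, dst_path, taxid, new_rank):
    """Copy a nodes.dmp file to dst_path, changing the rank of one taxid."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(set_rank(line, taxid, new_rank))
```

For example, `patch_nodes("nodes.dmp", "nodes.patched.dmp", 36549, "species")` would write a copy with the change described above. The index presumably has to be rebuilt with the patched file for the change to take effect.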

I downloaded nodes.dmp from the NCBI Taxonomy page, where taxid 36549 is "plasmids". Is this a bug? It seems that when there are multiple hits for a taxid whose rank is "no rank" and a very small value is used for -k, Centrifuge erroneously classifies the sequence as taxid 1 (root).
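
To see how many reads are affected, the -k 1 output can be compared with the output from a large -k. This is a rough sketch that assumes both outputs are tab-separated with the header shown above. The function names are hypothetical. It reports reads that are assigned to taxid 1 with -k 1 although all of their hits with the large -k share a single, non-root taxid.

```python
import csv
import io
from collections import defaultdict


def taxids_per_read(tsv_text):
    """Map each readID to the set of taxIDs reported for it."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    taxids = defaultdict(set)
    for row in reader:
        taxids[row["readID"]].add(row["taxID"])
    return taxids


def suspicious_root_reads(k1_text, large_k_text):
    """Reads that are root with -k 1 but have one non-root taxid otherwise."""
    k1 = taxids_per_read(k1_text)
    large_k = taxids_per_read(large_k_text)
    flagged = []
    for read_id, taxids in k1.items():
        others = large_k.get(read_id, set())
        # All hits agree on one taxid, yet -k 1 reported root (taxid 1).
        if taxids == {"1"} and len(others) == 1 and others != {"1"}:
            flagged.append(read_id)
    return sorted(flagged)
```

The file contents would be read with `open(path).read()` and passed in as text; with the two results for seq1 and seq2 above, only seq1 would be flagged.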