When I used a very large value for -k, e.g., -k 1000, I got the following classification result for a sequence.
readID  seqID          taxID  score     2ndBestScore  hitLength  queryLength  numMatches
seq1    NZ_CP009773.1  36549  43046721  43046721      6576       12848        7
seq1    NZ_CP010363.1  36549  43046721  43046721      6576       12848        7
seq1    NZ_CP006926.1  36549  43046721  43046721      6576       12848        7
seq1    NC_025131.1    36549  43046721  43046721      6576       12848        7
seq1    NZ_CP009776.1  36549  43046721  43046721      6576       12848        7
seq1    NZ_CP015501.1  36549  43046721  43046721      6576       12848        7
seq1    KP345882.1     36549  43046721  43046721      6576       12848        7
However, when I used -k 1, the sequence could not be correctly classified as taxid 36549 but was classified as taxid 1 (root).
readID  seqID    taxID  score     2ndBestScore  hitLength  queryLength  numMatches
seq1    no rank  1      43046721  0             6576       12848        1
It seems this problem occurs only when there are multiple hits for taxid 36549, because the taxid was correctly assigned for another sequence for which only one hit was reported, as shown below.
readID  seqID          taxID  score     2ndBestScore  hitLength  queryLength  numMatches
seq2    NZ_CP007734.1  36549  20857104  0             4842       6607         1
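In case it helps to check this on other reads, here is a minimal sketch that lists the reads with several hits that all share one taxID, which is the situation where I see the problem. It assumes the classification output is tab-separated with the header row shown above; the function names `summarize_hits` and `multi_hit_single_taxon` are my own and are not part of Centrifuge.

```python
import csv
import io
from collections import defaultdict


def summarize_hits(tsv_text):
    """Group tab-separated classification rows by readID.

    Assumes a header row containing at least 'readID' and 'taxID'.
    Returns {readID: {"hits": int, "taxIDs": set of str}}.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    summary = defaultdict(lambda: {"hits": 0, "taxIDs": set()})
    for row in reader:
        entry = summary[row["readID"]]
        entry["hits"] += 1
        entry["taxIDs"].add(row["taxID"])
    return dict(summary)


def multi_hit_single_taxon(summary):
    """Return the reads that have more than one hit, all with the same taxID."""
    return sorted(
        read_id
        for read_id, entry in summary.items()
        if entry["hits"] > 1 and len(entry["taxIDs"]) == 1
    )
```

Running `multi_hit_single_taxon(summarize_hits(open("out_k1000.tsv").read()))` on the output from the large -k run (the file name is only a placeholder) gives the reads whose -k 1 result is worth comparing.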
I tried to solve the problem and found that after I modified the line for taxid 36549 in the nodes.dmp file, I got the correct assignment (taxid 36549) for the read even with -k 1. The output now reports "species" where it previously reported "no rank".
readID  seqID    taxID  score     2ndBestScore  hitLength  queryLength  numMatches
seq1    species  36549  43046721  0             6576       12848        1
I downloaded nodes.dmp from the NCBI Taxonomy page, and taxid 36549 is "plasmids".
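For anyone who wants to look at the same entry, this is a small sketch for pulling the parent and rank of a taxid out of nodes.dmp. It relies on the NCBI taxdump layout, where fields are separated by tab, pipe, tab and the first three fields are tax_id, parent tax_id and rank; the helper names `parse_nodes_line` and `find_node` are hypothetical.

```python
def parse_nodes_line(line):
    """Split one nodes.dmp line into (tax_id, parent_tax_id, rank).

    Fields are delimited by '|' with surrounding tabs, so splitting on
    '|' and stripping whitespace recovers the field values.
    """
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    return fields[0], fields[1], fields[2]


def find_node(nodes_path, tax_id):
    """Return (tax_id, parent_tax_id, rank) for tax_id, or None if absent."""
    wanted = str(tax_id)
    with open(nodes_path, encoding="utf-8") as handle:
        for line in handle:
            node = parse_nodes_line(line)
            if node[0] == wanted:
                return node
    return None
```

Calling `find_node("nodes.dmp", 36549)` shows the rank that the taxonomy currently assigns to this taxid.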
Is this a bug? It seems that when there are multiple hits for a taxid whose rank is "no rank" and a very small value is used for -k, Centrifuge erroneously classifies the sequence as taxid 1 (root).