DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

No taxid when query assigned to a specific seqID (entry is in the taxonomy) #218

Closed Piplopp closed 2 years ago

Piplopp commented 2 years ago

Hello !

I'm trying to index the SILVA database for centrifuge. I built the acc2taxid map and the nodes.dmp and names.dmp files without any problem, but I noticed that some sequences were assigned a taxid of 0 during classification. I saw the same behavior when using the equivalent files produced by Kraken2.

readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
6d3e5d85b2aa35d58347bb4b9b203e43 U92195.1.1541 0 60025 60025 260 260 6
6d3e5d85b2aa35d58347bb4b9b203e43 MF457876.1.1456 0 60025 60025 260 260 6
6d3e5d85b2aa35d58347bb4b9b203e43 genus 46463 60025 60025 260 260 6
6d3e5d85b2aa35d58347bb4b9b203e43 CS600365.8.1528 0 60025 60025 260 260 6
6d3e5d85b2aa35d58347bb4b9b203e43 U88546.1.1541 0 60025 60025 260 260 6
6d3e5d85b2aa35d58347bb4b9b203e43 EU775002.1.1308 0 60025 60025 260 260 6
- - - - - - - -
cb6ee53401962a26788af21de2a16f67 AP017610.4186840.4188378 46463 169744 0 427 427 1

As you can see, for the readID 6d3e5d85b2aa35d58347bb4b9b203e43, many of the matches have a seqID assigned but a taxID of 0. I double-checked both the centrifuge-build command (no missing taxonomy ID in the output) and the various files, and everything seems to be fine.

The query readID cb6ee53401962a26788af21de2a16f67 at the end behaves as expected: the seqID has its expected taxid.

For instance, for the seqIDs U92195.1.1541 and MF457876.1.1456:
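To list every affected hit rather than spotting them by eye, the classification output can be filtered for rows whose taxID is 0. This is only a minimal sketch: it assumes the output is tab-separated with the header row shown above, and the function name `zero_taxid_hits` is made up for illustration.

```python
import csv
import io


def zero_taxid_hits(report_text):
    """Return (readID, seqID) pairs for rows whose taxID column is 0.

    Assumes a tab-separated report with a header row containing
    the columns readID, seqID and taxID.
    """
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    return [(row["readID"], row["seqID"]) for row in reader if row["taxID"] == "0"]


# Three rows copied from the output above, joined with tabs.
sample = (
    "readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches\n"
    "6d3e5d85b2aa35d58347bb4b9b203e43\tU92195.1.1541\t0\t60025\t60025\t260\t260\t6\n"
    "6d3e5d85b2aa35d58347bb4b9b203e43\tgenus\t46463\t60025\t60025\t260\t260\t6\n"
    "cb6ee53401962a26788af21de2a16f67\tAP017610.4186840.4188378\t46463\t169744\t0\t427\t427\t1\n"
)

for read_id, seq_id in zero_taxid_hits(sample):
    print(read_id, seq_id)
```

For a real run, the text would come from the classification output file instead of the `sample` string.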

acc2taxid:

U92195.1.1541   46474
MF457876.1.1456 46465

nodes.dmp

46474   |   46454   |   genus   |   -   |
46465   |   46454   |   genus   |   -   |

names.dmp

46474   |   Salmonella  |   -   |   scientific name |
46465   |   Klebsiella  |   -   |   scientific name |

And from centrifuge-inspect:

>U88546.1.1541 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella;Salmonella enterica subsp. enterica serovar Paratyphi A
>MF457876.1.1456 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Escherichia coli

I also tried replacing the dots in the sequence IDs with '_' just in case, but the result is the same. I would expect a taxid of 0 if the sequence was not found in the taxonomy or if the seqID was actually an LCA, like the third assignment to 'genus'. In these cases, however, my query was assigned to a specific seqID, so I would expect to find the related taxID.

Maybe there's something I don't fully understand, but in any case, if you have any idea of what could be happening or why, I'd be glad to hear it :)

Thanks a lot !

cjalder commented 2 years ago

Have you had any luck with this? I have a similar problem, but in my case errors about missing taxonomy IDs are reported, even though the IDs exist in the acc2id.map, nodes.dmp and names.dmp files.
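One way to confirm that every taxID in the map really has an entry in nodes.dmp is a small cross-check. The sketch below is not part of centrifuge: it assumes the map holds one whitespace-separated `seqID taxID` pair per line and that the taxID is the first `|`-separated field of each nodes.dmp line, as in the excerpts above. The function names and the `HYPOTHETICAL.1.100` entry are invented for the example.

```python
def load_map(lines):
    """Parse 'seqID<whitespace>taxID' lines into a dict; skip other lines."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            mapping[fields[0]] = fields[1]
    return mapping


def load_node_ids(lines):
    """Collect the first '|'-separated field (the taxID) of each non-empty line."""
    return {line.split("|")[0].strip() for line in lines if line.strip()}


def missing_taxids(mapping, node_ids):
    """Return the seqID -> taxID pairs whose taxID is absent from nodes.dmp."""
    return {seq: tax for seq, tax in mapping.items() if tax not in node_ids}


# The first two map entries and both node lines come from the thread;
# the third map entry is a hypothetical one with no matching node.
map_lines = [
    "U92195.1.1541\t46474",
    "MF457876.1.1456\t46465",
    "HYPOTHETICAL.1.100\t99999",
]
node_lines = [
    "46474\t|\t46454\t|\tgenus\t|\t-\t|",
    "46465\t|\t46454\t|\tgenus\t|\t-\t|",
]

missing = missing_taxids(load_map(map_lines), load_node_ids(node_lines))
print(missing)
```

With real files, the lists would be replaced by open file handles. An empty result only shows that the IDs are present; it does not rule out formatting problems elsewhere in the files.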

sanderdebacker commented 2 years ago

Same problem here, with a custom-built database. The seqID gets assigned, but without a taxID, even though it's present in the .map and .dmp files. Haven't found a solution or explanation yet.

Piplopp commented 2 years ago

I noticed that if you set the -k option high enough, the taxids appear even for the seqIDs that were previously problematic (in my case I tried with -k 1000), but I have not managed to understand what's happening so far.

cjalder commented 2 years ago

When inspecting my .map file, I noticed a number of formatting issues in it, such as line breaks and merges within a taxID/seqID (a quick way for me to check was to see if any lines started with a number). I was able to resolve my issue by creating a .map file manually. Hope it helps anyone out there!
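A check of this kind could be scripted roughly as follows. This is a sketch under assumptions, not a centrifuge tool: it expects exactly two whitespace-separated fields per line (seqID, then an integer taxID) and, following the heuristic above, treats a line that starts with a digit as suspicious, which would give false positives for databases whose sequence IDs legitimately begin with a number. The function name `malformed_map_lines` is hypothetical.

```python
def malformed_map_lines(lines):
    """Return (line_number, reason) pairs for suspicious map lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 2:
            # Broken or merged records change the number of fields.
            problems.append((number, "expected 2 fields, found %d" % len(fields)))
        elif fields[0][:1].isdigit():
            # Heuristic from the thread: a seqID should not start with a number.
            problems.append((number, "line starts with a number"))
        elif not fields[1].isdigit():
            problems.append((number, "taxID is not an integer"))
    return problems


# Lines 2 and 3 simulate one record broken across two lines;
# line 4 simulates two records merged into one.
sample_lines = [
    "U92195.1.1541\t46474",
    "MF457876.1.1456\t464",
    "65",
    "U92195.1.1541\t46474MF457876.1.1456\t46465",
]

for number, reason in malformed_map_lines(sample_lines):
    print(number, reason)
```

Note that the truncated first half of a broken record (line 2 in the sample) still looks valid on its own, so the line just before any reported line is worth inspecting as well.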

sanderdebacker commented 2 years ago

Did some steps again and everything seems the same, but the problem has been resolved.

Either one of them, or a combination caused the problem to be resolved.

Piplopp commented 2 years ago

Did the exact same thing and the problem has been resolved. What happened is still a mystery.