Closed bachev closed 6 years ago
Hi Bastien,
It appears that particular sequence does not have a taxonomy ID associated with its GI number in the taxonomy data that Kraken downloads. As it is now, the lack of a taxonomy ID for that sequence is causing exact matches with the sequence to not be classified by Kraken. Now that I think about it, I probably should have Kraken associate these types of sequences with taxid 1 (root), not taxid 0 (unclassified). I will probably do that for the next release.
This particular sequence does seem to have a taxonomy ID (645657) when looking at the actual .gbk file, but kraken-build
doesn't download these in order to save download time and storage space. I'm not sure why NCBI hasn't associated this sequence's GI number with that taxonomy number in the gi_taxid_nucl.dmp
file that Kraken uses. I'll have to look into that a bit more to see if there's a more preferred way to associate sequences with taxonomy information.
Derrick
Ah, things get clearer. Indeed, the GI in that FASTA file has no taxonomy node ID associated with it. Funny thing, the other .fna file in the same directory does have a GI with an association. Probably a bug in the NCBI db or FASTA building process. Would you report it (you may have a bit more weight than I do)?
Regarding Kraken association: I think the classes important for a user could be a) sequence associated to a classified organism b) sequence associated to an unclassified organism c) sequence not associated
In a sense the output is a bit misleading as Kraken currently has only categories a) and c). Having this b) category would help to distinguish cases like the above "your sequence is known, but I have no way to tell you from which organism." Keeping that in taxid 0 would be the right way to do it, because having them in taxid 1 normally means "sheesh, present in all kingdoms of life".
I suppose I'd keep track of sequences which have absolutely no hit with taxid -1 (or however you want to implement this) and output that as "sequence not associated".
B.
I've notified NCBI about this particular issue - we'll see how things go there.
As for the proper classification of these kinds of sequences, I'll have to give it a bit more thought if the taxid 1 solution isn't appropriate. Perhaps a different classification code in the first column of the output (e.g., adding O in addition to C/U) while maintaining the taxid 0 behavior would be best. There's also a taxid 12908 ("unclassified sequences"), but I don't know that I feel comfortable hard-coding a non-0/1 ID into the main Kraken code, especially when that particular ID might be changed by a future NCBI decision.
Hi Derrick,
I was testing Kraken on some sequences when I noted that Kraken completely misses some sequences which have a 100% identity match in the database.
Version: zip file master branch yesterday (18.05.2015) Database building: standard (as shown in the manual), full bacteria, no database shrinkage
Example:
to which kraken says
whereas performing a simple fasta36 search confirms that the data should be in the database:
I first suspected that that this is a bug with organisms containing multiple FASTA files (Natto has 2 .fna files), but a quick check with another bacterium having 6 .fna files found the sequence of all 6 test cases.
Any idea what's causing this bug?
Best, Bastien