DerrickWood / kraken

Kraken taxonomic sequence classification system
http://ccb.jhu.edu/software/kraken/
GNU General Public License v3.0
212 stars 104 forks source link

Kraken does not classify some 100% matching sequences #20

Closed bachev closed 6 years ago

bachev commented 9 years ago

Hi Derrick,

I was testing Kraken on some sequences when I noted that Kraken completely misses some sequences which have a 100% identity match in the database.

Version: zip file master branch yesterday (18.05.2015) Database building: standard (as shown in the manual), full bacteria, no database shrinkage

Example:

>from_BacSu_Natto_195
CTTGAATGGATGCAGCAACTTGAGTGTCAGTAGACGTAACATCAATGTCGCAAGAATCTC
TAACGATAATCATTTCTTCAGATGTTTGTTTGTTAAGGCTTAGCTGATCAAAATCTTGAA
ATGCGTTAGCATCTTCTCCTCCAGAACAAACGTCATGACAAACCGCTTCTCTTTTATAGC
CGCTACCCGGGTGTGTACATGAATTATCTAGGGCAACCCATGAGTAAGGTCTTGATTCCA
TTTGTTTTCCTCCTTTCGTTTACTTTAGTAGATGATGACAGTTCATATTTTGTATAAGCA
AATATCATGGTTTTAGATACCTATTTTAAATATTATCGTTTTTTCTTTGACGTACGCGAT
CTCTCAGTGTTGTTCGACGTCTTTTTTGCGGCGCTGCTTCTTGTTCTGCTTTTTCAGCAT
CTGCCTTTTTCTTATTGACTCTTTCAACAAATTCGTTAGCCTGTTTTTGAAGGTCATTTA
CCATTGTAATAATCGCTTGTTTTTGAGCTTCTGATAAGTTGCGTTTTTTATCTTCTGTTA
CATTGTTTTTGTTCAGTACATCTGAAACAAGAAGTTTCGTTAGGGGGTCGATATTTGCTT
CTAGTTCTTCTTTGTTTTTTTGCAGTGCATCTTTTTCTTCGGACATGTTATTCACCTCGA
CTTCTATAAAATTAGAAAGAAAGGGCTACAGAATGTCAGCTTCTGTTATAAGAGCGATTA
ATGTTTGGGTTAAGAGCTGAACGGATGCCATGATGTCTGTAGCTGATAAAGTGATTGTGA
TATTATAGCAGTTTATAATTTTGATTTTTACTTTCTTTCGGCTATGTGCTGTAGAGCGTG
CTATCAGATCACTCGCAACAATGTCTGCTATATCGCTATCGCTGATAAGAAGCTGGGTTG
TAACAGTGATTAAAGCTGAAACTGCTGATTGCAATGAAAGTGCAAATGTGGTGTCTGATT

to which kraken says

$ kraken --db /scratch1/tmp/bachtmp/krakentest/bacteria bug.fasta >kraoutbug.txt
1 sequences (0.00 Mbp) processed in 0.001s (76.1 Kseq/m, 73.10 Mbp/m).
  0 sequences classified (0.00%)
  1 sequences unclassified (100.00%)

whereas performing a simple fasta36 search confirms that the data should be in the database:

# fasta36 bug.fasta ../krakentest/bacteria/library/Bacteria/Bacillus_subtilis_natto_BEST195_uid183001/NC_017196.fna
FASTA searches a protein or DNA sequence data bank
 version 36.3.4 May, 2011
Please cite:
 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

Query: bug.fasta
  1>>>from_BacSu_Natto_195 - 960 nt
Library: ../krakentest/bacteria/library/Bacteria/Bacillus_subtilis_natto_BEST195_uid183001/NC_017196.fna
  4091591 residues in     1 sequences

The best scores are:                                      opt bits E(28)
gi|428277412|ref|NC_017196.1| Bacillus subtili (4091591) [f] 4800 523.7 8.6e-149
...
>>gi|428277412|ref|NC_017196.1| Bacillus subtilis subsp.  (4091591 nt)
 initn: 4800 init1: 4800 opt: 4800  Z-score: 2729.5  bits: 523.7 E(28): 8.6e-149
banded Smith-Waterman score: 4800; 100.0% identity (100.0% similar) in 960 nt overlap (1-960:1240841-1241800)

                                             10        20        30
from_B                               CTTGAATGGATGCAGCAACTTGAGTGTCAG
                                     ::::::::::::::::::::::::::::::
gi|428 CAATTGTAATGACAGCAGTTTGGAGGGCGGCTTGAATGGATGCAGCAACTTGAGTGTCAG
          1240820   1240830   1240840   1240850   1240860   1240870

               40        50        60        70        80        90
from_B TAGACGTAACATCAATGTCGCAAGAATCTCTAACGATAATCATTTCTTCAGATGTTTGTT
       ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|428 TAGACGTAACATCAATGTCGCAAGAATCTCTAACGATAATCATTTCTTCAGATGTTTGTT
          1240880   1240890   1240900   1240910   1240920   1240930

I first suspected that that this is a bug with organisms containing multiple FASTA files (Natto has 2 .fna files), but a quick check with another bacterium having 6 .fna files found the sequence of all 6 test cases.

Any idea what's causing this bug?

Best, Bastien

DerrickWood commented 9 years ago

Hi Bastien,

It appears that particular sequence does not have a taxonomy ID associated with its GI number in the taxonomy data that Kraken downloads. As it is now, the lack of a taxonomy ID for that sequence is causing exact matches with the sequence to not be classified by Kraken. Now that I think about it, I probably should have Kraken associate these types of sequences with taxid 1 (root), not taxid 0 (unclassified). I will probably do that for the next release.

This particular sequence does seem to have a taxonomy ID (645657) when looking at the actual .gbk file, but kraken-build doesn't download these in order to save download time and storage space. I'm not sure why NCBI hasn't associated this sequence's GI number with that taxonomy number in the gi_taxid_nucl.dmp file that Kraken uses. I'll have to look into that a bit more to see if there's a more preferred way to associate sequences with taxonomy information.

Derrick

bachev commented 9 years ago

Ah, things get clearer. Indeed, the GI in that FASTA file has no taxonomy node ID associated with it. Funny thing, the other .fna file in the same directory does have a GI with an association. Probably a bug in the NCBI db or FASTA building process. Would you report it (you may have a bit more weight than I do)?

Regarding Kraken association: I think the classes important for a user could be a) sequence associated to a classified organism b) sequence associated to an unclassified organism c) sequence not associated

In a sense the output is a bit misleading as Kraken currently has only categories a) and c). Having this b) category would help to distinguish cases like the above "your sequence is known, but I have no way to tell you from which organism." Keeping that in taxid 0 would be the right way to do it, because having them in taxid 1 normally means "sheesh, present in all kingdoms of life".

I suppose I'd keep track of sequences which have absolutely no hit with taxid -1 (or however you want to implement this) and output that as "sequence not associated".

B.

DerrickWood commented 9 years ago

I've notified NCBI about this particular issue - we'll see how things go there.

As for the proper classification of these kinds of sequences, I'll have to give it a bit more thought if the taxid 1 solution isn't appropriate. Perhaps a different classification code in the first column of the output (e.g., adding O in addition to C/U) while maintaining the taxid 0 behavior would be best. There's also a taxid 12908 ("unclassified sequences"), but I don't know that I feel comfortable hard-coding a non-0/1 ID into the main Kraken code, especially when that particular ID might be changed by a future NCBI decision.