DerrickWood / kraken

Kraken taxonomic sequence classification system
http://ccb.jhu.edu/software/kraken/
GNU General Public License v3.0

Assigning taxonomy from NCBI genomes downloads #39

Closed EricSalomaki closed 7 years ago

EricSalomaki commented 8 years ago

I recently downloaded all genomes available on NCBI in hopes of building a database that is as comprehensive as possible for assigning taxonomy in a largely unknown eukaryote/prokaryote metagenomic dataset. However, it appears that GI numbers are no longer assigned to the files downloaded from NCBI, and the headers look like this:

>NC_008801.1 Monodelphis domestica chromosome 1, MonDom5, whole genome shotgun sequence
AtcctcccccccaccaccaccccagcATGCAGGCCGCCACCATCTTATCCACCAGGCCGCCCCGGTGCGTGGC

rather than

>gi|701219395|ref|NC_025403.1| Achimota virus 1, complete genome
ACCAGAGGGAAAATATAACAATGTCGTTTTATAGCGATGTAAATAATACTTATGTAGGCCCGAAAGTGC

I noticed in a previous issue that you mentioned you are working on a better solution that will allow inclusion of sequences that have accession numbers but no GI numbers, so hopefully that will fix this down the road. In the meantime, do you have any ideas for a workaround? I would like to avoid individually assigning taxonomy to the 60,000+ genomes that I am trying to build into a database. Any thoughts are greatly appreciated.

Best, Eric

tseemann commented 7 years ago

NCBI is phasing out GI numbers already. Only accession numbers will be supported. Also, only new species will get a taxid - new "strains" will not.

This will require re-engineering the Kraken code base, or moving to different software.

jenniferlu717 commented 7 years ago

The code has been updated to account for the phasing out of GI numbers. If the updated code has any additional issues, please open a new issue.
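
For anyone who still has accession-only headers like the ones in the original post, here is a minimal sketch of one possible workaround. It is not the change that went into Kraken. It assumes you have a tab-separated mapping file laid out like NCBI's accession2taxid files (columns: accession, accession.version, taxid, gi), and that your Kraken version honors the `kraken:taxid|XXX` tag in sequence IDs as described in the Kraken manual. Please check both before relying on it. The function names and the accession regex are made up for illustration and will not cover every accession format.

```python
import re
from typing import Dict, Iterable, Iterator, Optional

# Rough pattern for accession.version strings such as NC_008801.1 or CP012345.1.
# This is an assumption for illustration, not an official NCBI grammar.
ACCESSION_RE = re.compile(r"(?:[A-Z]{2}_[A-Z]{0,6}|[A-Z]{1,6})\d+\.\d+")


def extract_accession(header: str) -> Optional[str]:
    """Return accession.version from a FASTA header in either style, or None."""
    tokens = header.lstrip(">").split()
    if not tokens:
        return None
    # Old style IDs look like gi|701219395|ref|NC_025403.1|, new style is just
    # NC_008801.1, so splitting on "|" handles both.
    for field in tokens[0].split("|"):
        if ACCESSION_RE.fullmatch(field):
            return field
    return None


def load_accession2taxid(lines: Iterable[str]) -> Dict[str, int]:
    """Build {accession.version: taxid} from accession2taxid-style rows."""
    mapping: Dict[str, int] = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3 or cols[0] == "accession":  # skip header/short rows
            continue
        mapping[cols[1]] = int(cols[2])
    return mapping


def tag_headers(fasta_lines: Iterable[str],
                acc2taxid: Dict[str, int]) -> Iterator[str]:
    """Yield FASTA lines, appending |kraken:taxid|N to IDs with a known taxid."""
    for line in fasta_lines:
        if line.startswith(">"):
            acc = extract_accession(line)
            taxid = acc2taxid.get(acc) if acc else None
            if taxid is not None:
                seq_id, _, rest = line[1:].rstrip("\n").partition(" ")
                tagged = f">{seq_id.rstrip('|')}|kraken:taxid|{taxid}"
                line = tagged + (f" {rest}" if rest else "") + "\n"
        yield line
```

Headers whose accession is missing from the mapping are passed through unchanged, so it is worth counting those afterwards to see how many of the genomes still lack a taxid. For a full accession2taxid file, a plain dict may use a lot of memory, so filtering the mapping to the accessions you actually have is probably a good first step.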