DerrickWood / kraken

Kraken taxonomic sequence classification system
http://ccb.jhu.edu/software/kraken/
GNU General Public License v3.0

Silva and Greengenes support #14

Open rgiannico opened 9 years ago

rgiannico commented 9 years ago

Hi Derrick, we talked about this topic in 2014 and I know you are working on it. I'm very excited to see that you released a new Kraken version! I really think Kraken could be one of the best solutions for metabarcoding/metagenomics analysis. Are you planning to release a guide or a script to create a Kraken database from Silva, Greengenes or other custom databases?

DerrickWood commented 9 years ago

Hi Riccardo,

I am indeed planning to add an option to kraken-build that would set up at least a Greengenes database, and probably Silva as well. 16S DB support is on the TODO list on my office whiteboard. :)

I'll see what I can do with this today - I think the original test DBs were lost due to some HD failures, but I don't think it's too difficult to redo that work, especially with the latest Kraken's support for direct assignment of taxa to sequences.

igra666 commented 8 years ago

Hi Derrick, I'm also looking for an option to use the SILVA database with Kraken. Did you make it available anywhere? That would be great! Regards

pbuendia commented 8 years ago

Hi, I too would like to have access to a Greengenes database to use with Kraken. Will there be one soon? Pat

bengouts commented 6 years ago

Hi Jennifer, Hi Derrick, I am doing taxonomic classification on 16S data, so it seems stupid to use a RefSeq-based Kraken DB. Is the '16s-dev' branch that you started in 2014/15 to build a Kraken DB from Greengenes, Silva or RDP ready to be used? I guess the answer is no, otherwise I don't get why you didn't merge it into master and why these options are not available in the recent releases. Do you plan to work on that in the future? I understand you guys are more interested in dealing with shotgun data, but your algorithm is also very exciting for padawans working on 16S data! Best regards, Benoit

your-highness commented 6 years ago

Dear @BenoitGoutorbe ,

We are also using k-mer matching for 16S V3-V4-based classification in our lab. We use the NCBI curated RefSeq Targeted Loci project FASTA files from https://www.ncbi.nlm.nih.gov/refseq/targetedloci/ and build a Kraken DB out of these. Ideally, you restrict the FASTA files to your amplicon sequences.
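For anyone who wants to try the "restrict the FASTA files to your amplicon sequences" step, here is a minimal Python sketch of an in-silico primer trim. It is only an illustration, not the method used in our lab: the function names (`read_fasta`, `extract_amplicon`) and the toy primers are hypothetical, and it only handles exact matches of IUPAC-degenerate primers (no mismatches, no indels).

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a DNA string, including IUPAC ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Compile a (possibly degenerate) primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def extract_amplicon(seq, fwd_primer, rev_primer):
    """Return the region between the two primers (primers excluded).

    Returns None if either primer has no exact match in `seq`.
    """
    seq = seq.upper()
    fwd = primer_regex(fwd_primer).search(seq)
    if fwd is None:
        return None
    # The reverse primer anneals to the opposite strand, so we look for
    # its reverse complement downstream of the forward primer.
    rev = primer_regex(reverse_complement(rev_primer)).search(seq, fwd.end())
    if rev is None:
        return None
    return seq[fwd.end():rev.start()]


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy data: the primers and the sequence are made up for illustration.
toy_fasta = [">ref1 toy 16S", "TTTTACGTACGGGG", "CCCCTGTCAAAAAA"]
for header, seq in read_fasta(toy_fasta):
    amplicon = extract_amplicon(seq, "ACGNAC", "TTGACA")
    if amplicon is not None:
        print(f">{header}\n{amplicon}")
```

With real data you would pass your own primer pair and write the trimmed records to a new FASTA file before building the database. References where a primer does not match exactly are dropped by this sketch, so check how many sequences survive.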

All the best

rfm-targa commented 6 years ago

Hello everyone,

It's possible to build Greengenes and SILVA databases for Kraken, but in order to maintain their original taxonomy I had to create custom names.dmp and nodes.dmp files for each of those 16S databases. The sequence headers also have to be formatted as explained in the Kraken manual so that the DB builds without problems.

I have a repository with the process to build a Greengenes 13.5 database (full file). While the steps I describe in the repository should work just fine, I would like to update the repository in the near future to include a faster process that also works for Greengenes 13.8 and SILVA. I currently have both databases for Kraken. Check the repo if you want; it might help if you really want to adapt those databases.

Anyway, just as @your-highness said, the NCBI Targeted Loci project is also a good option. I've also used those sequences with Kraken and they give good results. It's a small file with around 20K sequences, but they are all annotated to species rank and cover a lot of species. Greengenes and SILVA aren't good if you want to classify at species level: SILVA sequences are annotated at genus level at most (anything at species level isn't really 'correct'), and Greengenes hasn't been updated in a long time and has only around 637 species represented.
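To make the custom names.dmp / nodes.dmp idea more concrete, here is a rough Python sketch. It is not the code from my repository. It assumes Greengenes-style lineage strings (`k__...; p__...; ...`), assigns arbitrary taxids starting from a root of 1, and writes only the first columns of the NCBI dump format with `-` as a filler column, on the assumption that Kraken reads just those leading fields. All function names are hypothetical. The header uses the `kraken:taxid|XXX` form described in the Kraken manual.

```python
# Rank prefixes used in Greengenes-style lineage strings.
RANKS = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def build_taxonomy(lineages):
    """Turn {seq_id: 'k__Bacteria; p__...'} into a small taxonomy.

    Returns (nodes, names, seq_to_taxid), where
      nodes maps taxid -> (parent_taxid, rank),
      names maps taxid -> scientific name,
      seq_to_taxid maps each sequence to its deepest named taxon.
    """
    taxid_of = {(): 1}              # lineage path -> taxid; () is the root
    nodes = {1: (1, "no rank")}     # the root is its own parent
    names = {1: "root"}
    seq_to_taxid = {}
    for seq_id, lineage in lineages.items():
        path, parent = (), 1
        for level in lineage.split(";"):
            level = level.strip()
            prefix, _, name = level.partition("__")
            if not name:            # unnamed rank such as 's__': stop here
                break
            path = path + (level,)
            if path not in taxid_of:
                taxid = len(taxid_of) + 1
                taxid_of[path] = taxid
                nodes[taxid] = (parent, RANKS.get(prefix, "no rank"))
                names[taxid] = name
            parent = taxid_of[path]
        seq_to_taxid[seq_id] = parent
    return nodes, names, seq_to_taxid


def nodes_dmp_lines(nodes):
    """Format nodes as 'taxid | parent | rank | - |' lines (tab-delimited)."""
    return [f"{t}\t|\t{p}\t|\t{r}\t|\t-\t|\n" for t, (p, r) in sorted(nodes.items())]


def names_dmp_lines(names):
    """Format names as 'taxid | name | - | scientific name |' lines."""
    return [f"{t}\t|\t{n}\t|\t-\t|\tscientific name\t|\n" for t, n in sorted(names.items())]


def kraken_header(seq_id, taxid):
    """FASTA header with the taxid embedded directly in the sequence ID."""
    return f">{seq_id}|kraken:taxid|{taxid}"


# Toy lineages, made up for illustration.
toy = {
    "seqA": "k__Bacteria; p__Firmicutes; c__Bacilli; o__; f__; g__; s__",
    "seqB": "k__Bacteria; p__Firmicutes; c__Clostridia",
}
nodes, names, seq_to_taxid = build_taxonomy(toy)
print("".join(nodes_dmp_lines(nodes)), end="")
print("".join(names_dmp_lines(names)), end="")
for seq_id, taxid in seq_to_taxid.items():
    print(kraken_header(seq_id, taxid))
```

Keying taxids on the whole lineage path rather than on the name alone keeps two taxa with the same name under different parents from being merged. Please check the exact column layout against your Kraken version before relying on it.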

Best regards

bengouts commented 6 years ago

Thanks a lot for this precious help. I was not aware of this Targeted Loci database from RefSeq and it's exactly what I needed. I built my Kraken database from it within 10 minutes (8 threads, 64 GB of RAM) and it classifies my reads very well (about 99.5% to the phylum level and 80% to the species level for the few samples I've tried so far) at very high speed (a few seconds for 150k reads of 500 bp each). I think I will stick with this solution because of the issues you (@rfm-targa) mentioned about Greengenes, Silva and RDP (I need as much information as possible at genus/species levels). Again, thanks a lot!

your-highness commented 6 years ago

I have a question out of curiosity to @BenoitGoutorbe and @rfm-targa 👍

When building a Kraken database for amplicon sequencing strategies, do you restrict your reference sequences (i.e. NCBI Targeted Loci project) to your amplified regions exclusively? What is your opinion on reducing the references to e.g. V3 if your primers target only V3?

rfm-targa commented 6 years ago

@your-highness Personally, I don't limit my database to the targeted region. Due to the way Kraken works, I don't think limiting the database will improve performance. The database with full 16S sequences should contain the k-mers for the region of interest and classify just as well or better. If we limit the database to the targeted region we might create some problems like:

I've classified with full 16S databases and the results weren't bad, since there are a lot of reference sequences. One problem that can't really be solved is that, at species level, different species might have the exact same 16S sequence or 16S region, and that there might be multiple, non-identical copies of 16S in the same bacterium.
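To illustrate why I think the full-length database already covers the targeted region, here is a tiny Python sketch. It uses a random toy sequence, plain (non-canonical) k-mers and k = 31, so it is a simplification of what Kraken actually stores. The coordinates of the "region" are arbitrary.

```python
import random


def kmers(seq, k=31):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


random.seed(0)
# Toy stand-in for a full-length 16S gene (about 1500 bp).
full_gene = "".join(random.choice("ACGT") for _ in range(1500))
# Toy stand-in for an amplified region inside that gene.
region = full_gene[340:800]

full_kmers = kmers(full_gene)
region_kmers = kmers(region)

# Every k-mer of the region is also a k-mer of the full gene.
print(region_kmers <= full_kmers)
```

Since the region is a substring of the gene, the subset check holds for any sequence and any k, so a read from the amplified region can match a full-length reference just as it would match a trimmed one.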

Well, this is just my opinion about some things, hope it helps.