Silva and Greengenes support

rgiannico commented 9 years ago

Hi Derrick, We talked about this topic during 2014 and I know you are working on it. I'm very excited to see you released a new kraken version! I really think kraken could be one of the best solutions for metabarcoding/metagenomics analysis. Are you planning to release a guide or a script to create a kraken database from Silva, Greengenes or other custom databases?

DerrickWood commented 9 years ago

Hi Riccardo,

I am indeed planning to add an option to kraken-build that would set up at least a Greengenes database, and probably Silva as well. 16S DB support is on the TODO list on my office whiteboard. :)

I'll see what I can do with this today - I think the original test DBs were lost due to some HD failures, but I don't think it's too difficult to redo that work, especially with the latest Kraken's support for direct assignment of taxa to sequences.

igra666 commented 8 years ago

Hi Derrick, I'm also looking for an option to use the SILVA database with Kraken. Did you make it available anywhere? That would be great! Regards

pbuendia commented 8 years ago

Hi, I too would like to have access to a Greengenes database to use with Kraken. Will there be one soon? Pat

bengouts commented 6 years ago

Hi Jennifer, Hi Derrick, I am doing taxonomic classification on 16S data, so it seems stupid to use a RefSeq-based Kraken DB. Is the '16s-dev' branch that you started in 2014/15 to build Kraken DBs from Greengenes, Silva or RDP ready to be used? I guess the answer is no, otherwise I don't get why you didn't merge it into master and why these options are not available in the recent releases. Do you plan to work on that in the future? I understand you guys are more interested in dealing with shotgun data, but your algorithm is also very exciting for padawans working on 16S data! Best regards, Benoit

your-highness commented 6 years ago

Dear @BenoitGoutorbe ,

We are also using k-mer matching for 16S V3-4-based classification in our lab. We use the NCBI-curated RefSeq Targeted Loci project FASTA files from https://www.ncbi.nlm.nih.gov/refseq/targetedloci/ and build a Kraken DB out of these. Ideally, you restrict the FASTA files to your amplicon sequences.
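For anyone who wants to script this, here is a rough sketch of the usual `kraken-build` sequence (download taxonomy, add FASTA files to the library, build). It only assembles the command lines and does not run them. The database name and FASTA file name are placeholders I made up, and this is not necessarily the exact procedure used in our lab.

```python
import shlex
from typing import List


def kraken_build_commands(db: str, fasta_files: List[str], threads: int = 8) -> List[List[str]]:
    """Return the kraken-build invocations, in order, without executing them."""
    # Step 1: fetch the NCBI taxonomy into the database directory.
    cmds = [["kraken-build", "--download-taxonomy", "--db", db]]
    # Step 2: add each reference FASTA file to the database library.
    for fasta in fasta_files:
        cmds.append(["kraken-build", "--add-to-library", fasta, "--db", db])
    # Step 3: build the k-mer database from the library.
    cmds.append(["kraken-build", "--build", "--threads", str(threads), "--db", db])
    return cmds


if __name__ == "__main__":
    # "targeted_loci_db" and "16S_targeted_loci.fasta" are hypothetical names.
    for cmd in kraken_build_commands("targeted_loci_db", ["16S_targeted_loci.fasta"]):
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or passed to `subprocess.run(cmd, check=True)` if you prefer to drive the build from Python.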

All the best

rfm-targa commented 6 years ago

Hello everyone,

It's possible to build Greengenes and SILVA databases for Kraken, but in order to maintain their original taxonomy I had to create custom names.dmp and nodes.dmp files for each of those 16S databases. The sequence headers also have to be formatted as explained in the Kraken manual so that the DB builds without problems. I have a repository with the process to build a Greengenes 13.5 database (full file). While the steps I describe in the repository should work just fine, I would like to update it in the near future to include a faster process that also works for Greengenes 13.8 and SILVA. I currently have both databases for Kraken. Check the repo if you want; it might help if you really want to adapt those databases.

Anyway, just as @your-highness said, the NCBI Targeted Loci project is also a good option. I've also used those sequences with Kraken and they give good results. It's a small file with around 20K sequences, but they are all annotated to species rank and cover a lot of species.

Greengenes and SILVA aren't good if you want to classify at species level: SILVA sequences are annotated at genus level at most (anything at species level isn't really 'correct'), and Greengenes hasn't been updated in a long time and only has around 637 species represented.
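To give an idea of what the custom names.dmp and nodes.dmp files and the header formatting involve, here is a minimal sketch. It is not the code from my repository. It assumes Greengenes-style lineage strings (`k__...; p__...; ...`), assigns arbitrary taxids starting at 2 under a root node with taxid 1, and writes headers using the `kraken:taxid|XXX` convention from the Kraken manual. The function names are made up for illustration, and species epithets are used as-is rather than combined with the genus name.

```python
from typing import Dict, List, Tuple

# Greengenes rank prefixes mapped to rank names used in nodes.dmp.
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}


def build_taxonomy(lineages: Dict[str, str]) -> Tuple[List[str], List[str], Dict[str, int]]:
    """Map sequence ID -> lineage string to nodes.dmp lines, names.dmp lines,
    and a sequence ID -> taxid table."""
    path_to_taxid: Dict[Tuple[str, ...], int] = {}
    nodes = ["1\t|\t1\t|\tno rank\t|"]                   # root is its own parent
    names = ["1\t|\troot\t|\t\t|\tscientific name\t|"]
    seq_to_taxid: Dict[str, int] = {}
    for seq_id, lineage in lineages.items():
        parent = 1
        path: Tuple[str, ...] = ()
        for field in lineage.split(";"):
            field = field.strip()
            prefix, _, name = field.partition("__")
            if not name:          # empty rank such as "s__": stop at the deepest named rank
                break
            path = path + (field,)
            if path not in path_to_taxid:
                taxid = len(path_to_taxid) + 2
                path_to_taxid[path] = taxid
                nodes.append(f"{taxid}\t|\t{parent}\t|\t{RANKS.get(prefix, 'no rank')}\t|")
                names.append(f"{taxid}\t|\t{name}\t|\t\t|\tscientific name\t|")
            parent = path_to_taxid[path]
        seq_to_taxid[seq_id] = parent
    return nodes, names, seq_to_taxid


def kraken_header(seq_id: str, taxid: int) -> str:
    """FASTA header with the taxon assigned directly to the sequence."""
    return f">{seq_id}|kraken:taxid|{taxid}"


if __name__ == "__main__":
    example = {"228054": "k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; "
                         "o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__"}
    nodes, names, seq_to_taxid = build_taxonomy(example)
    print("\n".join(nodes))
    print("\n".join(names))
    print(kraken_header("228054", seq_to_taxid["228054"]))
```

Taxids are keyed on the full lineage path rather than on the name alone, so two taxa that share a name under different parents get separate nodes.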

Best regards

bengouts commented 6 years ago

Thanks a lot for this precious help. I was not aware of this Targeted Loci database from RefSeq and it's exactly what I needed. I built my Kraken database from it within 10 minutes (8 threads, 64 GB of RAM) and it classifies my reads very well (about 99.5% to the phylum level and 80% to the species level for the few samples I've tried so far) at very high speed (a few seconds for 150k reads of 500 bp each). I think I will stick with this solution because of the issues you (@rfm-targa) mentioned about Greengenes, Silva and RDP (I need as much information as possible at genus/species levels). Again, thanks a lot!

your-highness commented 6 years ago

I have a question out of curiosity for @BenoitGoutorbe and @rfm-targa 👍

When building a Kraken database for amplicon sequencing strategies, do you restrict your reference sequences (i.e. NCBI Targeted Loci project) to your amplified regions exclusively? What is your opinion on reducing the references to e.g. V3 if your primers target only V3?

rfm-targa commented 6 years ago

@your-highness Personally, I don't limit my database to the targeted region. Due to the way Kraken works, I don't think limiting the database will improve performance. The database with full 16S sequences should contain the k-mers for the region of interest and classify just as well or better. If we limit the database to the targeted region we might create some problems like:

We have a database with only the targeted region, but the sequences used to construct it were obtained with certain primers or with software that extracts the region from full 16S sequences. It will be difficult to have query sequences that span exactly the same region as in our database, and because of that we will get wrong results. It's difficult to get only the targeted region when using primers, so in my opinion using more than just the targeted region is a plus: with the full sequence you can match the full region and classify based on all the information, without getting wrong hits because one region was slightly shorter or longer.
In my opinion, a database with full sequences is better, and using more than one region or a longer target is obviously better. In the case of the 16S rRNA, including variable regions and parts of 'constant' regions might help even more, since there are different species whose variable regions have the exact same sequence, and including more info from the 'constant' regions might help ('A systematic search for discriminating sites in the 16S ribosomal RNA gene' by Hilde Vinje et al. might be interesting).
Creating a database from one targeted region will only work for that region, and it might be more practical to have a database that can be used more broadly.
A database based on a target like full 16S will not take much disk space and will run well on a laptop with 16 GB of RAM (it might work well with less, I didn't test). Reducing the target to only a variable region will have a small impact on computing requirements. I even run Kraken with a 16S database made from a filtered SILVA 132 (around 2 GB) on a 16 GB machine and it's fine. In this case, RAM requirements are mainly due to Kraken's indexing structure, which takes a fixed amount of space (increasing after that as you add more and more sequences).
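The first point above can be illustrated with a toy k-mer example. This is only a sketch of the set-membership idea, not of Kraken's actual implementation (Kraken also works with canonical k-mers, minimizers and LCA assignment). The sequence is random rather than a real 16S gene, and the coordinates of the "region" and the "read" are arbitrary assumptions; k = 31 is Kraken's default.

```python
import random


def kmers(seq: str, k: int) -> set:
    """Set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


K = 31
rng = random.Random(0)

# Stand-in for a full-length 16S reference (random bases, not a real gene).
full_gene = "".join(rng.choice("ACGT") for _ in range(1500))
# Hypothetical reference trimmed to the "targeted region".
region = full_gene[400:600]
# A read that overhangs the trimmed region by 10 bp on each side,
# e.g. because different primers or trimming software were used.
read = full_gene[390:610]

full_db = kmers(full_gene, K)
region_db = kmers(region, K)
read_kmers = kmers(read, K)

hits_full = len(read_kmers & full_db)
hits_region = len(read_kmers & region_db)

print(f"read k-mers: {len(read_kmers)}")
print(f"found in full-length database: {hits_full}")
print(f"found in region-only database: {hits_region}")
```

Every k-mer of the region-only database is also in the full-length one, so the full-length database loses nothing for reads that fall inside the region, while the k-mers from the overhanging ends of the read are only found in the full-length database.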

I've classified with full 16S databases and the results weren't bad, since there are a lot of reference sequences. One problem that can't really be solved is that, at species level, different species might have the exact same 16S sequence or 16S region, and that the same bacterium might carry multiple 16S copies that are not identical.

Well, this is just my opinion about some things, hope it helps.

DerrickWood / kraken

Silva and Greengenes support #14