Open paulzierep opened 3 months ago
The DB itself is capped at 8 GB or 16 GB. That's why the size of those DBs is limited to 8 GB/16 GB.
Thank you very much for the response, but could you explain how this is done technically? FYI, I have a student who investigates the performance of kraken2 DBs, and we are also looking into the effect of the capped DBs, but it would be good if we could explain what the technical difference is.
I have already requested that the scripts that are used to build these indices be shared.
https://github.com/BenLangmead/aws-indexes/issues/31
There is no response yet. Once the scripts are available, we will know exactly how the DB is capped.
I also build a wide variety of Kraken indices and created kraken-db-builder
to speed up index building. Take a look if you are interested.
Hey! The RAM-friendly DBs are indexed the same way as the full DBs. @BenLangmead et al. then subsample the resulting k-mers from each genome until they fit within the variously size-constrained indices. So the more genomes included, the smaller the k-mer subsample for each included genome... hope that makes sense? That's my interpretation at least, hopefully correct.
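To make that interpretation concrete, here is a minimal Python sketch of "subsample k-mers until they fit under a cap". It is only an illustration of the idea described above, not the scripts used to build the prebuilt indexes (those have not been shared in this thread). The function names, the SHA-256 hash, and the toy sequences are all assumptions made for the example. For reference, `kraken2-build` documents a `--max-db-size` option that downsamples the reference library to fit a hash table size limit; whether the prebuilt DBs were made that way is not confirmed here.

```python
import hashlib

HASH_SPACE = 2 ** 64  # we use the first 8 bytes of the digest as a 64-bit hash


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (SHA-256 is an arbitrary choice)."""
    digest = hashlib.sha256(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def subsample_to_cap(genomes, k, max_kmers):
    """Keep roughly max_kmers distinct k-mers across all genomes.

    genomes: dict mapping a (hypothetical) genome name to its sequence.
    Returns (keep_fraction, kept), where kept maps each genome name to
    the subset of its k-mers that survives the subsampling.
    """
    per_genome = {name: kmers(seq, k) for name, seq in genomes.items()}
    total = len(set().union(*per_genome.values()))
    # One global fraction: the more k-mers the inputs contribute,
    # the smaller the share that is kept from every genome.
    keep_fraction = min(1.0, max_kmers / total) if total else 1.0
    threshold = int(keep_fraction * HASH_SPACE)
    kept = {
        name: {km for km in ks if kmer_hash(km) < threshold}
        for name, ks in per_genome.items()
    }
    return keep_fraction, kept


if __name__ == "__main__":
    one = {"genome_a": "ACGTACGTAC"}
    two = {"genome_a": "ACGTACGTAC", "genome_b": "TTTTGGGGCC"}
    print(subsample_to_cap(one, k=4, max_kmers=4)[0])
    print(subsample_to_cap(two, k=4, max_kmers=4)[0])
```

Because the same hash threshold is applied to every genome, a k-mer is either kept everywhere or dropped everywhere, and the cap is only met approximately. With the same cap, adding the second toy genome lowers the kept fraction, which is the "more genomes, smaller subsample per genome" effect.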
It would probably make sense to start by further reducing the number of input genomes for over-represented species, but that would require some subjective choices and be way, way, way too much manual curation. Curious as to what your student turns up @paulzierep
I originally found this page when googling who to thank for these prebuilt dbs, as they've saved me a lot of effort and highmem node queuing over the last couple of years. ("This project is maintained by BenLangmead" in the corner of the project site did not initially clue me in, so I'm clearly not very brilliant.) I am thankful though - thanks db maintainers!
Could you kindly explain how the capped kraken2 DBs are capped? Is it random subsampling of the input, or of the DB itself?