DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License
733 stars 274 forks

Processing of Data Slow #893

Open jhnath21 opened 2 days ago

jhnath21 commented 2 days ago

As datasets have grown with new chemistry and instrumentation, and as the reference databases have grown as well, processing has become very slow. We have tried various numbers of threads (96, 128, and 192), but this has not helped. In addition, the larger the datasets become, the more memory the processing computer needs.

Is there a way to speed up the analysis that we have missed, and a way to avoid needing large amounts of RAM with these larger datasets? For example, a file containing >20M reads now takes 4+ days to process, whereas in the past it took only ~6 hrs (~5M reads/hr with just 16 threads). Currently we can't use a server with less than 512 GB of RAM.
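
To put the reported figures side by side, here is a rough throughput comparison. It is only a sketch: it uses the boundary values quoted above (20M reads, 4 days) as if they were exact, so the result is approximate, and the function names are made up for illustration.

```python
def reads_per_hour(total_reads: float, hours: float) -> float:
    """Average throughput in reads per hour."""
    return total_reads / hours


def slowdown_factor(past_rate: float, current_rate: float) -> float:
    """How many times slower the current rate is than the past rate."""
    return past_rate / current_rate


# Figures quoted in the post; ">20M" and "4+ days" are treated as exact here.
past_rate = 5_000_000                                  # ~5M reads/hr, 16 threads
current_rate = reads_per_hour(20_000_000, 4 * 24)      # 20M reads in 4 days

print(f"current rate: {current_rate:,.0f} reads/hr")
print(f"slowdown:     {slowdown_factor(past_rate, current_rate):.0f}x")
```

On these numbers the current runs are roughly 24 times slower than the earlier ones, far more than the growth in read count alone would explain.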

salzberg commented 2 days ago

You can use KrakenUniq with its new low-memory option, which lets you run on a server with any amount of memory, even just 16 GB. There's a time penalty, but it's not bad. See our short paper about it: https://pubmed.ncbi.nlm.nih.gov/37602140/
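
A minimal sketch of what such a run might look like, assuming the low-memory option is the `--preload-size` flag described in the KrakenUniq documentation (it loads the database in chunks of the given size). The paths, the chunk size, and the helper name are hypothetical; check `krakenuniq --help` for the flags your installed version supports.

```python
import shlex


def build_krakenuniq_command(db_dir, reads, report, output,
                             preload_size="12G", threads=16):
    """Assemble a KrakenUniq command line as an argument list.

    Assumption: --preload-size caps how much of the database is held
    in memory at once, so it should be set below the machine's RAM.
    """
    return [
        "krakenuniq",
        "--db", db_dir,
        "--preload-size", preload_size,
        "--threads", str(threads),
        "--report-file", report,
        "--output", output,
        reads,
    ]


# Hypothetical paths, for illustration only.
cmd = build_krakenuniq_command(
    db_dir="/data/krakenuniq_db",
    reads="sample.fastq",
    report="sample.report.tsv",
    output="sample.classified.tsv",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues, and makes the chunk size easy to adjust for machines with different amounts of RAM.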