DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Trying to run the nucleotide (nt) database across multiple nodes on my university's HPC cluster #827

Open JDMirza opened 2 months ago

JDMirza commented 2 months ago

My university's HPC cluster allows me up to 490 GB of RAM per node (via SLURM) when running Kraken2 databases. I've tried running the nt database across multiple nodes so that the roughly 720 GB of RAM it needs can be allocated, but I have not had any success; the job fails to allocate the memory. Is there a way to configure Kraken2 so the hash.k2d table can be loaded across multiple nodes and enough RAM allocated that way, or is this not possible?
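For reference, the kind of single-node SLURM submission I mean looks roughly like the sketch below; the partition name, module name, and paths are placeholders for my setup, not tested values. Kraken2's `--memory-mapping` option maps the database from disk instead of loading it fully into RAM, at the cost of slower classification, so I've noted it here as well.

```bash
#!/bin/bash
#SBATCH --job-name=kraken2-nt
#SBATCH --nodes=1                  # Kraken2 is multithreaded, not MPI-distributed, so it runs on one node
#SBATCH --partition=highmem        # placeholder partition name
#SBATCH --mem=490G                 # maximum RAM available per node on this cluster
#SBATCH --cpus-per-task=16
#SBATCH --time=24:00:00

module load kraken2                # placeholder; depends on the cluster's module system

# --memory-mapping maps hash.k2d from disk instead of loading it all into RAM,
# trading classification speed for a much smaller memory footprint.
kraken2 --db /path/to/k2_nt \
        --threads ${SLURM_CPUS_PER_TASK} \
        --memory-mapping \
        --gzip-compressed \
        --paired reads_1.fastq.gz reads_2.fastq.gz \
        --report sample.k2report \
        --output sample.kraken2
```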

Ayala-Ruan-CesarM commented 2 months ago

Hi! Though I am not a Kraken2 developer, I find this problem fascinating. Have you tried splitting the hash.k2d table and assigning each split part to a single node? Also, a less elegant solution could be to build your own database in batches, so you can run it within your RAM limitations.
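A batched build along those lines might look something like the sketch below. The database names, library files, and size cap are only illustrative; `kraken2-build --max-db-size` downsamples the minimizer set so the resulting hash table fits under a given byte limit, which is another way to stay within a node's RAM.

```bash
# Illustrative only: build two smaller databases instead of one full-size nt database.
# File names and the ~400 GB cap are placeholders, not tested values.

DB1=nt_part1
DB2=nt_part2

# Each part needs the taxonomy
kraken2-build --download-taxonomy --db $DB1
kraken2-build --download-taxonomy --db $DB2

# Add different subsets of the reference sequences to each part
kraken2-build --add-to-library part1_sequences.fna --db $DB1
kraken2-build --add-to-library part2_sequences.fna --db $DB2

# --max-db-size caps the hash table (in bytes) by downsampling minimizers,
# so each part stays within the 490 GB-per-node limit.
kraken2-build --build --db $DB1 --threads 16 --max-db-size 400000000000
kraken2-build --build --db $DB2 --threads 16 --max-db-size 400000000000
```

Note that classifying the same reads against two partial databases produces two independent reports, so the results would still need to be reconciled afterwards.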

JDMirza commented 2 months ago

Building split databases would be a possibility, though admittedly I normally just use wget to download the standard k2 databases and then extract them, as it is a lot less time-consuming than building them. I'll look into building the database to see whether I can get it partitioned without affecting read assignment. I'm also curious whether there is a way to modify the .k2d files directly; I know I can inspect them with the kraken2-inspect script, but beyond that I'm a bit stuck at the moment. I will keep going through the documentation and update here if I can rebuild this as a split database.
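For what it's worth, the download-and-inspect workflow I mean is roughly the following. The archive name is a placeholder (the actual file name and date tag come from the Kraken 2 pre-built database index), and kraken2-inspect only reports the database contents rather than offering a way to edit the .k2d files.

```bash
# Illustrative: download and extract a pre-built database, then inspect it.
# The exact archive name/date tag must be taken from the pre-built database index;
# this one is a placeholder.
mkdir -p k2_nt
wget https://genome-idx.s3.amazonaws.com/kraken/k2_nt_YYYYMMDD.tar.gz
tar -xzf k2_nt_YYYYMMDD.tar.gz -C k2_nt

# kraken2-inspect reports the minimizer counts per taxon stored in hash.k2d;
# it is read-only and does not provide a way to split or modify the table.
kraken2-inspect --db k2_nt --threads 8 | head
```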