fanglab / nanodisco

nanodisco: a toolbox for discovering and exploiting multiple types of DNA methylation from individual bacteria and microbiomes using nanopore sequencing.

Computational resources with `nanodisco difference` #18

Closed mtisza1 closed 3 years ago

mtisza1 commented 3 years ago

Hi,

First, thanks for the interesting tool!

I'm having a little trouble running `nanodisco difference` in the most effective way. The parameters `-nj`, `-p`, and `-nc`, and possibly how many chunks are subset (`-f` and `-l`), all seem to affect how resources are used.

Let's say I'm using an HPC with 64 CPUs and 64 GB of memory, and I have a reference genome with 1200 chunks. Using settings that I thought would be sensible, `-nj 16 -p 2 -nc 2 -f 1 -l 500`, nanodisco tried to request way more CPUs and memory than were available, and then my HPC got mad and canceled the job.

Is there some sort of formula or recommended settings for `nanodisco difference` based on available computational resources? Ideally, I wouldn't have to subset the genome with `-f` and `-l`. (I could also request more resources, if that would be helpful.)

Regards,

Mike

touala commented 3 years ago

Hi Mike,

Thank you for your interest.

Those parameters (`-nj`, `-p`, and `-nc`) are the main ones affecting resource consumption. However, memory usage is also impacted by the genomic coverage. For example, 2 jobs (`-nj 2 -p 2 -nc 2`) on 85x coverage datasets used >10 GB. In your case, I would expect memory to be the main limiting factor (>80 GB would be used for 16 jobs), but I'll investigate the CPU issue you mentioned. Have any difference files been generated so far?
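To make that back-of-envelope estimate explicit (the per-job figure below is extrapolated from the 85x example above, so treat it as a rough lower bound rather than a guarantee):

```bash
# Rough memory estimate: 2 jobs at ~85x coverage used >10 GB, i.e. >~5 GB per job.
per_job_gb=5   # approximate lower bound per job at ~85x coverage
nb_jobs=16     # value of -nj
echo "Expected memory: >$((per_job_gb * nb_jobs)) GB for ${nb_jobs} parallel jobs"
# Prints: Expected memory: >80 GB for 16 parallel jobs
```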

Unfortunately, I don't have a perfect formula for all situations (uneven coverage makes this difficult). I would recommend reducing the number of parallel jobs (`-nj`) and increasing the number of threads per job (`-p`). In practice, I would first run the command requesting a single job while monitoring resource usage: `-nj 1 -p 5 -nc 5 -f 100 -l 105` (avoid processing contig ends, where coverage can be reduced). With this information, you can estimate how much you can scale up while keeping a safety margin.
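As a minimal sketch of that probe run (the dataset-specific options are left as a placeholder here; fill them in as in the documentation for your own input, output, sample names, and reference):

```bash
# Dataset-specific options (input/output directories, sample names, reference
# genome) are left as a placeholder here; fill them in as in the documentation.
DATASET_OPTS=""   # placeholder

# Single job on chunks 100-105 (mid-genome, to avoid low-coverage contig ends).
# Monitor it with top or your scheduler's accounting to record peak memory and CPU.
nanodisco difference -nj 1 -p 5 -nc 5 -f 100 -l 105 $DATASET_OPTS
```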

If you have access to an HPC, using the job scheduler (like LSF) could be a good solution instead of relying on the built-in parallel approach (see code/difference.sh). With this solution, you can spawn a job for each subset of chunks as described above (e.g. `-nj 1 -p 5 -nc 5 -f 100 -l 105`) and use a loop to generate all chunk start/end combinations (e.g. 1 to 5, then 6 to 10, etc.), as sketched below. This can be easier to manage once resource requests are tuned, and it makes better use of available HPC resources.
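A minimal sketch of that loop with LSF, assuming 1200 chunks processed 5 at a time; the `bsub` resource requests and the dataset-specific nanodisco options are placeholders to adapt to your cluster and data:

```bash
#!/bin/bash
# Submit one LSF job per block of 5 chunks (1-5, 6-10, ..., 1196-1200).
total_chunks=1200
step=5
DATASET_OPTS=""   # placeholder: input/output directories, sample names, reference

mkdir -p logs
for first in $(seq 1 "$step" "$total_chunks"); do
    last=$(( first + step - 1 ))
    (( last > total_chunks )) && last=$total_chunks
    # -n 5 requests 5 cores per job; add a memory request (-M or -R "rusage[mem=...]")
    # following your site's convention, sized from the single-job test above.
    bsub -n 5 -o "logs/difference_${first}_${last}.log" \
        nanodisco difference -nj 1 -p 5 -nc 5 -f "$first" -l "$last" $DATASET_OPTS
done
```

Each job then covers its own chunk range, and the per-range outputs can be combined downstream.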

Hope this helps you overcome the problem, but please feel free to reach out again if you face any issues.

Alan

mtisza1 commented 3 years ago

OK, this is making a lot more sense now. Thanks for explaining it so carefully. My mind didn't immediately jump to scheduling a bunch of jobs for a single genome/metagenome, but since `nanodisco merge` is used downstream, it really shouldn't matter whether the files were generated by multiple jobs.
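For anyone who finds this later, the downstream combining step would be something along these lines (flags from my reading of the documentation, so verify them on the "Command Details" page; the paths and analysis name are placeholders):

```bash
# Combine the per-chunk-range difference files produced by the separate jobs.
# Flags per my reading of the docs (double-check on the "Command Details" page);
# the directory paths and analysis name are placeholders.
nanodisco merge -d analysis/difference -o analysis -b my_sample
```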

Perhaps this would be an easy inference for other users, but I would have personally benefitted if something along these lines were in the "Detailed Tutorial" or "Command Details" sections of the documentation.

(FYI, I was able to generate the files and run the downstream analyses using just a handful of chunks, so there were no issues running the scripts otherwise.)

I'll close this issue now. Cheers.

touala commented 3 years ago

Great that you were able to conduct the downstream analyses anyway. And thank you very much for the feedback; we have revised the documentation accordingly.

Alan