PROBIC / mSWEEP

mSWEEP: High-resolution sweep metagenomics using fast probabilistic inference

Optimizing mSWEEP runs on large datasets #29

Closed. EnriqueDoster closed this issue 2 months ago.

EnriqueDoster commented 2 months ago

Hello mSWEEP developers,

I'm having a hard time running mSWEEP on our samples, which average 30 million paired reads each. The Themisto index contains 2137 reference genomes, so our alignments are quite large, and I'm wondering how best to optimize the mSWEEP runs.

First I tried the default settings on a full node with 48 threads, but the process has not finished in almost three days. In the meantime, I have tried the --min-hits, --max-iters, and --tol flags with varying success, and I'm hoping to get your opinion on which combination of flags to use.

Here's what I tried:
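Roughly the following, with paths shortened for readability; the numeric values are just examples of what I tested, and the input flags follow the paired-end Themisto usage from the mSWEEP readme as I understand it:

```bash
# baseline: default estimation settings, full node with 48 threads
mSWEEP --themisto-1 sample_1.aln --themisto-2 sample_2.aln \
       -i cluster_indicators.txt -t 48 -o sample

# variations on the baseline, each flag tried separately
# (the values here are placeholders, not the exact ones I used)
mSWEEP --themisto-1 sample_1.aln --themisto-2 sample_2.aln \
       -i cluster_indicators.txt -t 48 -o sample --min-hits 2
mSWEEP --themisto-1 sample_1.aln --themisto-2 sample_2.aln \
       -i cluster_indicators.txt -t 48 -o sample --max-iters 1000
mSWEEP --themisto-1 sample_1.aln --themisto-2 sample_2.aln \
       -i cluster_indicators.txt -t 48 -o sample --tol 0.01
```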

Do you have recommendations on which combination of flags could reduce the run time without greatly affecting the results?

Thank you in advance for your time, Enrique

tmaklin commented 2 months ago

Hi Enrique, that sounds a little odd to me; 2137 genomes shouldn't be prohibitively large, as I have run the method successfully on much larger reference sets.

Can you give me a few more details, such as:

The different flags you tried can also be combined. Values beyond --min-hits 1 probably aren't that useful if the references are all from the same species, because the reads tend to hit most of the references.

Needing high values of --tol can also indicate that either the references are not a good match for the sample, or the clustering supplied via -i is in the wrong order, as both can cause the estimation to get stuck since there is no correct solution to converge to. v2.1.0 should automatically increase the tolerance if numerical accuracy is causing issues.
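For example, something along these lines combines the flags in a single run (the file names are placeholders, and the numeric values are only illustrative, not tuned recommendations):

```bash
mSWEEP --themisto-1 sample_1.aln --themisto-2 sample_2.aln \
       -i cluster_indicators.txt -t 48 -o sample \
       --min-hits 1 --max-iters 1000 --tol 0.000001
```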

thanks!

EnriqueDoster commented 2 months ago

Hi, thanks for the quick response.

Turns out, as usual, it was user error!

So, I had only run mSWEEP on a small test sample, and since everything looked good, I jumped straight to wrapping the commands in a Nextflow module. However, Nextflow seems to have been the problem: it was not correctly distributing the processes across multiple nodes, so the jobs simply never finished. I also tried running multiple processes with parallel, but that didn't seem to work either. I finally switched to a simple script with one command per line, and that worked well.
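For reference, the script that worked was just along these lines, with one mSWEEP invocation per line and the sample names as placeholders:

```bash
#!/bin/bash
# one line per sample, submitted as a single batch job
mSWEEP --themisto-1 sampleA_1.aln --themisto-2 sampleA_2.aln -i cluster_indicators.txt -t 48 -o sampleA
mSWEEP --themisto-1 sampleB_1.aln --themisto-2 sampleB_2.aln -i cluster_indicators.txt -t 48 -o sampleB
```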

Not sure how I messed up both the Nextflow and parallel runs, but it was not mSWEEP's fault at all!

Thanks for your help, Enrique

tmaklin commented 2 months ago

Glad to hear that everything worked out!