PROBIC / mSWEEP

mSWEEP: High-resolution sweep metagenomics using fast probabilistic inference

Optimizing mSWEEP runs on large datasets #29

Closed. EnriqueDoster closed this issue 2 months ago.

EnriqueDoster commented 2 months ago

Hello mSWEEP developers,

I'm having a hard time running mSWEEP on our samples, which average 30 million paired reads each. The Themisto index contains 2137 reference genomes, so our alignments are quite large, and I'm wondering how best to optimize the mSWEEP runs.

First I tried the default settings on a full node with 48 threads, but the process has not finished in almost three days. In the meantime, I have tried the --min-hits, --max-iters, and --tol flags with varying success, and I'm hoping to get your opinion on which combination of flags to use.

Here's what I tried:
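Roughly the following, with paths shortened for readability; the numeric values are just examples of what I tested, and the input flags follow the paired-end Themisto usage from the mSWEEP readme as I understand it:

```bash
# baseline: default estimation settings, full node with 48 threads
mSWEEP --themisto-1 sample_1.aln --themisto-2 sample_2.aln \
       -i cluster_indicators.txt -t 48 -o sample

# variations on the baseline, each flag tried separately
# (the values here are placeholders, not the exact ones I used)
mSWEEP --themisto-1 sample_1.aln --themisto-2 sample_2.aln \
       -i cluster_indicators.txt -t 48 -o sample --min-hits 2
mSWEEP --themisto-1 sample_1.aln --themisto-2 sample_2.aln \
       -i cluster_indicators.txt -t 48 -o sample --max-iters 1000
mSWEEP --themisto-1 sample_1.aln --themisto-2 sample_2.aln \
       -i cluster_indicators.txt -t 48 -o sample --tol 0.01
```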

Do you have recommendations on which combination of flags could reduce the run time without greatly affecting the results?

Thank you in advance for your time, Enrique

tmaklin commented 2 months ago

Hi Enrique, that sounds a little odd to me; 2137 genomes shouldn't be prohibitively large, as I have run the method successfully on much larger reference sets.

Can you give me a few more details, such as:

The different flags you tried can also be combined. Values beyond --min-hits 1 probably aren't that useful if the references are all from the same species, because the reads tend to hit most of the references.

Needing high values of --tol can also indicate that either the references are not a good match for the sample, or the clustering supplied via -i is in the wrong order, as both can cause the estimation to get stuck since there is no correct solution to converge to. v2.1.0 should automatically increase the tolerance if numerical accuracy is causing issues.
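For example, something along these lines combines the flags in a single run (the file names are placeholders, and the numeric values are only illustrative, not tuned recommendations):

```bash
mSWEEP --themisto-1 sample_1.aln --themisto-2 sample_2.aln \
       -i cluster_indicators.txt -t 48 -o sample \
       --min-hits 1 --max-iters 1000 --tol 0.000001
```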

thanks!

EnriqueDoster commented 2 months ago

Hi, thanks for the quick response.

Turns out, as usual, it was user error!

So, I had only run mSWEEP on a small test sample, and since everything looked good, I jumped straight to wrapping the commands in a Nextflow module. However, Nextflow seems to have been the problem: it was not correctly distributing the processes across multiple nodes, so the jobs simply never finished. I also tried running multiple processes with parallel, but that didn't seem to work either. I finally switched to a simple script with one command per line, and that worked well.
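For reference, the script that worked was just along these lines, with one mSWEEP invocation per line and the sample names as placeholders:

```bash
#!/bin/bash
# one line per sample, submitted as a single batch job
mSWEEP --themisto-1 sampleA_1.aln --themisto-2 sampleA_2.aln -i cluster_indicators.txt -t 48 -o sampleA
mSWEEP --themisto-1 sampleB_1.aln --themisto-2 sampleB_2.aln -i cluster_indicators.txt -t 48 -o sampleB
```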

Not sure how I messed up both the Nextflow and parallel runs, but it was not mSWEEP's fault at all!

Thanks for your help, Enrique

tmaklin commented 2 months ago

Glad to hear that everything worked out!