Closed EnriqueDoster closed 2 months ago
Hi Enrique, that sounds a little odd to me. 2137 genomes shouldn't be impossibly large, as I have run the method successfully on much larger reference sets.
Can you give me a few more details, such as:
The different flags you tried can also be combined. --min-hits values beyond --min-hits 1 probably aren't that useful if the references are all from the same species because the reads tend to hit most of the refs. Needing high values of --tol can also indicate that either the references are not a good match for the sample, or the clustering supplied via -i is in the wrong order, as both can cause the estimation to get stuck since there is no correct solution to converge to. v2.1.0 should automatically increase the tolerance if numerical accuracy is causing issues.
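As a rough illustration of combining the flags mentioned above, the Python sketch below assembles a single argument list. Only --min-hits, --tol, --max-iters and -i come from this thread. The helper name build_msweep_args, the base_args list and the file name clustering.txt are hypothetical placeholders, the input and output options for your mSWEEP version still need to be added to base_args, and the values shown are illustrative rather than recommended.

```python
from typing import List, Optional


def build_msweep_args(
    base_args: List[str],
    clustering: str,
    min_hits: Optional[int] = None,
    tol: Optional[float] = None,
    max_iters: Optional[int] = None,
) -> List[str]:
    """Append the tuning flags discussed in this thread to a base command."""
    args = list(base_args) + ["-i", clustering]
    # Each flag is only added when a value was given, so flags can be combined freely.
    if min_hits is not None:
        args += ["--min-hits", str(min_hits)]
    if tol is not None:
        args += ["--tol", str(tol)]
    if max_iters is not None:
        args += ["--max-iters", str(max_iters)]
    return args


# Hypothetical base command: the alignment and output options are left out here.
cmd = build_msweep_args(["mSWEEP"], "clustering.txt", min_hits=1, tol=0.1)
print(" ".join(cmd))
```

Keeping the command as a list rather than a single string makes it straightforward to pass to subprocess.run later without shell quoting problems.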
thanks!
Hi, thanks for the quick response.
Turns out, as usual, it was user error!
So, I only ran mSWEEP on a small test sample and everything looked good, so I jumped to wrapping the commands in a nextflow module. However, it seems like using nextflow was causing the problem because it was not correctly distributing the processes across multiple nodes so the jobs just never finished. I tried using parallel as well to run multiple processes, but that didn't seem to work either. I finally tried switching to a simple script with 1 command per line and that worked well.
Not sure how I messed up both nextflow and parallel runs, but it was not mSWEEP's fault at all!
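For anyone who wants to reproduce the "simple script with 1 command per line" approach, here is a minimal Python sketch that generates such a script. The function name write_command_script, the sample names and the command template are all hypothetical. In particular, using -o as an output prefix is an assumption to check against the help text of your mSWEEP version, and the alignment inputs are left out.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_command_script(samples: Iterable[str], template: str, path: Path) -> List[str]:
    """Write one command per line so the samples run one after another."""
    lines = [template.format(sample=s) for s in samples]
    # 'set -e' stops the script at the first failing command.
    path.write_text("#!/bin/sh\nset -e\n" + "\n".join(lines) + "\n")
    return lines


# Hypothetical template: fill in the real input and output options for mSWEEP.
template = "mSWEEP -i clustering.txt --min-hits 1 -o {sample}"

out = Path(tempfile.mkdtemp()) / "run_msweep.sh"
commands = write_command_script(["sampleA", "sampleB"], template, out)
print(out.read_text())
```

Running the commands strictly in sequence gives up parallelism across samples, but it makes it easy to see which sample, if any, fails to finish.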
Thanks for your help, Enrique
Glad to hear that everything worked out!
Hello mSWEEP developers,
I'm having a hard time running mSWEEP on our samples, which average 30 million paired reads per sample. The Themisto index consists of 2137 reference genomes, so our alignments are quite large, and I'm wondering how best to optimize the mSWEEP run.
First I tried using default settings, using a full node and 48 threads, but the process has not finished in almost three days. In the meantime, I tried using the --min-hits, --max-iters, and --tol flags to varying success and I'm hoping to get your opinion on what combination of flags to use.
Here's what I tried:
- The --max-iters flag didn't seem to make much of a difference, even down to 100.
- The --min-hits flag worked well to get results within a few hours, but I used an extreme value of 1,000,000, so that might have been too stringent.
- --tol 0.1 also got me results within a few hours and showed similar results to the --min-hits flag. However, I have no idea how to choose the best value.
Do you have recommendations on which combination of flags could help me improve run time without greatly influencing the results?
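One way to put a number on "similar results" between two runs with different flags is to compare the estimated abundances group by group. The Python sketch below assumes each result is a tab-separated table of group name and abundance with '#' comment lines. That layout, the function names and the toy values are assumptions for illustration, not taken from mSWEEP's documentation or from real output.

```python
from typing import Dict


def read_abundances(text: str) -> Dict[str, float]:
    """Parse 'name<TAB>value' lines, skipping blank lines and '#' comments."""
    table: Dict[str, float] = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        name, value = line.split("\t")[:2]
        table[name] = float(value)
    return table


def max_abs_difference(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Largest absolute difference over all groups; missing groups count as 0."""
    return max((abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b)), default=0.0)


# Toy values invented for illustration only, not real mSWEEP output.
run_a = read_abundances("#name\tabundance\ngroup1\t0.75\ngroup2\t0.25\n")
run_b = read_abundances("#name\tabundance\ngroup1\t0.5\ngroup2\t0.5\n")
print(max_abs_difference(run_a, run_b))
```

A small maximum difference between a fast setting and a slower reference run on the same sample would suggest the faster setting is not changing the estimates much. What counts as small depends on the analysis.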
Thank you in advance for your time, Enrique