"cactus-align --restart" repeats the same steps and does not proceed to the next steps.

jyj5558 commented 1 year ago

Hi, thanks so much for all your efforts to actively maintain and develop Cactus.

I am following the Minigraph-Cactus pangenome pipeline step by step using about 30 softmasked bird assemblies on a SLURM system (128 cores, 1000G memory, and a 1-day walltime per SLURM job). For background, these assemblies were constructed from low-coverage PE reads (6-8x) using SOAPdenovo2 and RagTag, so they contain many "N"s, or gaps, although I am not sure whether those gaps caused the problem below. Recently I ran into a puzzling situation: cactus-align with the "--restart" option repeated the same steps over and over across four consecutive SLURM jobs. Due to my institution's policy I only have a one-day walltime per SLURM job, so I had to resubmit the same job with the "--restart" option several times. The first few times, the run seemed to make continuous progress. After that, it seems to have gotten stuck on specific chromosomes (maybe?).

The code I used is: cactus-align --batch ./jobstore ./contigs/chromfile.txt ${MYBUCKET}/align --consCores 16 --maxCores 128 --maxMemory 1000G --workDir Temp/ --logFile align.log --maxNodes 20 --nodeStorage 1000 --pangenome --maxLen 10000 --reference ${Ref} --outVG --realTimeLogging

I have attached a log file from a run that made progress and log files from the runs that repeated the same steps. I would really appreciate it if you could look into the files and give me any guidance. Please let me know if you need any further information about these jobs.

Thanks, pangenome_25461669 (successful).err.txt pangenome_25486260 (not successful).err.txt pangenome_25513031 (not successful).err.txt pangenome_25544382 (not successful).err.txt

glennhickey commented 1 year ago

That's pretty strange.

About the --restart option, it restarts at the beginning of the Toil job, which in this case seems to be the run of cactus_consolidated. So if that run fails, it can only try resuming at the beginning of that command.
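To illustrate why resubmitting can appear to repeat the same steps, here is a minimal toy model in Python. It is a sketch and not how Toil is implemented: it assumes jobs run one after another, that finished jobs are remembered across submissions, and that a job cut off by the walltime keeps no partial progress. The function name and the job durations are hypothetical.

```python
def simulate_restarts(job_hours, walltime_hours, max_submissions):
    """Return how many jobs are complete after each submission.

    Toy model (an assumption, not Toil's real scheduler): jobs run
    serially, and a job interrupted by the walltime limit starts over
    from its beginning on the next submission.
    """
    completed = 0  # jobs finished so far; these are never rerun
    history = []
    for _ in range(max_submissions):
        remaining = walltime_hours
        # Resume at the first unfinished job, always from its start.
        while completed < len(job_hours) and job_hours[completed] <= remaining:
            remaining -= job_hours[completed]
            completed += 1
        history.append(completed)
        if completed == len(job_hours):
            break
    return history


# Hypothetical durations in hours: two short jobs, then one 30-hour job.
print(simulate_restarts([5, 10, 30, 2], walltime_hours=24, max_submissions=4))
```

In this model, any single job that needs longer than the walltime never completes, no matter how many times the run is resubmitted, which matches the repeated log output described above.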

But looking at your first failing log, it seems you have at least three chromosomes running, all getting stuck at the same part of cactus_consolidated (the beginning of the reference phase). I've seen this part take a long time for alignments of distant species in progressive Cactus, but never in the pangenome pipeline.

I guess it is possible that the Ns are related to this, since I haven't tested it on lots of data like this.

Since it's not crashing, it's hard to say it's a bug though. Maybe it would run through fine if it had a few more hours. To that end, you might consider setting --consCores to something much higher. This will run fewer jobs in parallel, giving each more cores. That gives cactus_consolidated a better chance of finishing in under 24 hours, which would help --restart make better progress.
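The trade-off can be sketched with simple arithmetic. The snippet below is only an illustration under one assumption: that concurrent cactus_consolidated jobs are limited by the core budget alone (memory and other resources are ignored). The function name is hypothetical; the values 128 and 16 come from the command above.

```python
def concurrency(max_cores, cons_cores):
    """Return (jobs that fit at once, cores per job) for a core-only budget."""
    if cons_cores <= 0 or cons_cores > max_cores:
        raise ValueError("cons_cores must be between 1 and max_cores")
    return max_cores // cons_cores, cons_cores


# Compare the original setting (16) with larger values, up to all 128 cores.
for cons in (16, 64, 128):
    jobs, cores = concurrency(128, cons)
    print(f"--consCores {cons}: up to {jobs} concurrent job(s), {cores} cores each")
```

Under this assumption, --consCores 16 allows up to 8 jobs at once with 16 cores each, while --consCores 128 runs one job at a time with all 128 cores. How much faster each job actually finishes depends on how well cactus_consolidated scales with threads, which this sketch does not model.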

We're working on better Slurm support in Toil and Cactus now, so in the next release, you should be able to use --batchSystem slurm directly with Cactus to submit Slurm jobs for you, which may also help.

jyj5558 commented 1 year ago

Thanks so much for your prompt and helpful reply! I will try running with "--consCores" set to all the cores I can assign, and if that does not work, maybe I should try gap-filling before I dive into the M-C pipeline. I will update here. Thanks again!

jyj5558 commented 1 year ago

The issue was solved by setting "--consCores" to all the available cores. I appreciate your help!

ComparativeGenomicsToolkit / cactus

"cactus-align --restart" repeats the same steps and does not proceed to the next steps. #1036