ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Cactus hanging when specifying "--workDir" #1377

Closed: amsession closed this issue 2 weeks ago

amsession commented 1 month ago

I am having an issue when I need to specify a work directory for temporary files. A run hung for about 10 days before I started troubleshooting, and the problem reproduces even with the sample data set, which finishes in under 10 minutes when I don't specify a work directory. I am running this command through Singularity:

singularity exec ~/LOCAL.INSTALL/cactus/cactus.img cactus ./js ./examples/evolverMammals.txt ./evolverMammals.hal --maxCores=32 --workDir /data/home/asession/LOCAL.INSTALL/cactus/cactus-bin-v2.8.1/work/

and get the following log:

[2024-05-06T16:08:05-0400] [MainThread] [I] [toil.statsAndLogging] Enabling realtime logging in Toil
[2024-05-06T16:08:05-0400] [MainThread] [I] [toil.statsAndLogging] Cactus Command: /home/cactus/cactus_env/bin/cactus ./js ./examples/evolverMammals.txt ./evolverMammals.hal --maxCores=32 --workDir /data/home/asession/LOCAL.INSTALL/cactus/cactus-bin-v2.8.1/work/
[2024-05-06T16:08:05-0400] [MainThread] [I] [toil.statsAndLogging] Cactus Commit: d6fa7c4832892676ff2d1d7e36b0dcf7b6819504
[2024-05-06T16:08:05-0400] [MainThread] [I] [toil.statsAndLogging] Tree: ((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.271974)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)Anc2:0.032898)Anc0;
[2024-05-06T16:08:05-0400] [MainThread] [I] [toil.statsAndLogging] Importing https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/mammals/loci1/simCow.chr6
[2024-05-06T16:08:06-0400] [MainThread] [I] [toil.statsAndLogging] Importing https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/mammals/loci1/simDog.chr6
[2024-05-06T16:08:06-0400] [MainThread] [I] [toil.statsAndLogging] Importing https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/mammals/loci1/simHuman.chr6
[2024-05-06T16:08:06-0400] [MainThread] [I] [toil.statsAndLogging] Importing https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/mammals/loci1/simMouse.chr6
[2024-05-06T16:08:07-0400] [MainThread] [I] [toil.statsAndLogging] Importing https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/mammals/loci1/simRat.chr6
[2024-05-06T16:08:07-0400] [MainThread] [I] [toil.job] Saving graph of 1 jobs, 1 non-service, 1 new
[2024-05-06T16:08:07-0400] [MainThread] [I] [toil.job] Processing job 'progressive_workflow' kind-progressive_workflow/instance-992ahvuw v0
[2024-05-06T16:08:07-0400] [MainThread] [I] [toil] Running Toil version 5.12.0-6d5a5b83b649cd8adf34a5cfe89e7690c95189d3 on host compute171.
[2024-05-06T16:08:07-0400] [MainThread] [I] [toil.realtimeLogger] Starting real-time logging.
[2024-05-06T16:08:07-0400] [MainThread] [I] [toil.leader] Issued job 'progressive_workflow' kind-progressive_workflow/instance-992ahvuw v1 with job batch system ID: 1 and disk: 2.0 Gi, memory: 2.0 Gi, cores: 1, accelerators: [], preemptible: False
[2024-05-06T16:08:07-0400] [MainThread] [W] [toil.common] XDG_RUNTIME_DIR is set to nonexistent directory /run/user/1306; your environment may be out of spec!
[2024-05-06T16:08:07-0400] [MainThread] [I] [toil.worker] Redirecting logging to /data/home/asession/LOCAL.INSTALL/cactus/cactus-bin-v2.8.1/work/2a15b3bb791d519bb8fd779aa634a394/026f/worker_log.txt
[2024-05-06T16:08:09-0400] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run

In the older job that ran for a week, I just got the "1 jobs are running" message every hour with no further output. At the same point in the log of a local run that succeeds (without --workDir), there are 0 jobs running and it jumps straight to the sanitize_fasta_header step. I also tried without the trailing '/' in the absolute path, but got the same behavior. Is there something wrong with my syntax, or is this a bug?

amsession commented 1 month ago

Following up to say that, for some reason, this problem also persists if I set the TMPDIR environment variable instead.

glennhickey commented 1 month ago

I don't think this has anything to do with specifying your work directory. It looks like it's failing to execute even a single job. You can try with --logDebug to maybe get a little more information. I just tried

mkdir work
singularity exec -B $(pwd):/data docker://quay.io/comparative-genomics-toolkit/cactus:v2.8.1 cactus ./js ./examples/evolverMammals.txt ./evolverMammals.hal --maxCores=32 --workDir /data/work/

here and it ran fine.
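For reference, a minimal sketch of the original invocation with --logDebug appended (assuming the same image path and work directory as in the first comment):

singularity exec ~/LOCAL.INSTALL/cactus/cactus.img cactus ./js ./examples/evolverMammals.txt ./evolverMammals.hal --maxCores=32 --workDir /data/home/asession/LOCAL.INSTALL/cactus/cactus-bin-v2.8.1/work/ --logDebug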

amsession commented 1 month ago

Thanks for the quick reply. Sorry for not being clear: I meant that this problem only shows up if I specify the work directory or try to set the TMPDIR environment variable. If I run the command without either, it works fine, writing files to /tmp/. Cactus fails when I try to launch a bigger job and suggests specifying a work directory, but then doesn't launch any jobs, as you say above. I re-ran with --logDebug and attach that log file here:

NoSpecifyLog.txt

The final bit is why I thought specifying the workDir was necessary:

toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job 'sanitize_fasta_header' kind-sanitize_fasta_header/instance-b3b8e9b2 v1 is requesting 21062132315 bytes of disk for temporary space, more than the maximum of 12790816768 bytes of disk that SingleMachineBatchSystem was configured with, or enforced by --maxDisk. Try setting/changing the toil option "--workDir" or changing the base temporary directory by setting TMPDIR. Scale is set to 1.0.

I tried running the command as you have it above and hit the same problem. Attaching that log file below.

SpecifyLog.txt

From what I can tell, there is no hard limit on how much I can write to /tmp/ that is relevant here (the actual limit is 4 TB, much higher than the figure in the log message). This makes me think the limit must be imposed by Singularity or Cactus. Not sure if this could be another route to getting it to work, but figured I'd ask.
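One quick sanity check (a sketch only, not from the attached logs): since the error points at --workDir and TMPDIR, the roughly 12 GiB cap presumably reflects the free space Toil sees where it writes temporary files, so comparing free space on the host with what is visible inside the container can show whether the container is looking at a smaller filesystem than the 4 TB /tmp described above:

# free space on the host
df -h /tmp /data/home/asession/LOCAL.INSTALL/cactus/cactus-bin-v2.8.1/work/
# free space as seen from inside the container
singularity exec ~/LOCAL.INSTALL/cactus/cactus.img df -h /tmp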

glennhickey commented 1 month ago

Sorry, I don't really know. It must be a singularity thing. For what it's worth, when running with --binariesMode singularity, cactus uses

singularity exec -u -B $(pwd):/mnt --pwd /mnt

So you may want to try with these options (-u -B --pwd) to map your current working directory.
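Putting those flags together with the original invocation would look something like this (a sketch only; it assumes the js/ and examples/ inputs sit under the current directory and puts the work directory under the mapped /mnt path):

mkdir -p work
singularity exec -u -B $(pwd):/mnt --pwd /mnt ~/LOCAL.INSTALL/cactus/cactus.img cactus ./js ./examples/evolverMammals.txt ./evolverMammals.hal --maxCores=32 --workDir /mnt/work/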

amsession commented 2 weeks ago

Commenting here to say that our server was updated, and re-running with Apptainer instead of Singularity on the latest Cactus build fixed this issue. I don't know whether it was Apptainer itself or the Cactus update, but I'm closing the issue.