ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Stuck on "22 jobs are issued and waiting to run" #1069

Closed: kodingkoning closed this issue 1 year ago

kodingkoning commented 1 year ago

I am running Cactus with 27 fungal species, and I can't get it to complete. It keeps stopping after printing a message about disk usage and then reporting that it has jobs waiting to run. I'm on a Slurm cluster, and there aren't any more jobs queued to run. No files in the working directory have been modified since the successful paffy run and the disk usage message.

I did run RepeatMasker on my input. This is the command I used for cactus: cactus --batchSystem slurm --consCores -1 ./${WORKING_DIR} ${input_file} ${output_file}

Any ideas for why it's getting stuck at this point? Is the disk usage message a warning or an error, and is it causing the failure?

For context, it is now 12:54 in my timezone, so it has been about an hour since I got the "used more disk than requested" message.

The last few lines of my output:

[2023-06-21T11:47:11-0700] [MainThread] [I] [toil-rt] 2023-06-21 11:47:11.361958: Running the command: "paffy add_mismatches -i /tmp/0273a632038555d8ac68c2f692de3161/ef80/0ae4/tmpt8ymbz44/N_sitophila_0_N_crassa_0.paf /tmp/0273a632038555d8ac68c2f692de3161/ef80/0ae4/tmpt8ymbz44/N_sitophila_0.fa /tmp/0273a632038555d8ac68c2f692de3161/ef80/0ae4/tmpt8ymbz44/N_crassa_0.fa"
[2023-06-21T11:47:16-0700] [MainThread] [I] [toil-rt] 2023-06-21 11:47:16.173441: Successfully ran: "paffy add_mismatches -i /tmp/0273a632038555d8ac68c2f692de3161/ef80/0ae4/tmpt8ymbz44/N_sitophila_0_N_crassa_0.paf /tmp/0273a632038555d8ac68c2f692de3161/ef80/0ae4/tmpt8ymbz44/N_sitophila_0.fa /tmp/0273a632038555d8ac68c2f692de3161/ef80/0ae4/tmpt8ymbz44/N_crassa_0.fa" in 4.8093 seconds
[2023-06-21T11:58:26-0700] [MainThread] [I] [toil-rt] 2023-06-21 11:58:26.692870: Successfully ran: "lastz S_fusiger_0.fa[multiple][nameparse=darkspace] P_blakesleeanus_0.fa[nameparse=darkspace] --format=paf:minimap2 --step=1 --ambiguous=iupac,100,100 --ydrop=3000" in 1775.4979 seconds
[2023-06-21T11:58:26-0700] [MainThread] [I] [toil-rt] 2023-06-21 11:58:26.693709: Running the command: "paffy add_mismatches -i /tmp/0273a632038555d8ac68c2f692de3161/5da1/e4e1/tmpuiuwxgz4/S_fusiger_0_P_blakesleeanus_0.paf /tmp/0273a632038555d8ac68c2f692de3161/5da1/e4e1/tmpuiuwxgz4/S_fusiger_0.fa /tmp/0273a632038555d8ac68c2f692de3161/5da1/e4e1/tmpuiuwxgz4/P_blakesleeanus_0.fa"
[2023-06-21T11:58:34-0700] [MainThread] [I] [toil-rt] 2023-06-21 11:58:34.022662: Successfully ran: "paffy add_mismatches -i /tmp/0273a632038555d8ac68c2f692de3161/5da1/e4e1/tmpuiuwxgz4/S_fusiger_0_P_blakesleeanus_0.paf /tmp/0273a632038555d8ac68c2f692de3161/5da1/e4e1/tmpuiuwxgz4/S_fusiger_0.fa /tmp/0273a632038555d8ac68c2f692de3161/5da1/e4e1/tmpuiuwxgz4/P_blakesleeanus_0.fa" in 7.3267 seconds
[2023-06-21T11:58:34-0700] [Thread-5 (statsAndLoggingAggregator)] [W] [toil.statsAndLogging] Got message from job at time 06-21-2023 11:58:34: Job used more disk than requested. For CWL, consider increasing the outdirMin requirement, otherwise, consider increasing the disk requirement. Job files/for-job/kind-run_lastz/instance-100lnwug/cleanup/file-97f083301bce41268658fd1a1b223b63/stream used 127.08% disk (293.5 MiB [307707904B] used, 230.9 MiB [242136544B] requested).
[2023-06-21T12:21:00-0700] [MainThread] [I] [toil.leader] 0 jobs are running, 22 jobs are issued and waiting to run

Update:

I realized that --consCores -1 might not be a good value, so I changed it to the number of cores on the nodes I'm using. I tried restarting the job, and also running it in a new location with an increased --defaultDisk. Looking a little ways back in the output, I see this error:

sbatch: error: Invalid numeric value "-1" for --cpus-per-task.

--cpus-per-task isn't a valid cactus argument, so I don't know how to change it; it seems to apply to the jobs that cactus submits to Slurm.

glennhickey commented 1 year ago

A couple things:

Unfortunately, I don't think you can change --consCores and then run --restart and have it take effect. Only the Toil-specific options work in this case. (I agree it would be really nice to be able to do this sort of thing.)

0 jobs are running, 22 jobs are issued and waiting to run means you have no jobs scheduled or running, so nothing is happening at all. There are two possible reasons I can think of for this: 1) your cluster is too busy, or 2) the jobs in the queue are requesting more memory than any node on your cluster can provide.

If it's 1), you should be able to see that pretty easily: your jobs will show as PD in squeue. If it's 2), and you can share your whole log, we could probably get a better idea of what's going on.

2) wouldn't surprise me: since the last release, Cactus has been rather conservative about estimating memory for each job. I'm hoping to ratchet that down a bit in the next release as I accumulate more test stats. There are some workarounds (like --consMemory), but without the logs it's hard to say whether they apply.

Finally: I've only had access to a Slurm cluster for testing for a few weeks, and the current Cactus release still doesn't work properly on it. The next release (ETA before July) will have much better Slurm support for both the progressive and pangenome versions.

kodingkoning commented 1 year ago

Starting a new run and changing --consCores to 16 worked!

There hadn't been any jobs waiting in the Slurm queue, so I think they had failed to submit because of the -1 value.

Good to know about the Slurm cluster testing!

kodingkoning commented 1 year ago

Closing; the fix was passing a valid value to --consCores.