ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

toil issuing too many jobs at a time #1465

Open macmanes opened 2 months ago

macmanes commented 2 months ago

Thanks in advance for the support. I am trying to align approx. 90 mammal genomes on a slurm-enabled cluster. Running like this:

TOIL_SLURM_ARGS="--partition=macmanes --exclude=node116,node144,node145,node146,node147,node115" \
cactus $HOME/jobs $HOME/final_cactus_input.txt mammals2.hal \
--batchSystem slurm \
--batchLogsDir batch-logs --coordinationDir $HOME/cactus_jobs \
--consCores 40 --maxMemory 500G --doubleMem true

I am getting an error in the run_lastz phase of the workflow. I believe this is because too many jobs are being issued: for this dataset, about 46k jobs had been issued at the time of failure.

[2024-08-17T10:17:56-0400] [Thread-2] [E] [toil.lib.retry] Got a <class 'OSError'>: [Errno 7] Argument list too long: 'sacct' which is not retriable according to <function AbstractGridEngineBatchSystem.with_retries.<locals>.<lambda> at 0x7f23d6838dc0>
[2024-08-17T10:17:56-0400] [Thread-2] [E] [toil.batchSystems.abstractGridEngineBatchSystem] GridEngine like batch system failure: [Errno 7] Argument list too long: 'sacct'

Is there any way I can throttle this, for instance to permit only 5k or 10k jobs to be issued at a time?

macmanes commented 2 months ago

I do see that there is a Toil flag, --maxJobs, that might help, but I'm not sure how to pass arguments through to Toil (other than the Slurm args). Something like:

TOIL_ARGS="--maxJobs=5000"??

glennhickey commented 2 months ago

--maxJobs sounds about right. It's an option for any cactus command. See cactus --help:

--maxJobs MAX_JOBS    Specifies the maximum number of jobs to submit to the
                        backing scheduler at once. Not supported on Mesos or
                        AWS Batch. Use 0 for unlimited. Defaults to unlimited.
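Applied to the original command from this issue, the flag would be appended like this (a sketch; 5000 is an arbitrary example value, not a tuned recommendation):

```shell
TOIL_SLURM_ARGS="--partition=macmanes --exclude=node116,node144,node145,node146,node147,node115" \
cactus $HOME/jobs $HOME/final_cactus_input.txt mammals2.hal \
    --batchSystem slurm \
    --batchLogsDir batch-logs --coordinationDir $HOME/cactus_jobs \
    --consCores 40 --maxMemory 500G --doubleMem true \
    --maxJobs 5000
```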

adamnovak commented 2 months ago

It looks like Toil needs to add OSError (with errno 7) to the exception list here that makes us fall back from sacct to scontrol, where we don't list all jobs in the command. We also probably need some machinery to limit the maximum jobs asked about at a time (or maybe just the maximum command line length directly).
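The "limit the maximum jobs asked about at a time" idea above amounts to querying sacct in bounded batches rather than passing every job ID in a single argument list. A minimal sketch of the batching step (batch_ids and the batch size of 2 are illustrative, not part of Toil; each output line could then be passed to one sacct -j call):

```shell
#!/bin/sh
# Join whitespace-separated job IDs from stdin into comma-separated groups of
# at most $1 IDs each, one group per line. Each group stays small enough to be
# passed as a single `sacct -j <group>` argument without hitting the kernel's
# argv length limit (errno 7, "Argument list too long").
batch_ids() {
    xargs -n "$1" | tr ' ' ','
}

# Example with 5 fake job IDs in batches of 2:
printf '%s\n' 101 102 103 104 105 | batch_ids 2
# One would then loop over the lines, e.g.:
#   ... | batch_ids 1000 | while read -r group; do sacct -j "$group" ...; done
```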

But as a workaround, limiting the max jobs in flight ought to work.

macmanes commented 2 months ago

Thanks @glennhickey and @adamnovak. I can confirm that --maxJobs does work.

For the Toil developers: it would be great to be able to change maxJobs after submission, the way Slurm allows for array jobs (scontrol update ArrayTaskThrottle=20 JobId=12345). This would let a user expand and contract the throttle given available cluster resources.