broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

cromwell is limiting job scheduling #5218

Open ghost opened 4 years ago

ghost commented 4 years ago

We are running joint-discovery-gatk4 on SLURM with Cromwell 46.

Only a small fraction of the shards are being processed concurrently.

What adjustments can be made to process more or all shards concurrently?

Running joint-discovery-gatk4 with 50 GVCFs completed successfully in 12 hours.

Running joint-discovery-gatk4 with 100 GVCFs has been running for over 3 days.

Given that we plan to scale this to thousands of samples, this will lead to unacceptable runtimes.

This is the WDL that is being used: https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4.wdl

Intervals are the same as those found here: https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4.hg38.wgs.inputs.json

Test WGS GVCF files (average size = 9G) were generated via HaplotypeCaller 3.5-0 with the parameter emitRefConfidence=GVCF.

In the cromwell config file, concurrent-job-limit is set to 1000.

To verify that the setting is being honored, it was set to 2, and exactly 2 concurrent Cromwell jobs were observed.

Turning call caching on or off, as well as adjusting batch_size, did not affect this issue.

All other throttling settings are set to default settings.
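
For reference, the limit is set in the backend provider section of the configuration, roughly like this (a minimal sketch; the provider name slurm and the omitted keys stand in for our actual backend definition):

backend {
  default = slurm
  providers {
    slurm {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        # Maximum number of jobs Cromwell will run concurrently on this backend
        concurrent-job-limit = 1000
        # submit, kill, check-alive and filesystems settings omitted
      }
    }
  }
}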

Thank you for your help.

alexiswl commented 4 years ago

Theory 1: I've also seen this issue. I've set our concurrent-job-limit parameter in the backend to 150 and yet the number of concurrent jobs (including those pending) seems to stay at around 60-65.

Could the other 90 be from tasks that still need to be 'ticked off' as complete?

My suspicion is that the check-alive parameter in the config is set to squeue -j ${job_id}. On our cluster, completed jobs remain queryable for around 3 hours, so commands like scontrol show job can still be used to see job metadata and find logs. That is useful, but it means squeue -j ${job_id} still returns success well and truly after the job has completed or failed.
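
If that is the cause, one thing worth trying (a sketch only, and it assumes Cromwell treats a non-zero exit status from check-alive as "job no longer alive") is to make the check depend on the job actually being in an active state:

check-alive = "squeue --noheader -j ${job_id} --states=PENDING,RUNNING,SUSPENDED,COMPLETING | grep -q ${job_id}"

That way the check fails once the job has left the queue, instead of succeeding for as long as Slurm still remembers the job ID.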

Could you try massively increasing the job limit (to say 10000) and see if that changes anything?

Theory 2: Your configuration file may need a scale-up - it may be that the number of allowed system I/O requests needs increasing:

system {
  io {
    number-of-requests = 100000
    per = 100 seconds
    number-of-attempts = 50
  }
}

This allows Cromwell to make 1,000 I/O requests per second. For some of those batch-calling jobs with many VCF inputs, the server may simply be taking a while to set up each task.

Theory 3: Your duplication-strategy is causing the lag. Can you confirm that, under providers.slurm.filesystems, you have hard-link or soft-link as the top priority over copy for both the localization and caching settings?
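
For reference, the block I mean looks roughly like this (a sketch only; it sits inside the slurm provider's config section, and the filesystem name may differ on your setup). Having copy listed first in either array would force full copies of every localized input:

filesystems {
  local {
    localization = [ "hard-link", "soft-link", "copy" ]
    caching {
      duplication-strategy = [ "hard-link", "soft-link", "copy" ]
    }
  }
}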

Alexis.

AlesMaver commented 4 years ago

We are facing a similar issue when using the SLURM backend. It appears as if Cromwell is capping the number of concurrent jobs at a value below the one specified by the concurrent-job-limit parameter. For example, when running a scatter task with 25 jobs, only 6 are started, and the rest are started only after those complete (the resources on the server are sufficient to run all 25). I have looked at all the possible causes that come to mind, including:

  1. Using the less CPU-intensive call-caching strategy ("fingerprint"), which checks for a call-cache match almost instantly (the settings for items 1 and 2 are sketched in the config example after this list).
  2. Assigning all jobs a default hog group value "static" and setting the hog factor to 1.
  3. Setting the concurrent-job-limit to a very high value (2000 in our case).
  4. I do not think I/O is the problem, as the jobs that do start are submitted almost instantly (including call-caching checks and hard-linking), and Cromwell then holds the rest until the first ones are complete. The CPU and RAM utilization of Cromwell is low at all times.
  5. This is seen with both single and multiple workflows, so max-concurrent-workflows does not appear to be the problematic setting here.
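
For items 1 and 2, these are roughly the settings I mean (a sketch from memory, so the key names are worth double-checking against the reference configuration shipped with your Cromwell version):

system {
  hog-safety {
    # A hog factor of 1 effectively disables per-hog-group throttling
    hog-factor = 1
    # Workflow option used to assign jobs to a hog group ("static" in our case)
    workflow-option = "hogGroup"
  }
}

and, inside the slurm provider's filesystems.local.caching block:

hashing-strategy = "fingerprint"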

Despite trying everything, we still only see about 25% of jobs being run concurrently. Is there another setting I am missing? Any input is much appreciated - thank you!

alexiswl commented 4 years ago

Hi Ales, in my case it was this bug here (that I've filed and is yet to be attended to): https://broadworkbench.atlassian.net/browse/BA-6147

Run a du -hs on the directory and check the size. If it's ridiculously large, it could be that the reference files are being copied in the scatter-gather rather than hard-linked or soft-linked. This is meant to be resolved with cached-copy for shared file systems, but that doesn't appear to work, particularly if the reference files are on a separate mount point from the working Cromwell directory.
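
Concretely, something along these lines will show it (illustrative paths and thresholds only; du counts each inode once, so hard-linked references appear once while copies multiply the total):

# Total apparent size of the execution tree
du -sh cromwell-executions/

# Large localized files with a link count of 1 were copied rather than hard-linked
find cromwell-executions -type f -size +1G -links 1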

If you do find that this is the case, a workaround I found is to create a step at the start of the workflow that takes in each of these large files and merely runs a cp to produce them as outputs. Then, rather than using the original reference arguments in the rest of the workflow, use the outputs from that first cp step. That will ensure that all of the reference files in the scatter-gather are hard-linked rather than copied.
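
A rough sketch of what I mean (hypothetical task and input names, written against WDL 1.0; because the copies land in this task's own execution directory, downstream calls can localize them by hard link):

version 1.0

# Hypothetical staging task: copy the large reference inputs once, so that
# scattered downstream calls hard-link this task's outputs instead of
# repeatedly copying the originals.
task StageReference {
  input {
    File ref_fasta
    File ref_fasta_index
  }
  command <<<
    cp ~{ref_fasta} .
    cp ~{ref_fasta_index} .
  >>>
  output {
    File staged_fasta = basename(ref_fasta)
    File staged_fasta_index = basename(ref_fasta_index)
  }
  runtime {
    cpu: 1
    memory: "2 GB"
  }
}

In the workflow body you would call StageReference once and pass StageReference.staged_fasta (rather than the original reference input) into the scattered calls.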

Kind regards, Alexis.

AlesMaver commented 4 years ago

Dear Alexis, many thanks for the suggestion. I have checked, and the files seem to be correctly hard-linked in our case, and the folder sizes are as they should be. We have about 1 TB of reference files, and the cromwell-executions directory with several workflow folders is only about 1.2 TB in size, so it appears that the files are not being copied but correctly hard-linked. I have checked again, and the cap seems to sit at about 35 concurrently submitted tasks, even when several workflows are submitted, each set to scatter about 20-30 jobs. When running jobs outside Cromwell server mode, we usually have several jobs in the "pending" state on SLURM, but Cromwell never submits more than 35 at once and the jobs never reach the pending list. It is almost as if there is some kind of hard limit on the number of submitted jobs that is not controlled by concurrent-job-limit. Thanks again! Best, Ales

alexiswl commented 4 years ago

Thanks for this Ales,

I wonder if it's a slurm thing? https://slurm.schedmd.com/resource_limits.html

Some key things to look at from here: https://slurm.schedmd.com/slurm.conf.html

MaxJobCount - the number of jobs Slurm keeps in its active job table; this includes recently finished jobs, and new submissions won't even go into pending once it is reached.

MaxSubmitJobs - the maximum number of jobs that can be pending or running at any time; this can also be set per user.

Could Cromwell be continuously trying to submit jobs and waiting until a new slot becomes available?
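
If you want to rule these out quickly, something like the following should show the limits (the sacctmgr query only works where Slurm accounting is enabled, and the exact field names may vary between versions):

# Cluster-wide active-job-table limit from slurm.conf
scontrol show config | grep -i MaxJobCount

# Per-association / per-user limits (the MaxSubmit column corresponds to MaxSubmitJobs)
sacctmgr show assoc where user=$USER format=User,Account,MaxJobs,MaxSubmit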

AlesMaver commented 4 years ago

Dear Alexis, many thanks - I really appreciate the very useful input! I asked the team operating the SLURM server and they say that no such limits are in effect. The number of permitted jobs is oddly specific: it stays at a maximum of 37 jobs (either pending or running). It really appears that Cromwell is doing something to stop further submission of jobs (they are tracked as "Running" in Cromwell, but no call within a job takes place until slots become available). Once the submitted jobs exceed this cap, the Cromwell server only generates this log:

2020-07-08 17:13:15,328 INFO  - MaterializeWorkflowDescriptorActor [UUID(9db83645)]: Parsing workflow as WDL 1.0
2020-07-08 17:13:16,442 INFO  - MaterializeWorkflowDescriptorActor [UUID(9db83645)]: Call-to-Backend assignments: FastqToVCF.VariantFiltrationSNP -> slurm

It then waits in this state until another job finishes, without even checking whether the call is cached or anything else. Please note that the permitted number of concurrent workflows is set to a value higher than the number of workflows we usually submit.

One interesting observation - if I stop the Cromwell server process (Control-C), all the jobs that were not previously submitted to SLURM get submitted immediately (as if Cromwell were constantly blocking the submission for an unclear reason).

Any input is, again, very much appreciated! Thanks a lot.

P.S. I just wanted to add that starting a second server process and submitting tasks to it allows more jobs to be submitted, so it is very likely that the SLURM system is not the one limiting submissions.

alexiswl commented 3 years ago

Hi Ales,

I came across this problem again when running through AWS ParallelCluster with a SLURM backend. It wasn't the file system in this case; with a limited head node, I found only 8 jobs were running at a time. Because sbatch --wait keeps a process alive on the submit host for the lifetime of each job, the head node itself becomes the bottleneck. I removed the --wait from the sbatch command in the backend's submit configuration, and that let all the jobs run at once.
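
For anyone else hitting this: the SLURM example configuration in the Cromwell docs passes --wait to sbatch. A sketch of a submit block without it (keep whatever time/CPU/memory flags your existing submit command already uses; the only change is dropping --wait, and Cromwell then tracks the job via job-id-regex and check-alive):

submit = """
    sbatch \
      -J ${job_name} \
      -D ${cwd} \
      -o ${out} \
      -e ${err} \
      --wrap "/bin/bash ${script}"
"""
job-id-regex = "Submitted batch job (\\d+).*"
check-alive = "squeue -j ${job_id}"
kill = "scancel ${job_id}"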

Alexis.

AlesMaver commented 3 years ago

Dear Alexis! It works, it works, it works! Removing the --wait from sbatch unblocks all jobs! Thank you so much for taking the time to share the solution, can’t tell you how much I appreciate it! Thank you😊

alexiswl commented 3 years ago

Hi Ales,

This really brings a smile to my face to see we've finally solved it!!

Have a great day!

Alexis

AlesMaver commented 3 years ago

This was such a frustrating issue for me and I spent days debugging it, so I owe you a drink or two for sure! ;) Wishing you a great day too and thanks so much for sharing the solution😃