epi2me-labs / wf-clone-validation


[Bug]: Add queue executor memory requirement for tasks #14

Closed: multimeric closed this issue 10 months ago

multimeric commented 1 year ago

What happened?

The workflow doesn't define memory requirements for any processes when using an HPC executor (i.e. not local or cloud). This causes Nextflow to submit jobs to the job queue (SLURM in my case) with no memory specification. If the SLURM configuration has a default memory allocation that is too low, this causes all jobs to fail. I suggest setting a default process memory requirement for all executors, not just the local executor.

Operating System

Ubuntu 20.04

Workflow Execution

Command line

Workflow Execution - EPI2ME Labs Versions

No response

Workflow Execution - Execution Profile

Conda

Workflow Version

Current master

Relevant log output

Command wrapper:
  /var/spool/slurmd/job8708479/slurm_script: line 331: /bin/activate: No such file or directory
  slurmstepd: error: Detected 1 oom-kill event(s) in StepId=8708479.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
cjw85 commented 1 year ago

Doing this for "all executors" would add a lot of bloat to the default configs, given the number of executors that Nextflow supports. We have local and AWS contexts set up rather selfishly for our own needs (for use in our CI), though we have discussed removing these.

We should specify limits per process, though this can be tricky when the amount of memory used is difficult to predict ahead of time. Which process is failing with the OOM? I'm intrigued that any process in wf-clone-validation uses a "large" amount of memory; what is the default request in your SLURM configuration?

You can change the requests yourself, without modifying the workflow, by providing an additional configuration file with Nextflow's -c option. You can either set values in the executor scope or use process selectors to target specific processes.
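For example, something along these lines in a custom.config (just a sketch; the memory values are illustrative and 'assembleCore' is a stand-in for whichever process is hitting the limit, so check the workflow's modules for the real names):

  // custom.config: merged on top of the workflow's own settings when passed with -c
  process {
      // submit jobs to SLURM (if your profile doesn't already set this)
      executor = 'slurm'

      // blanket default for any process without a more specific setting
      memory = '8 GB'

      // raise the limit for one process via a name selector
      withName: 'assembleCore' {
          memory = '16 GB'
      }
  }

  // executor scope: settings for the submission machinery itself
  executor {
      queueSize = 50   // cap how many jobs are queued in SLURM at once
  }

Run it as nextflow run epi2me-labs/wf-clone-validation -c custom.config ...; a file given with -c is merged with the workflow's bundled configuration rather than replacing it.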

multimeric commented 1 year ago

Oh, I don't mean adding every executor individually. I just meant that, since 8 GB seems to be a reasonable minimum on the other executors (e.g. cloud), maybe you should set it as the global process default so that HPC users get the benefit of that config as well.
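Roughly what I have in mind, in the workflow's own nextflow.config (just a sketch; 8 GB is only the figure mentioned above):

  process {
      // global default, so queue executors like SLURM also get a sane request
      memory = '8 GB'
  }

Per-process withName selectors (in the bundled config or in a user's -c file) would still take precedence over that blanket value, so anything that genuinely needs more can ask for it.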

Per-process limits would of course be even better, but that requires a lot of effort to calculate.

Our SLURM is configured a bit strangely, such that the default memory is too low to run anything at all (it's only a few MB of RAM). But still, I think other HPC users could benefit from a sensible default here.

cjw85 commented 10 months ago

Before closing this (we're having a bit of a tidy-up): we've recently been having discussions around the resources used by workflows, aiming to make them more transparent and to require less from the user.

We'll hopefully have some new behaviour across all our workflows shortly.