Improved ergonomics and control

mbhall88 commented 2 years ago

Hi Slurm team!

This is a somewhat big PR. Maybe you don't want the stuff in it, but I thought I would PR it in case it is of interest. No worries if you don't want it, I can just maintain my own fork :)

There are effectively four main additions/changes

1. Allow a human-friendly time specification

I find the slurm time quite cumbersome. (From the added docs)

We support specifying the time for a rule in a "human-friendly" format. For example, a rule with a time limit of 4 hours and 30 minutes can be specified as time="4h30m".

Supported (case-insensitive) time units are:

w: week
d: day
h: hour
m: minute
s: second

2. A class for dealing with job names and log paths.

This borrows somewhat, but extends, the functionality in the LSF profile. It adds two new options to the config: cluster_jobname and cluster_logpath. These are essentially patterns that the user can specify (or not) to dictate how the jobs are named by sbatch, and the path for the output and error streams of the job. (From the added docs)

For job name and log paths we provide a custom pattern syntax.

%r: Rule name. If it is a group job, the group ID will be used instead.
%i: Snakemake job ID.
%w: Wildcards string. e.g., wildcards A and B will be concatenated as A=<val>.B=<val>
%U: A random universally unique identifier (UUID).
%S: A shortened version of %U. For example, 16fd2706-8baf-433b-82eb-8c7fada847da would become 16fd2706.
%T: The Unix timestamp (rounded to an integer).

This is in addition to the Slurm filename patterns for the logpath.

As an example, I have my logpath pattern set to logs/cluster/%r/%w/%T-jobid-%j. An example (error) log file is logs/cluster/minimap2_align/sample=GE044_56_5D/1652837847-jobid-35695070.err

3. I've added some common snakemake options to the cookiecutter JSON

Closes #80. I also set the defaults to the snakemake defaults.

--use-singularity
--use-conda
--printshellcmds
--latency-wait
--max-jobs-per-second
--max-status-checks-per-second
--restart-times

4. Support for passing extra options to `sbatch` via the `slurm` parameter in a rule's `resources`

Closes #86

(From the added docs)

Resources not listed in RESOURCE_MAPPING can also be specified with the special slurm parameter to resources. For example, to specify a specific QoS and 2 GPU resources for a rule gpu_stuff

rule gpu_stuff:
    resources:
        time="12:00:00",
        mem_mb=8000,
        partition="gpu",
        slurm="qos=gpuqos gres=gpu:2"

Note: slurm must be a space-separated string of the form <option>=<value>. The <option> names must match the long option name of sbatch (see list here; can be snake_case or kebab-case). Flags (i.e., options that do not take a value) should be given without a value. For example, to use the wait flag, you would pass slurm="qos=gpuqos wait".

I merged in the recent commits, so I hope I haven't somehow deleted anything. I have been running a pipeline with a sidecar on my slurm cluster using this profile and it all seems to working well.

Anyway, hope you find this PR useful.

maarten-k commented 2 years ago

I like the improvements. Something to improve is the defaults for job names and log paths when setting up the profile. Setting these values is not straightforward and leaving them blank will not enhance your debugging experience.

mbhall88 commented 2 years ago

Happy to discuss and decide on sensible defaults here.

I currently use logs/cluster/%r/%w/%T-jobid-%j for my cluster logpath and %r_%w for cluster job name

Examples of each

logs/cluster/minimap2_align/sample=GE044_56_5D/1652837847-jobid-35695070.out
minimap2_align_sample=GE044_56_5D

The reasons for these choices is I normally have pipelines with thousands of jobs and this ensures I don't have thousands of files in a single directory. Having the timestamp at the start of the file name is also useful for easily determining which log file is the most recent when there has been multiple reruns (without having to look at the timestamps with ls)

Other ideas?

maarten-k commented 2 years ago

I am not sure that having a directory with a wildcard name is a good default: I think most workflows have only one slurm output file per wildcard and rule combination (if everything runs smoothly on the first try/please correct me if I'm wrong). Adding this directory might be a bit of cd-ing and ls-ing too much.

Furthermore, I do not like "cluster" in the path since there are probably lots of workflows with a cluster rule in them. You might end up with a file in the logs/cluster/cluster/sample=GE044_56_5D/1652837847-jobid-35695070.out which might be a cause a bit of confusion. I think using "slurm" might be a bit more clear.

I tend to use only the slurm job ID. Timestamp is also present on the filesystem itself and UNIX timestamps are hard to interpret for a human being(computers love them ;))

I (soft) think using: logs/slurm/%r/%j-%w.out which should result to logs/slurm/minimap2_align//1652837847-sample=GE044_56_5D.out

might be a slightly better default. However, deciding what sane defaults are for other people is a tough cookie.

The job name is good. By default squeue displays only 8 characters and adding a postfix like "snakejob" will not help clarity.

mbhall88 commented 2 years ago

All sensible suggestions @maarten-k. I've made those changes in https://github.com/Snakemake-Profiles/slurm/pull/95/commits/f3899a229fc2b3056cf16f0930c3e6d6e3ba0187

percyfal commented 2 years ago

Thanks a lot for this PR @mbhall88. All of what you propose sound like much needed improvements - I was reminded about some of them today when I'm actually teaching snakemake. I'll have a closer look this afternoon but AFAICT from a quick glance everything looks good to go :+1: !

Snakemake-Profiles / slurm