epigen / MrBiomics

MrBiomics - Modules & Recipes augment Bioinformatics for Multi-Omics Analyses at Scale
MIT License

Bump to Snakemake v8 #12

Closed sreichl closed 1 month ago

sreichl commented 3 months ago
burtonjake commented 3 months ago

- ~~Install the latest Snakemake version (document which exactly)~~ 8.16.0
- ~~Run the tutorial locally with it~~
- ~~Get to run the tutorial using SLURM (executor plugin) on the CeMM cluster~~ `pip install snakemake-executor-plugin-slurm` N.B. Not installable via mamba/conda.
- ~~Document what to do to get it to run~~

  1. Create a folder to store the workflow configuration adjacent to the Snakefile: `mkdir snake_slurm`.
  2. In that directory, create a file called `config.v8+.yaml`.
  3. Fill out the configuration. It can be adapted from this template:

     ```yaml
     # Use spaces instead of tabs :'(
     # Note that raw string arguments need double quotes (see slurm_extra).
     # Remember that the SLURM partition and qos must match on the CeMM cluster.
     executor: slurm
     jobs: 100
     default-resources:
       slurm_account: lab_bock
       slurm_partition: tinyq
       runtime: 30 # in minutes
       mem: 2G
       cpus_per_task: 1
       nodes: 1
       slurm_extra: "'--qos=tinyq'" # Note the extra quoting!
     ```

You can also set SLURM resources on a per-rule basis:

```yaml
set-resources:
  myrule:
    slurm_partition: longq
    cpus_per_task: 8
    mem_mb: 14000
```

You can force some rules to run on the login node using a `localrules` directive, which must be located at the top of the Snakefile:

```python
localrules: <rule1>, <rule2>
```


4. Run Snakemake with the workflow profile specified to use SLURM:
`snakemake --workflow-profile snake_slurm`
5. You can view running jobs using an adapted `squeue` call that widens the name and comment fields, so you can see all the information Snakemake adds to identify the jobs:
`squeue -u $USER -o %i,%P,%.10j,%.40k`
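The steps above can be collapsed into a short setup script. This is a sketch: the values are the CeMM examples from the template; adjust account, partition, and qos for your cluster.

```shell
# Create the profile directory next to the Snakefile and write the
# config (values copied from the template above; adjust as needed).
mkdir -p snake_slurm

cat > snake_slurm/config.v8+.yaml <<'EOF'
executor: slurm
jobs: 100
default-resources:
  slurm_account: lab_bock
  slurm_partition: tinyq
  runtime: 30 # in minutes
  mem: 2G
  cpus_per_task: 1
  nodes: 1
  slurm_extra: "'--qos=tinyq'" # Note the extra quoting!
EOF

# Then submit via SLURM (requires snakemake-executor-plugin-slurm):
# snakemake --workflow-profile snake_slurm
```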

>   * [ ]  in general for any executor plugin (akin to [here](https://github.com/epigen/mr.pareto#:~:text=location%20of%20your-,cluster%20profile,-(i.e.%2C%20the)) and [here](https://github.com/epigen/mr.pareto?tab=readme-ov-file#execution))
>   * [ ]  specifically SLURM at CeMM (will replace this [repo](https://github.com/epigen/cemm.slurm.sm)/[section](https://github.com/epigen/mr.pareto?tab=readme-ov-file#cemm-users) of README)
> * [ ]  (Get to) Run the unsupervised_analysis pipeline with test data using SLURM on CeMM cluster
>   
>   * [ ]  change the `partition` from `params` to `resources`
> * [ ]  Document the necessary changes
> * [ ]  Create issue for each MR.P module to address the change to bump it to v8 **or** one central in this repo with a list of all modules like here [switch all visualizations from panels to single plots #2](https://github.com/epigen/mr.pareto/issues/2)
> * [ ]  <add above created issue(s) here>
> * [ ]  adapt mr.pareto README accordingly to reflect Snakemake v8 usage in all regards
sreichl commented 3 months ago

@burtonjake great progress! Please find out and document how to...

burtonjake commented 3 months ago

The SLURM executor maps the generic `threads` and `mem_mb` requirements to SLURM (`threads` -> `cpus_per_task`). https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/slurm.html#ordinary-smp-jobs
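As a sketch, a hypothetical rule (the rule name, files, and `aligner` command are made up) that relies only on these generic resources would be submitted by the plugin with the corresponding `--cpus-per-task` and memory settings, without any `slurm_*` keys in the rule itself:

```python
# Hypothetical Snakefile rule: the slurm executor maps
# threads -> cpus_per_task and mem_mb -> the job's memory request.
rule align:
    input: "sample.fastq"
    output: "sample.bam"
    threads: 8
    resources:
        mem_mb=16000
    shell: "aligner -t {threads} {input} > {output}"
```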


A minimal "cluster profile" to run workflows via SLURM at CeMM is:

```yaml
# Remember that the SLURM partition and qos must match on the CeMM cluster.
executor: slurm
jobs: 100
default-resources:
  # slurm_account, partition, and runtime are required.
  # Match the CeMM intranet: https://cemmat.sharepoint.com/sites/IT-Resources/SitePages/Submitting-Slurm-Jobs.aspx
  slurm_account: lab_bock
  slurm_partition: tinyq
  runtime: 120 # in minutes
  slurm_extra: "'--qos=tinyq'" # Note the extra quoting!
```

It appears that you cannot specify default memory requirements here, as they tend to conflict with existing workflows: Snakemake automatically converts `mem: 2G` to an integer (even if you write `"2G"`), and that datatype does not match workflow files that define it as a string, e.g., `mem: config.get("mem", "1600")`. Therefore, the above is all that is needed to get jobs to run on SLURM.

This default config file is stored in `~/.config/snakemake/<config_name>/`, where `<config_name>` is, for example, `cluster`; it can then be applied with `snakemake --sdm conda --profile cluster`. Note that the config file has to be named `config.<snakemake_supported_version>.yaml`, for example `config.v8+.yaml`.
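A sketch of storing such a minimal profile globally, using `cluster` as an example profile name (contents are the CeMM example values; adjust for your setup):

```shell
# Store the minimal profile under ~/.config/snakemake/<config_name>/
# so it can be used from any workflow directory.
mkdir -p ~/.config/snakemake/cluster

cat > ~/.config/snakemake/cluster/config.v8+.yaml <<'EOF'
executor: slurm
jobs: 100
default-resources:
  slurm_account: lab_bock
  slurm_partition: tinyq
  runtime: 120 # in minutes
  slurm_extra: "'--qos=tinyq'" # Note the extra quoting!
EOF

# Apply it to any workflow:
# snakemake --sdm conda --profile cluster
```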

sreichl commented 3 months ago

Here it seems like you can name and store it wherever you want, as long as you set the environment variable `SNAKEMAKE_PROFILE` accordingly. Maybe I am missing something.

https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/slurm.html#using-profiles

sreichl commented 3 months ago

in the main docs, they explain profiles (a concept new to me as I have been developing without them and my latest Snakemake version is 7.15.2): https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles

burtonjake commented 3 months ago

Seems so.

"Profile has to be given as either absolute path, relative path or name of a directory available in either /etc/xdg/snakemake or /home/jburton/.config/snakemake."

[Works: relative path] `SNAKEMAKE_PROFILE=profiles/testprofile snakemake`
[Works: name of profile in default dir] `SNAKEMAKE_PROFILE=cluster snakemake`

burtonjake commented 3 months ago

There's a potential 'gotcha': if you don't install the full Snakemake with all bells and whistles, some of the MR. PARETO workflows don't work out of the box. For example, if you follow the Snakemake tutorial to get Snakemake on your system, you won't end up with pandas. You can simulate this with `mamba create -c conda-forge -c bioconda -n snakemake8-mini snakemake-minimal`.

My view is that if a workflow depends on a particular Python package [to run the Snakefile], then this should be documented. The Snakemake way to do this is to have a directive at the top of the workflow:

```python
conda:
    "envs/global.yaml"
```

and to add the packages you need to `envs/global.yaml`. These are injected using conda before running the rest of the Snakefile. For example:

```
(snakemake8-mini) [jburton@d001 envs]$ cat global.yaml
channels:
  - conda-forge
  - bioconda
  - nodefaults
dependencies:
  - pandas
```
sreichl commented 3 months ago

with exact versions! (that's a MR.P requirement to increase reproducibility)

sreichl commented 3 months ago

is this also a problem for full Snakemake installations? If no, what are the advantages of minimal installations?

sreichl commented 2 months ago

Instructions

```shell
# install Snakemake 8.20.1 (pinned for reproducibility)
conda create -c conda-forge -c bioconda -n snakemake8_20_1 snakemake=8.20.1

# install the SLURM executor plugin into the Snakemake environment
# https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/slurm.html#
conda install -n snakemake8_20_1 snakemake-executor-plugin-slurm
```

CeMM SLURM repo: v3.0.0 supports all Snakemake versions; below are the relevant config files:

LOG of Snakemake 8 bump:

tasks