dib-lab / farm-notes

notes on the farm cluster

[MRG] adding snakemake `--profile` #33

Open mr-eyes opened 2 years ago

mr-eyes commented 2 years ago

Resolves #32

bluegenes commented 2 years ago

thanks Mo! It would be really great to provide an example snakemake rule where you set time, partition, etc. within the snakemake rule, so folks can see how that happens :)

I have some examples over here if you want to just swipe: http://bluegenes.github.io/hpc-snakemake-tips/ -- e.g.:

rule quality_trim:
    input: 
        reads="rnaseq/raw_data/{sample}.fq.gz",
        adapters="TruSeq2-SE.fa",
    output: "rnaseq/quality/{sample}.qc.fq.gz"
    threads: 1
    resources:
        mem_mb=1000,  # memory request in MB
        runtime=10    # runtime request in minutes
    shell:
        """
        trimmomatic SE {input.reads} {output} \
        ILLUMINACLIP:{input.adapters}:2:0:15 \
        LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25    
        """

bluegenes commented 2 years ago

One more thought -- I see the default jobs is 100 and default partition is med2 -- can we change these to follow our recommended queue usage?

Options: default to low2 to keep default jobs at 100, or keep med2 but set default jobs <= 30. Alternatively (or in addition), you can add resources: [cpus=30, mem_mb=350000] to limit total CPU and memory allocation. The one caveat is that we don't need these limits for low2 or bml, so they may be annoying to have in the cluster profile when running on those queues.
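
For concreteness, the resource-limit option could look something like this in the profile's config.yaml (jobs and resources are standard snakemake profile keys; the numbers just mirror the ones above and aren't necessarily what this PR should use):

# sketch of a profile config.yaml with the limits discussed above
jobs: 30                              # cap on concurrently submitted jobs
resources: [cpus=30, mem_mb=350000]   # global cap across running jobs, not a per-job request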

SichongP commented 2 years ago

A little trick that worked for me is using cpus_med2 and cpus_bmm to separate resource use on different partitions. Then I only set resource limits for the med2 and bmm partitions using resources: [cpus_med2=30, cpus_bmm=30]. This way snakemake will limit resource usage on the medium-priority partitions but won't restrict usage of the low-priority partitions.

Of course, you will have to set cpus_med2 or cpus_bmm in the resources keyword of each rule instead of the default cpus parameter.
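
For example, a rule that should run on bmm might request the partition-specific resource like this (the rule name, file paths, values, and command are all made up for illustration):

rule big_mem_step:
    input: "rnaseq/quality/{sample}.qc.fq.gz"
    output: "rnaseq/bigmem/{sample}.out"
    threads: 16
    resources:
        cpus_bmm=16,    # counts against the cpus_bmm=30 limit set above
        mem_mb=120000,
        time_min=720
    # 'some_big_mem_tool' below is a placeholder, not a real program
    shell: "some_big_mem_tool {input} > {output}"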

As a bonus, you can use this function to automate which partition snakemake should submit your job to:

def getPartition(wildcards, resources):
    # Determine the partition for each rule based on the resources it requests
    for key in resources.keys():
        if 'bmm' in key and int(resources['cpus_bmm']) > 0:
            return 'bmm'
        elif 'med' in key and int(resources['cpus_med2']) > 0:
            return 'med2'
    # nothing partition-specific requested: pick a low-priority partition
    if int(resources['mem_mb']) / int(resources['cpus']) > 4000:
        return 'bml'   # more than ~4 GB per CPU -> big-mem low-priority partition
    else:
        return 'low2'

And then in the rule definition:

...
params: partition=getPartition
...
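
The profile's cluster submission command can then pick the partition up from params. A sketch of what that line might look like in the profile's config.yaml (the exact sbatch flags and resource names will depend on the profile in this PR):

cluster: "sbatch -p {params.partition} -t {resources.time_min} -c {threads} --mem={resources.mem_mb} -J {rule}"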

In my profile, I set the following default resources:

default-resources: [cpus_bmm=0, cpus_med2=0, cpus=1, mem_mb_bmm=0, mem_mb_med2=0, mem_mb=2000, time_min=120, node=1, task=1, download=0]

mr-eyes commented 2 years ago

One more thought -- I see the default jobs is 100 and default partition is med2 -- can we change these to follow our recommended queue usage?

Options: default to low2 to keep default jobs at 100, or keep med2 but set default jobs <= 30. Alternatively (or in addition), you can add resources: [cpus=30, mem_mb=350000] to limit total CPU and memory allocation. The one caveat is that we don't need these limits for low2 or bml, so they may be annoying to have in the cluster profile when running on those queues.

Thanks, @bluegenes, for the suggestions. I have edited the default partition parameter. I don't think setting the default mem_mb to 350 GB is a good idea, because that will consume a lot of memory for the total running jobs on default parameters. Same with the CPUs. What do you think?

mr-eyes commented 2 years ago

A little trick that worked for me is using cpus_med2 and cpus_bmm to separate resource use on different partitions.

That's a cool workaround, thanks for sharing! I think controlling the default parameters for each partition separately can also work using Python functions with the partition name as input.
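
One way to read that suggestion (everything below is a hypothetical sketch, not code from this PR) would be a small lookup of per-partition defaults that rules can call into:

# hypothetical per-partition defaults, for illustration only
PARTITION_DEFAULTS = {
    "med2": {"cpus": 4, "mem_mb": 16000},
    "bmm":  {"cpus": 4, "mem_mb": 120000},
    "low2": {"cpus": 1, "mem_mb": 2000},
}

def default_mem(partition):
    # look up the default memory for a partition, falling back to the low2 value
    return PARTITION_DEFAULTS.get(partition, PARTITION_DEFAULTS["low2"])["mem_mb"]

# in a rule: resources: mem_mb=lambda wildcards: default_mem("med2")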

bluegenes commented 2 years ago

I don't think setting the default mem_mb to 350 GB is a good idea, because that will consume a lot of memory for the total running jobs on default parameters. Same with the CPUs. What do you think?

As I've used it, resources at the top level doesn't actually allocate that memory (or CPUs, etc.); it just limits the total amount that can be in use at once. The resources within each rule does request that particular amount of memory, etc., as does default-resources, which is used to fill in resources for rules that are missing any of the default resource parameters. https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#resources
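
To make the distinction concrete, here is a sketch of the three places a resource value can appear (the cap and default numbers come from this thread; the rule name, file, and per-rule values are just illustrative):

# 1. command line / profile: a global cap, nothing is reserved
#        snakemake --resources cpus=30 mem_mb=350000
# 2. profile default-resources: filled in for rules that don't set them,
#    and requested from the scheduler for each such job
#        default-resources: [cpus=1, mem_mb=2000, time_min=120]
# 3. per-rule resources: what this particular job requests from the scheduler
rule example_rule:
    output: "example.txt"
    resources:
        cpus=4,
        mem_mb=16000
    shell: "touch {output}"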

That's a cool workaround, thanks for sharing! I think controlling the default parameters for each partition separately can also work using Python functions with the partition name as input.

This sounds like an excellent workaround. If we can set limits for the med and high partitions by default and no limits for low, that would be really helpful. Of course, for rare cases (deadlines, huge jobs, etc.), users can override the limits by setting different ones on the command line, e.g. with --resources mem_mb=XX.
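
Concretely, an override might look like this (the profile name and the numbers are made up; explicit command-line options take precedence over the values in the profile):

# temporarily raise the global caps for a deadline run -- values are illustrative
snakemake --profile farm --resources mem_mb=500000 cpus=60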

ctb commented 2 years ago

this is all Greek to me. Maybe we need (or could use) a lab meeting tutorial/demo on cool farm/snakemake hacks...

bluegenes commented 2 years ago

this is all Greek to me. Maybe we need (or could use) a lab meeting tutorial/demo on cool farm/snakemake hacks...

😂 I ran an ILLO on farm/snakemake (w/ profiles and resource-limitation hacks!) back in Aug 2020, but we could do another, up-to-date one? @mr-eyes, interested in doing this with me? Partition-specific allocation using this profile is already making my life better! @SichongP, I would also love your feedback on what we come up with, if you have time, in case you have more/different tricks you use.

Back when profiles were newer, the hard part was figuring out how to introduce them without leaving behind folks who are newer to snakemake. But now I think profile setup is something we should just help everyone do as soon as possible, since it makes so many things easier (and doesn't add much complication, aside from setup).

ILLO from 8/24/2020: http://bluegenes.github.io/hpc-snakemake-tips/. My practices have changed a little since then, but not a ton. For the next one, I think I would start with profiles and assume snakemake conda environment management :)

mr-eyes commented 2 years ago

@mr-eyes, interested in doing this with me?

Sure!