lauragails closed this issue 1 year ago
We do something like this already for slurm, but we don't have an lsf cluster for testing, so it's really useful to get feedback like this. I've added a patch, but I don't want to merge until it has been tested.
Could you test this by editing the ACCOUNT line in variables.env to change it from 100humans to your account name?
In variables.env, you might also want to change: -gpu "mode=exclusive_process", but check with your HPC sysadmin if you're unfamiliar with these arguments.

Absolutely! Will let you know how it goes.
We're closer but not there yet. After running I see
resources: mem_mb=192000, disk_mb=144043, tmpdir=/tmp, partition=$PARTITION, threads=1, out=./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.out, err=./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.err, extra=, account=account-name
I made the changes in variables.env already, including commenting out line 9, but the -P flag isn't recognized. I have to go to a seminar but will write further when I get back!
Before we dig deeper, just making sure:
Your profiles/lsf/config.yaml looks exactly like this:
reason: true
rerun-incomplete: true
keep-going: true
printshellcmds: true
local-cores: 4
max-threads: 256
jobs: 500
max-jobs-per-second: 1
use-conda: true
conda-frontend: mamba
latency-wait: 120
use-singularity: true
singularity-args: '--nv '
cluster: bsub -cwd
  -P {resources.account}
  -q {resources.partition}
  -n {threads}
  -M {resources.mem_mb}
  -o {resources.out}
  -e {resources.err} {resources.extra}
default-resources:
  - account='$ACCOUNT'
  - partition='$PARTITION'
  - tmpdir=system_tmpdir
  - threads=1
  - mem_mb=8000*threads
  - out='./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.out'
  - err='./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.err'
  - extra=''
And the top of variables.env looks like:
export TMPDIR=/tmp # or your scratch dir
export PARTITION=compute # or your default queue
export ACCOUNT=100humans # or your account name
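Before resubmitting, it may help to confirm that these variables are actually exported and non-empty in the shell that launches snakemake. The following is a minimal sketch, not part of the workflow: `check_var` is a hypothetical helper, and the default `ENV_FILE` path is assumed from the paths used elsewhere in this thread.

```shell
# Sketch: confirm that variables.env exports the values the profile relies on.
# ENV_FILE and check_var are illustrative names, not part of the workflow.
ENV_FILE="${ENV_FILE:-workflow/variables.env}"

check_var() {
    # $1 = variable name. printenv only sees exported variables,
    # so a variable that is set but not exported is reported too.
    value=$(printenv "$1") || { echo "NOT EXPORTED: $1" >&2; return 1; }
    [ -n "$value" ] || { echo "EMPTY: $1" >&2; return 1; }
    echo "$1=$value"
}

if [ -f "$ENV_FILE" ]; then
    . "$ENV_FILE"
    check_var ACCOUNT
    check_var PARTITION
    check_var TMPDIR
fi
```

If ACCOUNT still prints 100humans, the edit to variables.env has not been picked up by the current shell.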
There's one more place where bsub is called: when launching the top level job submission scripts that launch snakemake (e.g., process_smrtcells.lsf.sh). We might need to add the params there as well, like:
#!/bin/bash
#BSUB -cwd
#BSUB -P account-name
#BSUB -L /bin/bash
#BSUB -q default
#BSUB -n 4
#BSUB -o ./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.out
#BSUB -e ./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.err
If you make this change and it still doesn't work, can you send me (either here or via email) any log files generated by the failed run? If it generates a lot of logs, you can send a representative example, but I'd like to see at least the snakemake log and one of the failed bsub submission logs.
workflow/process_smrtcells.lsf.sh
#!/bin/bash
#BSUB -P acc_name
#BSUB -L /bin/bash
#BSUB -q premium
#BSUB -n 4
#BSUB -o ./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.out
#BSUB -e ./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.err
# USAGE: bsub workflow/process_smrtcells.lsf.sh
# set umask to avoid locking each other out of directories
umask 002
# get variables from workflow/variables.env
source workflow/variables.env
# execute snakemake (unlock added by Laura)
snakemake \
--profile workflow/profiles/lsf \
--snakefile workflow/process_smrtcells.smk
workflow/profiles/lsf/config.yaml
reason: true
rerun-incomplete: true
keep-going: true
printshellcmds: true
local-cores: 4
max-threads: 256
jobs: 500
max-jobs-per-second: 1
use-conda: true
conda-frontend: mamba
latency-wait: 120
use-singularity: true
singularity-args: '--nv '
cluster: bsub -cwd
  -P {resources.account}
  -q {resources.partition}
  -n {threads}
  -M {resources.mem_mb}
  -o {resources.out}
  -e {resources.err} {resources.extra}
default-resources:
  - account='$ACCOUNT'
  - partition='$PARTITION'
  - tmpdir=system_tmpdir
  - threads=1
  - mem_mb=8000*threads
  - out='./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.out'
  - err='./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.err'
  - extra=''
workflow/variables.env
export TMPDIR=/path/to/tmpdir
export PARTITION=premium
export ACCOUNT=acc_name
export SINGULARITY_TMPDIR="$TMPDIR"
export SINGULARITY_BIND="$TMPDIR"
# deepvariant make_examples and postprocess_variants require at least avx2
# this can be commented out if your scheduler doesn't constrain nodes in this way
#export DEEPVARIANT_AVX2_CONSTRAINT='--constraint=avx512' # LS COMMENTED
# these are required and apply if cpu_only=False in config.yaml
export DEEPVARIANT_GPU_PARTITION=ml
export DEEPVARIANT_GPU_EXTRA='--gpus=1'
# this is optional and applies if cpu_only=True in config.yaml
export DEEPVARIANT_CPU_EXTRA='--exclusive'
representative error
[Thu Nov 17 16:00:34 2022]
rule smrtcell_stats_ubam:
input: smrtcells/ready/PBG_3390_0000008833/m00000_000000_000000.hifi_reads.bam
output: samples/PBG_3390_0000008833/smrtcell_stats/m00000_000000_000000.read_length_and_quality.tsv
log: samples/PBG_3390_0000008833/logs/smrtcell_stats/m00000_000000_000000.log
jobid: 51
benchmark: samples/PBG_3390_0000008833/benchmarks/smrtcell_stats/m00000_000000_000000.tsv
reason: Missing output files: samples/PBG_3390_0000008833/smrtcell_stats/m00000_000000_000000.read_length_and_quality.tsv
wildcards: sample=PBG_3390_0000008833, movie=m00000_000000_000000
resources: mem_mb=8000, disk_mb=196609, tmpdir=/tmp, account=$ACCOUNT, partition=$PARTITION, threads=1, out=./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.out, err=./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.err, extra=
(python3 workflow/scripts/extract_read_length_and_qual.py smrtcells/ready/PBG_3390_0000008833/m00000_000000_000000.hifi_reads.bam > samples/PBG_3390_0000008833/smrtcell_stats/m00000_000000_000000.read_length_and_quality.tsv) > samples/PBG_3390_0000008833/logs/smrtcell_stats/m00000_000000_000000.log 2>&1
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Project name must be specified using bsub -P.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Request aborted by esub. Job not submitted.
Error submitting jobscript (exit code 255):
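To pick out a representative log from a large run, one option is to search for the rejection message shown above. This is only a sketch: `find_rejected` is a hypothetical helper, and the two default directories are assumptions about where logs end up (the cluster_logs directory used in this thread, and the .snakemake/log directory where snakemake normally keeps its own logs).

```shell
# Sketch: list log files that contain the esub rejection message.
# find_rejected is an illustrative helper; the directories below are assumptions.
find_rejected() {
    # $@ = files or directories to search recursively
    grep -rl "Request aborted by esub" "$@" 2>/dev/null
}

find_rejected cluster_logs .snakemake/log || echo "no rejected submissions found"
```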
aaaah I bet I know what happened: the list of jobs was made prior to me fixing the things, so those lists weren't remade. That took a while before, and wasn't repeated. It probably has old flags/things on it.
I'm going to start everything from scratch with the new files and will let you know how it goes!
Edit: Just saw your previous message. We can save the message below in case there's still a problem.
Thanks for your patience. We're trying to support clusters that we don't have available for testing.
I'm wondering if the env variables aren't being interpolated.
First checked one of the top level snakemake logs to see how this looks for slurm:
[Mon Nov 14 15:47:26 2022]
rule samtools_fasta:
input: smrtcells/ready/HG005/m64017_200723_190224.hifi_reads.bam
output: samples/HG005/jellyfish/m64017_200723_190224.fasta
log: samples/HG005/logs/samtools/fasta/m64017_200723_190224.log
jobid: 135
benchmark: samples/HG005/benchmarks/samtools/fasta/m64017_200723_190224.tsv
reason: Missing output files: samples/HG005/jellyfish/m64017_200723_190224.fasta
wildcards: sample=HG005, movie=m64017_200723_190224
threads: 4
resources: mem_mb=32000, disk_mb=1000, tmpdir=/scratch, account=$ACCOUNT, partition=$PARTITION, threads=1, out=cluster_logs/slurm-%x-%j-%N.out, extra=
(samtools fasta -@ 3 smrtcells/ready/HG005/m64017_200723_190224.hifi_reads.bam > samples/HG005/jellyfish/m64017_200723_190224.fasta) > samples/HG005/logs/samtools/fasta/m64017_200723_190224.log 2>&1
Submitted job 135 with external jobid 'Submitted batch job 32688205'.
So I still see the variables uninterpolated in the resources (e.g., $ACCOUNT), same as for yours. Next, looking at the job that was spawned by the process above, corresponding to batch job 32688205:
[Mon Nov 14 15:47:44 2022]
rule samtools_fasta:
input: smrtcells/ready/HG005/m64017_200723_190224.hifi_reads.bam
output: samples/HG005/jellyfish/m64017_200723_190224.fasta
log: samples/HG005/logs/samtools/fasta/m64017_200723_190224.log
jobid: 0
benchmark: samples/HG005/benchmarks/samtools/fasta/m64017_200723_190224.tsv
wildcards: sample=HG005, movie=m64017_200723_190224
threads: 4
resources: mem_mb=32000, disk_mb=1000, tmpdir=/scratch, account=100humans, partition=compute, threads=1, out=cluster_logs/slurm-%x-%j-%N.out, extra=
(samtools fasta -@ 3 smrtcells/ready/HG005/m64017_200723_190224.hifi_reads.bam > samples/HG005/jellyfish/m64017_200723_190224.fasta) > samples/HG005/logs/samtools/fasta/m64017_200723_190224.log 2>&1
Activating conda environment: /pbi/flash/wrowell/testing/.snakemake/conda/252443608ffe510fcf2ff978d2d08708
The variables have been interpolated here. That's kind of confusing. I thought maybe it had something to do with single-quoting, but it seems like snakemake is still interpolating env variables in these strings.
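A possible explanation, offered as an assumption rather than something confirmed in this thread: the quoted values reach the submission command as literal text, and a shell expands them only when that command is executed. The sketch below reproduces the effect with echo standing in for bsub, so it runs without LSF. The template string is illustrative and is not how snakemake builds its command internally.

```shell
# Sketch: a literal $ACCOUNT stays unexpanded until a shell evaluates the command.
# echo stands in for bsub; acc_name and premium are the example values from this thread.
export ACCOUNT=acc_name
export PARTITION=premium

# Single quotes keep the dollar signs literal, like the quoted values in config.yaml.
template='echo bsub -P $ACCOUNT -q $PARTITION'

literal=$template             # what a log would show before execution
expanded=$(sh -c "$template") # what a shell produces when it runs the command

printf 'as logged:   %s\n' "$literal"
printf 'as executed: %s\n' "$expanded"
```

If ACCOUNT were not exported in the submitting shell, the executed form would carry an empty value after -P, which could plausibly trigger a "Project name must be specified" rejection.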
What happens if we stop trying to be clever and set account and partition explicitly in workflow/profiles/lsf/config.yaml without using env variables?
...
default-resources:
- account=acc_name
- partition=premium
- tmpdir=system_tmpdir
...
I still got the error when I did that earlier, but it might be because the jobs need to be completely re-made from the get-go. Good idea RE hard-coding this in from square 1, which I'm doing now!
Didn't work.
But stepping back, what benefit does the lsf wrapper get you vs. submitting a local script (i.e., process_smrtcells.local.sh) within an external bsub command (i.e., like I submit any other "regular" piece of software)?
I see from the tutorial that "We recommend at least 80 cores and 1TB RAM for local execution. Local execution will use all available cores."
I'm running this overnight and hopefully it will get picked up!
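For reference, wrapping the local script in a single bsub call could look roughly like the sketch below. The project, queue, and core count are placeholder values, DRY_RUN is a hypothetical switch added for illustration, and no memory request is included because the units and syntax for that are site-specific. With DRY_RUN=1 (the default here) the command is only printed.

```shell
# Sketch: submit the local (non-cluster) workflow script as one LSF job.
# ACCOUNT, QUEUE, CORES and DRY_RUN are illustrative; adjust them for your site.
ACCOUNT="${ACCOUNT:-acc_name}"
QUEUE="${PARTITION:-premium}"
CORES="${CORES:-80}"
DRY_RUN="${DRY_RUN:-1}"

# %J is replaced by LSF with the job ID in the output and error file names.
cmd="bsub -P $ACCOUNT -q $QUEUE -n $CORES -o cluster_logs/local.%J.out -e cluster_logs/local.%J.err bash workflow/process_smrtcells.local.sh"

if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd"
else
    mkdir -p cluster_logs
    $cmd
fi
```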
The benefit of having snakemake submit jobs to the scheduler is that many processes can run in parallel. Each movie alignment job takes ~24 threads, so three movies would use nearly all available cores on that 80 core instance for ~1.5h. It won't matter as much for process_smrtcells, but at the process_sample level, there are a lot of jobs to schedule.
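The core arithmetic behind that estimate can be checked quickly. The numbers are the approximate ones quoted above (80 cores, about 24 threads per alignment), so treat the result as a rough guide only.

```shell
# Sketch: how many ~24-thread alignment jobs fit on an 80-core node at once.
cores=80
threads_per_alignment=24

concurrent=$(( cores / threads_per_alignment ))   # integer division
used=$(( concurrent * threads_per_alignment ))

echo "concurrent alignments: $concurrent, cores used: $used of $cores"
```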
That being said, we've been running some jobs lately using process_sample.local.sh on a node with 256 cores + 512GB RAM + NVIDIA A30 and they've been completing within a day typically. We tweak a few settings to give a few jobs more threads.
In order for our lsf cluster to accept a job using workflow/process_smrtcells.lsf.sh, I need the flag bsub -P account-name. account-name is static, so I can hard-code this.
To fix this, I went into profiles/lsf/config.yaml and added -P 'account-name' in the "cluster:" section.
This isn't being recognized though. Is there something else I can do?
The good news: process_smrtcells.local.sh does appear to be working, however I stopped it because I'd prefer to use lsf.
Thank you for your guidance here!