PacificBiosciences / pb-human-wgs-workflow-snakemake

Workflow for the comprehensive detection and prioritization of variants in human genomes with PacBio HiFi reads
BSD 3-Clause Clear License

LSF submission: Project name must be specified using bsub -P. #128

Closed lauragails closed 1 year ago

lauragails commented 1 year ago

In order for our LSF cluster to accept a job submitted with workflow/process_smrtcells.lsf.sh, I need to pass bsub the flag -P account-name. account-name is static, so I can hard-code it.
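
For context, a submission only goes through on our cluster when that flag is present, e.g. something like this (account-name stands in for our real project name):

bsub -P account-name workflow/process_smrtcells.lsf.sh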

To fix this, I went into profiles/lsf/config.yaml and added -P 'account-name' in the "cluster:" section.

This isn't being recognized though. Is there something else I can do?

The good news: process_smrtcells.local.sh does appear to be working, but I stopped it because I'd prefer to use LSF.

Thank you for your guidance here!

williamrowell commented 1 year ago

We do something like this already for slurm, but we don't have an lsf cluster for testing, so it's really useful to get feedback like this. I've added a patch, but I don't want to merge until it has been tested.

Could you test this by:

In variables.env, you might also want to change:

lauragails commented 1 year ago

Absolutely! Will let you know how it goes.

lauragails commented 1 year ago

We're closer but not there yet. After running, I see:

resources: mem_mb=192000, disk_mb=144043, tmpdir=/tmp, partition=$PARTITION, threads=1, out=./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.out, err=./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.err, extra=, account=account-name

I already made the changes in variables.env, including commenting out line 9, but the -P flag still isn't recognized. I have to go to a seminar but will write more when I get back!

williamrowell commented 1 year ago

Before we dig deeper, just making sure:

Your profiles/lsf/config.yaml looks exactly like this:

reason: true
rerun-incomplete: true
keep-going: true
printshellcmds: true
local-cores: 4
max-threads: 256
jobs: 500
max-jobs-per-second: 1
use-conda: true
conda-frontend: mamba
latency-wait: 120
use-singularity: true
singularity-args: '--nv '
cluster: bsub -cwd
              -P {resources.account}
              -q {resources.partition}
              -n {threads}
              -M {resources.mem_mb}
              -o {resources.out}
              -e {resources.err} {resources.extra}
default-resources:
  - account='$ACCOUNT'
  - partition='$PARTITION'
  - tmpdir=system_tmpdir
  - threads=1
  - mem_mb=8000*threads
  - out='./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.out'
  - err='./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.err'
  - extra=''

And the top of variables.env looks like:

export TMPDIR=/tmp  # or your scratch dir
export PARTITION=compute   # or your default queue
export ACCOUNT=100humans   # or your account name
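
With those values exported in the environment where snakemake runs, the cluster command for a default 1-thread job should expand to roughly this at submission time (my expectation, not copied from a log; the shell fills in $ACCOUNT and $PARTITION when the command is executed):

# expected rendering of the cluster: line for a 1-thread, 8000 MB job
bsub -cwd \
     -P $ACCOUNT \
     -q $PARTITION \
     -n 1 \
     -M 8000 \
     -o ./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.out \
     -e ./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.err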

There's one more place where bsub is called: the top-level job submission scripts that launch snakemake (e.g., process_smrtcells.lsf.sh). We might need to add the params there as well, like:

#!/bin/bash
#BSUB -cwd
#BSUB -P account-name
#BSUB -L /bin/bash
#BSUB -q default
#BSUB -n 4
#BSUB -o ./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.out
#BSUB -e ./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.err
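
One caveat worth checking (I don't know how your LSF is configured): on many installations, embedded #BSUB directives are only parsed when the script is read on standard input, so the top-level submission may need to look like this rather than passing the script as an argument:

# feed the script to bsub on stdin so the #BSUB lines are honored
bsub < workflow/process_smrtcells.lsf.sh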

If you make this change and it still doesn't work, could you send me (either here or via email) any log files generated by the failed run? If it generates a lot of logs, you can send a representative example, but I'd like to see at least the snakemake log and one of the failed bsub submission logs.

lauragails commented 1 year ago

workflow/process_smrtcells.lsf.sh

#!/bin/bash
#BSUB -P acc_name
#BSUB -L /bin/bash
#BSUB -q premium
#BSUB -n 4
#BSUB -o ./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.out
#BSUB -e ./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.err

# USAGE: bsub workflow/process_smrtcells.lsf.sh

# set umask to avoid locking each other out of directories
umask 002

# get variables from workflow/variables.env
source workflow/variables.env

# execute snakemake (unlock added by Laura)
snakemake \
    --profile workflow/profiles/lsf \
    --snakefile workflow/process_smrtcells.smk

workflow/profiles/lsf/config.yaml

reason: true
rerun-incomplete: true
keep-going: true
printshellcmds: true
local-cores: 4
max-threads: 256
jobs: 500
max-jobs-per-second: 1
use-conda: true
conda-frontend: mamba
latency-wait: 120
use-singularity: true
singularity-args: '--nv '
cluster: bsub -cwd
              -P {resources.account}
              -q {resources.partition}
              -n {threads}
              -M {resources.mem_mb}
              -o {resources.out}
              -e {resources.err} {resources.extra}
default-resources:
  - account='$ACCOUNT'
  - partition='$PARTITION'
  - tmpdir=system_tmpdir
  - threads=1
  - mem_mb=8000*threads
  - out='./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.out'
  - err='./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.err'
  - extra=''

workflow/variables.env

export TMPDIR=/path/to/tmpdir  
export PARTITION=premium
export ACCOUNT=acc_name
export SINGULARITY_TMPDIR="$TMPDIR"
export SINGULARITY_BIND="$TMPDIR"

# deepvariant make_examples and postprocess_variants require at least avx2
# this can be commented out if your scheduler doesn't constrain nodes in this way
#export DEEPVARIANT_AVX2_CONSTRAINT='--constraint=avx512' # LS COMMENTED

# these are required and apply if cpu_only=False in config.yaml
export DEEPVARIANT_GPU_PARTITION=ml
export DEEPVARIANT_GPU_EXTRA='--gpus=1'

# this is optional and applies if cpu_only=True in config.yaml
export DEEPVARIANT_CPU_EXTRA='--exclusive'

Representative error:

[Thu Nov 17 16:00:34 2022]
rule smrtcell_stats_ubam:
    input: smrtcells/ready/PBG_3390_0000008833/m00000_000000_000000.hifi_reads.bam
    output: samples/PBG_3390_0000008833/smrtcell_stats/m00000_000000_000000.read_length_and_quality.tsv
    log: samples/PBG_3390_0000008833/logs/smrtcell_stats/m00000_000000_000000.log
    jobid: 51
    benchmark: samples/PBG_3390_0000008833/benchmarks/smrtcell_stats/m00000_000000_000000.tsv
    reason: Missing output files: samples/PBG_3390_0000008833/smrtcell_stats/m00000_000000_000000.read_length_and_quality.tsv
    wildcards: sample=PBG_3390_0000008833, movie=m00000_000000_000000
    resources: mem_mb=8000, disk_mb=196609, tmpdir=/tmp, account=$ACCOUNT, partition=$PARTITION, threads=1, out=./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.out, err=./cluster_logs/lsf-$LSB_JOBNAME-$LSB_JOBID-$HOSTNAME.err, extra=

(python3 workflow/scripts/extract_read_length_and_qual.py smrtcells/ready/PBG_3390_0000008833/m00000_000000_000000.hifi_reads.bam > samples/PBG_3390_0000008833/smrtcell_stats/m00000_000000_000000.read_length_and_quality.tsv) > samples/PBG_3390_0000008833/logs/smrtcell_stats/m00000_000000_000000.log 2>&1
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Project name must be specified using bsub -P.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Request aborted by esub. Job not submitted.
Error submitting jobscript (exit code 255):

lauragails commented 1 year ago

Aaaah, I bet I know what happened: the list of jobs was built before I made those fixes, and it wasn't rebuilt afterwards (building it took a while the first time and wasn't repeated), so it probably still has the old flags on it.

I'm going to start everything from scratch with the new files and will let you know how it goes!

williamrowell commented 1 year ago

Edit: Just saw your previous message. We can save the message below in case there's still a problem.

Thanks for your patience. We're trying to support clusters that we don't have available for testing.

I'm wondering if the env variables aren't being interpolated.

First, I checked one of the top-level snakemake logs to see how this looks for slurm:

[Mon Nov 14 15:47:26 2022]
rule samtools_fasta:
    input: smrtcells/ready/HG005/m64017_200723_190224.hifi_reads.bam
    output: samples/HG005/jellyfish/m64017_200723_190224.fasta
    log: samples/HG005/logs/samtools/fasta/m64017_200723_190224.log
    jobid: 135
    benchmark: samples/HG005/benchmarks/samtools/fasta/m64017_200723_190224.tsv
    reason: Missing output files: samples/HG005/jellyfish/m64017_200723_190224.fasta
    wildcards: sample=HG005, movie=m64017_200723_190224
    threads: 4
    resources: mem_mb=32000, disk_mb=1000, tmpdir=/scratch, account=$ACCOUNT, partition=$PARTITION, threads=1, out=cluster_logs/slurm-%x-%j-%N.out, extra=

(samtools fasta -@ 3 smrtcells/ready/HG005/m64017_200723_190224.hifi_reads.bam > samples/HG005/jellyfish/m64017_200723_190224.fasta) > samples/HG005/logs/samtools/fasta/m64017_200723_190224.log 2>&1
Submitted job 135 with external jobid 'Submitted batch job 32688205'.

So I still see the variables uninterpolated in the resources (e.g., $ACCOUNT), same as in yours. Next, here's the log from the job spawned by the process above, corresponding to batch job 32688205:

[Mon Nov 14 15:47:44 2022]
rule samtools_fasta:
    input: smrtcells/ready/HG005/m64017_200723_190224.hifi_reads.bam
    output: samples/HG005/jellyfish/m64017_200723_190224.fasta
    log: samples/HG005/logs/samtools/fasta/m64017_200723_190224.log
    jobid: 0
    benchmark: samples/HG005/benchmarks/samtools/fasta/m64017_200723_190224.tsv
    wildcards: sample=HG005, movie=m64017_200723_190224
    threads: 4
    resources: mem_mb=32000, disk_mb=1000, tmpdir=/scratch, account=100humans, partition=compute, threads=1, out=cluster_logs/slurm-%x-%j-%N.out, extra=

(samtools fasta -@ 3 smrtcells/ready/HG005/m64017_200723_190224.hifi_reads.bam > samples/HG005/jellyfish/m64017_200723_190224.fasta) > samples/HG005/logs/samtools/fasta/m64017_200723_190224.log 2>&1
Activating conda environment: /pbi/flash/wrowell/testing/.snakemake/conda/252443608ffe510fcf2ff978d2d08708

The variables have been interpolated here. That's kind of confusing. I thought maybe it had something to do with single-quoting, but it seems like snakemake is still interpolating env variables in these strings.

What happens if we stop trying to be clever and set account and partition explicitly in workflow/profiles/lsf/config.yaml without using env variables?

...
default-resources:
  - account=acc_name
  - partition=premium
  - tmpdir=system_tmpdir
...
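
If it's quick to do, a dry run should show whether the hard-coded account/partition values actually land in each job's resources, without submitting anything:

# dry run: print the planned jobs and their resources without submitting
snakemake -n \
    --profile workflow/profiles/lsf \
    --snakefile workflow/process_smrtcells.smk
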
lauragails commented 1 year ago

I still got the error when I tried that earlier, but it might be because the job lists need to be completely remade from the get-go. Good idea re: hard-coding this from square one, which I'm doing now!

lauragails commented 1 year ago

Didn't work.

But stepping back, what benefit does the LSF wrapper give you vs. submitting the local script (i.e., process_smrtcells.local.sh) within an external bsub command, the way I submit any other "regular" piece of software?
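
Concretely, I mean something along these lines (the flags are just my guess for our queue; I haven't run this exact command):

# submit the local-execution wrapper as one big LSF job
bsub -P acc_name -q premium -n 80 \
     -o ./cluster_logs/local-%J.out \
     -e ./cluster_logs/local-%J.err \
     workflow/process_smrtcells.local.sh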

I see from the tutorial that "We recommend at least 80 cores and 1TB RAM for local execution. Local execution will use all available cores."

I'm running this overnight and hopefully it will get picked up!

williamrowell commented 1 year ago

The benefit of having snakemake submit jobs to the scheduler is that many processes can run in parallel. Each movie alignment job takes ~24 threads, so three movies would use nearly all available cores on that 80-core instance for ~1.5 h. It won't matter as much for process_smrtcells, but at the process_sample level there are a lot of jobs to schedule.

That being said, we've lately been running some jobs using process_sample.local.sh on a node with 256 cores + 512 GB RAM + an NVIDIA A30, and they've typically been completing within a day. We tweak a few settings to give a few jobs more threads.
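
As a rough sketch of what I mean (the snakefile path, rule name, and core count below are placeholders based on the naming above, not our exact settings), thread counts for individual rules can be overridden on the command line:

# example: give one rule more threads for a local run
snakemake \
    --snakefile workflow/process_sample.smk \
    --cores 256 \
    --set-threads some_rule=32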