PacificBiosciences / pb-human-wgs-workflow-snakemake

DEPRECATED - Workflow for the comprehensive detection and prioritization of variants in human genomes with PacBio HiFi reads
BSD 3-Clause Clear License

Specifying directory for singularity downloads #142

Closed lauragails closed 1 year ago

lauragails commented 1 year ago

Hello, is it possible to select the location for downloading deepvariant? When I run it on our cluster, it defaults to my home directory, which doesn't have that much space. I would prefer to put it in my "software" directory, for example.

Also, when I submit this as a job to LSF, I don't have internet access. So any pulling attempt will cause everything to fail. Is there a way to do a detect-if-not-there thing so that way it doesn't automatically download deepvariant?

Thank you!

Laura

williamrowell commented 1 year ago

There are a few things happening here.

# this is where singularity caches downloads when building an image
# by default, it's in ~/.singularity
# this is probably what's filling up your homedir quota
echo $SINGULARITY_CACHEDIR

# this is where singularity builds the image temporarily
# by default, it's /tmp
# sometimes HPCs use a different path for /tmp, like /scratch
echo $SINGULARITY_TMPDIR

# this is where the final image will be stored
# command below assumes you're calling this from the
# directory that contains workflow/, samples/, smrtcells/, etc
echo $PWD/.snakemake/singularity

You can't really change the third path, but you can change the first two. Make sure you have changed variables.env line 2 to point to a scratch directory with a lot of space. Then add export SINGULARITY_CACHEDIR="$TMPDIR" after line 6. Now SINGULARITY_CACHEDIR and SINGULARITY_TMPDIR will both point to the scratch directory you defined on line 2.
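For reference, a rough sketch of what that part of variables.env might look like after the edit (assuming line 2 defines TMPDIR and SINGULARITY_TMPDIR is already derived from it; the exact line contents vary by release):

# line 2: point TMPDIR at a scratch directory with plenty of space
export TMPDIR=/path/to/my/large/scratch/directory
# existing line that sends singularity's temporary build space to scratch
export SINGULARITY_TMPDIR="$TMPDIR"
# added after line 6: send the singularity download cache to scratch too
export SINGULARITY_CACHEDIR="$TMPDIR"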

Before you do all of this, clean your current singularity cache with:

singularity cache clean -a
lauragails commented 1 year ago

Got it, thanks!

lauragails commented 1 year ago

FYI that flag didn't work, but this will (if I hit "yes")

singularity cache clean all

Is that the same command?

williamrowell commented 1 year ago

The second part of your question is more complicated.

It's possible to run snakemake on your login node (assuming it has an internet connection) in a mode where it only downloads and creates the conda environments.

snakemake ... --conda-create-envs-only

So you could temporarily add --conda-create-envs-only to the end of the snakemake command in the shell files (process_smrtcells.lsf.sh, process_sample.lsf.sh, process_cohort.lsf.sh), on the login node, to create all of the conda envs.
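As a rough sketch of that one-off login-node step (the flags besides --conda-create-envs-only are placeholders, not the workflow's exact command):

# run once on the login node, which has internet access
snakemake --use-conda --conda-create-envs-only ...
# by default, snakemake creates the environments under the working directory
ls .snakemake/conda/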

I'm still looking for the equivalent option to do this for singularity images.

williamrowell commented 1 year ago

If there's no other way to do this, you could manually pull/create the singularity images you need and change the snakemake rules to use your local copies.

export TMPDIR=/path/to/my/large/scratch/directory
export SINGULARITY_CACHEDIR=$TMPDIR
export SINGULARITY_TMPDIR=$TMPDIR

singularity pull docker://google/deepvariant:1.4.0    # or 1.4.0-gpu
# creates deepvariant_1.4.0.sif
singularity pull docker://ghcr.io/dnanexus-rnd/glnexus:v1.4.1
# creates glnexus_v1.4.1.sif

These will generate *.sif files. Move these to their permanent home and get the full absolute path to these images.

rules/sample_deepvariant.smk lines 23, 59, 88

    container: "/path/to/deepvariant_1.4.0.sif"

rules/cohort_glnexus.smk line 15

    container: "/path/to/glnexus_v1.4.1.sif"
lauragails commented 1 year ago

So far so good! It looks like process_samples is working now, but only a few jobs started so I'll keep you posted.

FYI it took a boatload of space to actually make the environment, and the deepvariant containers. I'm sure it had to do with sorting/intermediate file generation.

lauragails commented 1 year ago

So far deepvariant make_examples is working. But then I see an info message:

Activating singularity image /sc/arion/projects/buxbaj01a/software/pacbio_sifs/deepvariant_1.4.0.sif
INFO:    Could not find any NVIDIA binaries on this host!

It doesn't look like an error and jobs are still running, but sharing in case this will cause a downstream issue/in case I'm missing some software.

williamrowell commented 1 year ago

This is a warning and not an error. Everything is still running correctly. It's basically saying, "you asked me to allow this singularity container to use the GPU, but there isn't an NVIDIA GPU on this node".

lauragails commented 1 year ago

Ok! Also memory question: When I ran deepvariant previously (not with the pipeline, through google's bucket/pipeline) I generated a ton of intermediate files and ran out of space.

I have ~20 samples, each submitted as its own job, and ~10TB free at the moment. Do you think I will run out of disk space?

Thank you for fielding all of my questions! In the meantime, I'm going to delete more files...

juniper-lake commented 1 year ago

Most intermediate files produced by DeepVariant are labeled as "temp" in the snakemake workflow and therefore deleted upon completion of the workflow. I think you'll be fine for disk space, but it's difficult to say anything with 100% certainty.
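For anyone curious about the mechanism, a minimal illustration (a hypothetical rule, not the workflow's actual code): outputs wrapped in Snakemake's temp() are deleted as soon as every job that consumes them has finished.

# hypothetical rule: the tfrecord is removed once downstream jobs are done with it
rule make_intermediate:
    output: temp("samples/{sample}/deepvariant/examples/intermediate.tfrecord.gz")
    shell: "touch {output}"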

lauragails commented 1 year ago

I suspect we aren't getting all of the written files into user-specified directories. For example, I have the error (and others like it in different spots of the pipeline):

OSError: [Errno 28] No space left on device: 'samples/samplename/hifiasm'

The samples directory (and all subfolders where I'm executing the PB pipeline) should have ~20TB available (I currently see this).

However in my home directory, I only have ~10GB, so if packages/things get installed to any default place, that would throw a space issue.

Are there any other paths I can export to bash variables? Thank you!

lauragails commented 1 year ago

Good news! The "no space left on device" error was caused by an underlying limit on the HPC end (solved by our HPC core team):

They increased my inode limit to 8 million.

lauragails commented 1 year ago

Corollary follow-up: things were truncated immediately across jobs when this underlying issue hit, but I noticed that the check for the next step only looks to see whether a file exists. If a file was truncated mid-writing, will this throw an error message or would you recommend restarting from scratch?

williamrowell commented 1 year ago

If a file was truncated mid-writing, will this throw an error message or would you recommend restarting from scratch?

It really depends on which program was interrupted. The safest thing would be to wipe out incomplete jobs and start again.
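A sketch of one way to do that with stock Snakemake options (the path below is an example only): remove the outputs you know are suspect and let Snakemake rebuild anything it recorded as incomplete.

# delete outputs you know were truncated (example path)
rm -r samples/samplename/hifiasm
# re-run; --rerun-incomplete regenerates outputs snakemake marked as incomplete
snakemake ... --rerun-incomplete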

lauragails commented 1 year ago

Got it. Also, I ran out of memory. Would you recommend running with more nodes and/or more memory in each node?

I ran with 15 cores, 5000 gb in each

Each sample was submitted with the same configuration, as a separate job. Thank you, Billy!

williamrowell commented 1 year ago

I ran with 15 cores, 5000 gb in each

I assume you mean 5000 MB?

We don't have a lot of experience running on single nodes (*.local.sh), but 256 threads and ~512 GB have worked for us. Typically we expect ~4-8 GB RAM per thread, though only a few tasks (e.g. hifiasm, postprocess_variants, alignment) come close to the total expected memory requirement.
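As a rough illustration of that guidance on LSF (a hypothetical request, not the workflow's submission script; whether rusage[mem] is per core or per job depends on your LSF configuration): 16 cores at ~8 GB per core is ~128 GB for the job.

# hypothetical LSF request: 16 cores with ~8 GB per core (~128 GB total)
bsub -n 16 -R "rusage[mem=8000]" -o cluster_logs/%J.out <job script>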

lauragails commented 1 year ago

aaah yes whoops.

That is totally doable, thanks for the guidance!

joey-lai commented 1 year ago

Hi, I followed the fix in this issue, as the HPC in our facility does not allow the compute nodes to access the internet to pull the deepvariant container.

I pulled and created the *.sif file with deepvariant version 1.5.0 instead and updated the lines 23, 59, 88 in rules/sample_deepvariant.smk.

If there's no other way to do this, you could manually pull/create the singularity images you need and change the snakemake rules to use your local copies.

snakemake ... --conda-create-envs-only was executed on my login node before submitting the slurm job. Singularity also had to be loaded with module load in my slurm job script. What I got from the slurm job submission was the following.

rule deepvariant_make_examples:
    input: samples/HG002/aligned/m64012_190920_173625.GRCh38.bam, samples/HG002/aligned/m64012_190920_173625.GRCh38.bam.bai, reference/human_GRCh38_no_alt_analysis_set.fasta
    output: samples/HG002/deepvariant/examples/examples.tfrecord-00180-of-00256.gz, samples/HG002/deepvariant/examples/gvcf.tfrecord-00180-of-00256.gz
    log: samples/HG002/logs/deepvariant/make_examples/HG002.GRCh38.00180-of-00256.log
    jobid: 287
    benchmark: samples/HG002/benchmarks/deepvariant/make_examples/HG002.GRCh38.00180-of-00256.tsv
    reason: Missing output files: samples/HG002/deepvariant/examples/gvcf.tfrecord-00180-of-00256.gz, samples/HG002/deepvariant/examples/examples.tfrecord-00180-of-00256.gz
    wildcards: shard=00180
    resources: mem_mb=8000, disk_mb=10845, tmpdir=/tmp, account=$ACCOUNT, partition=$PARTITION, threads=1, out=cluster_logs/slurm-%x-%j-%N.out, extra=--constraint=avx512

        (/opt/deepvariant/bin/make_examples
            --norealign_reads
            --vsc_min_fraction_indels 0.12
            --pileup_image_width 199
            --track_ref_reads
            --phase_reads
            --partition_size=25000
            --max_reads_per_partition=600
            --alt_aligned_pileup=diff_channels
            --add_hp_channel
            --sort_by_haplotypes
            --parse_sam_aux_fields
            --min_mapping_quality=1
            --mode calling
            --ref reference/human_GRCh38_no_alt_analysis_set.fasta
            --reads samples/HG002/aligned/m64012_190920_173625.GRCh38.bam
            --examples samples/HG002/deepvariant/examples/examples.tfrecord@256.gz
            --gvcf samples/HG002/deepvariant/examples/gvcf.tfrecord@256.gz
            --task 00180) > samples/HG002/logs/deepvariant/make_examples/HG002.GRCh38.00180-of-00256.log 2>&1

sbatch: error: Batch job submission failed: Invalid feature specification
Error submitting jobscript (exit code 1):

It seems that slurm was unable to process deepvariant_make_examples and kept failing in each make_examples job.

williamrowell commented 1 year ago

sbatch is requesting --constraint=avx512, which isn't available on your cluster.

Modify this section of variables.env by commenting out line 16:

# deepvariant make_examples and postprocess_variants require at least avx2
# this is a workaround on our internal cluster, but it should almost always be commented out
export DEEPVARIANT_AVX2_CONSTRAINT='--constraint=avx512'
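After the change, that line is simply commented out:

# export DEEPVARIANT_AVX2_CONSTRAINT='--constraint=avx512'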
joey-lai commented 1 year ago

Thank you for the quick response!

It did solve the Error submitting jobscript (exit code 1), but the log still shows the "Missing output files" reason: Missing output files: samples/HG002/deepvariant/examples/gvcf.tfrecord-00124-of-00256.gz, samples/HG002/deepvariant/examples/examples.tfrecord-00124-of-00256.gz

williamrowell commented 1 year ago

This is not an error. It means that the reason the deepvariant_make_examples task will be run is that its output files don't already exist. Another reason for running a task might be, for instance, that its input files have changed.

joey-lai commented 1 year ago

Got it. I had confused the cluster resource error with it. It turned out that I just had to tweak max-threads: 256 in workflow/profiles/slurm/config.yaml for the cluster node.

What do you recommend to change in the config.yaml if our other cluster nodes have minimum requirements for available cores >1?

juniper-lake commented 1 year ago

What do you recommend to change in the config.yaml if our other cluster nodes have minimum requirements for available cores >1?

If I understand correctly, you have cluster nodes that only allow job requests with more than 1 CPU. To change the default number of CPUs for each job, you would increase the default number of threads in the config yaml under default resources, which is currently set to 1. https://github.com/PacificBiosciences/pb-human-wgs-workflow-snakemake/blob/d21477b5c7b315627fc962b1f48964e9081ea5db/profiles/slurm/config.yaml#L23
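As a sketch, the relevant stanza of the profile config looks roughly like this (the surrounding layout is an assumption about a typical Snakemake profile, not a verbatim copy of the file):

default-resources:
  - threads=1   # raise this default if your nodes require a minimum CPU count per job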

Please let me know if this addresses your issue or if I misunderstood the problem.

joey-lai commented 1 year ago

I tweaked the default thread number (- threads=1) and other parameters in the slurm script. Eventually I could push the slurm job to the cluster, but it would still fail. I think the error was coming from the thread requirement of each snakemake job in the .smk files in the workflow/rules/ directory. For example, deepvariant_postprocess_variants requires 4 threads in its respective .smk file. The minimum thread usage can be checked at the beginning of a log. The job still used 4 threads for deepvariant_postprocess_variants even though I had changed - threads=1 to 57, which is the minimum CPU requirement for this particular cluster (see the sketch after the job stats below).

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 100
Job stats:
job                                 count    min threads    max threads
--------------------------------  -------  -------------  -------------
all                                     1              1              1
bcftools_concat_pbsv_vcf                1              1              1
bgzip_vcf                              48              2              2
calculate_sample_gc_coverage            1              1              1
deepvariant_bcftools_roh                1              1              1
deepvariant_bcftools_stats              1              4              4
deepvariant_call_variants               1            224            224
deepvariant_make_examples             163              1              1
deepvariant_postprocess_variants        1              4              4
last_align                              1             24             24
md5sum                                 28              1              1
merge_haplotagged_bams                  1              8              8
mosdepth                                1              4              4
pbsv_call                              22              8              8
pbsv_discover                          22              1              1
samtools_index_bam                      1              4              4
split_deepvariant_vcf                  25              1              1
tabix_vcf                              74              1              1
tandem_genotypes                        1              1              1
tandem_genotypes_absolute_count         1              1              1
tandem_genotypes_plot                   1              1              1
tandem_repeat_coverage_dropouts         1              1              1
trgt_coverage_dropouts                  1              1              1
trgt_genotype                           1             32             32
whatshap_bcftools_concat                1              1              1
whatshap_haplotag                       1              4              4
whatshap_phase                         25              1              1
whatshap_stats                          1              1              1
total                                 427              1            224
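For reference, one possible way to raise a per-rule thread count without editing the .smk files is stock Snakemake's --set-threads override (a sketch only; the rule name is taken from the job stats above, and whether this satisfies the cluster's minimum-CPU policy is a separate question):

# override the per-rule thread count at submission time (sketch)
snakemake ... --set-threads deepvariant_postprocess_variants=57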

I think this issue is beyond the scope of this discussion. I should open another issue page.

You have been a great help! Thank you.