gagneurlab / drop

Pipeline to find aberrant events in RNA-Seq data, useful for diagnosis of rare disorders
MIT License

Issue with variant calling pipeline - not enough space #358

Closed: ChiaraF32 closed this issue 2 years ago

ChiaraF32 commented 2 years ago

Hi, I am trying to run the DROP variant calling pipeline on a group of around 30 skeletal muscle RNA-seq samples.

I am running the pipeline on an Ubuntu 18.04 virtual machine with 16 CPUs and 64 GB of RAM.

I keep running into issues with the pipeline writing temporary files to /tmp and maxing out the available storage, which causes the run to fail. In my set-up, /tmp is on the root volume, which has limited space (40 GB), so it would be preferable to write to a directory on my /data volume, which has much more space.

An example error I am getting is below:

Caused by: java.io.IOException: No space left on device
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
    at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
    at java.nio.channels.Channels.writeFully(Channels.java:101)
    at java.nio.channels.Channels.access$000(Channels.java:61)
    at java.nio.channels.Channels$1.write(Channels.java:174)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:360)
    at org.xerial.snappy.SnappyOutputStream.compressInput(SnappyOutputStream.java:377)
    at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:130)
    at htsjdk.samtools.util.BinaryCodec.writeBytes(BinaryCodec.java:220)
    ... 10 more

I first tried adding a default-resources specification to my config.yaml file (after reading this article). I also added the --default-resources flag to my run command:

snakemake rnaVariantCalling --cores 10 --default-resources --rerun-incomplete

However, this did not seem to alter which resources were used at each step, and the pipeline failed again. A copy of my config.yaml is below.

projectTitle: "Muscle_Jul2022"
root: /data/drop/Muscle_Jul2022/output             # root directory of all intermediate output
htmlOutputPath: /data/drop/Muscle_Jul2022/output/html  # path for HTML rendered reports
indexWithFolderName: true # whether the root base name should be part of the index name
hpoFile: /data/drop/input/ref/hpo_genes.tsv.gz  # if null, downloads it from webserver 
sampleAnnotation: /data/drop/Muscle_Jul2022/data/sample_annotation.tsv # path to sample annotation (see documentation on how to create it)
geneAnnotation:
    # multiple annotations with custom names are possible
    # <annotation_name> : <path to gencode v19 annotation>
    v38: /data/references/iGenomes/Homo_sapiens/GENCODE/gencode.v39.annotation.gtf.gz # example
genomeAssembly: hg38  # either hg19/hs37d5 or hg38/GRCh38
exportCounts:
    # specify which gene annotations to include and which 
    # groups to exclude when exporting counts
    geneAnnotations:
      - v38
    excludeGroups:
      - null
genome: /data/references/iGenomes/Homo_sapiens/UCSC/hg38/drop/hg38.fa.fasta # path to genome sequence in fasta format. 
    # You can define multiple reference genomes in yaml format, ncbi: path_to_ncbi, ucsc: path_to_ucsc
    # the keywords that define the path should be in the GENOME column of the SA table
aberrantExpression:
    run: true
    groups: 
        - MUSCLE
    fpkmCutoff: 1 #suggested by outrider
    implementation: autoencoder
    padjCutoff: 0.05 #default
    zScoreCutoff: 0 #default; Z scores (in absolute value) greater than this cutoff are considered as outliers.
    maxTestedDimensionProportion: 3 #default
aberrantSplicing:
    run: true
    groups:
        - MUSCLE
    recount: true #default
    longRead: false #only required if data is longread
    keepNonStandardChrs: true
    filter: true #filter out junctions expressed at low levels if set to true
    minExpressionInOneSample: 20 #default; The minimal read count in at least one sample required for an intron to pass the filter.
    minDeltaPsi: 0.05 #default
    implementation: PCA
    padjCutoff: 0.05 #default
    zScoreCutoff: 0 #default
    deltaPsiCutoff: 0.3 #default; suggested by FRASER
    maxTestedDimensionProportion: 6 #default
rnaVariantCalling:
    run: true
    groups:
        - MUSCLE
    highQualityVCFs:
        - /data/drop/resources/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
        - /data/drop/resources/Homo_sapiens_assembly38.known_indels.vcf.gz
    dbSNP: /data/drop/resources/00-All.vcf.gz
    repeat_mask: /data/drop/resources/hg38_repeatMasker_sorted.bed
    addAF: true
    maxAF: 0.001
    maxVarFreqCohort: 0.1
    minAlt: 3
    hcArgs: 
mae:
    run: false
    groups:
        - MUSCLE
    gatkIgnoreHeaderCheck: true
    padjCutoff: 0.05
    allelicRatioCutoff: 0.8
    addAF: false
    maxAF: .001 #default
    maxVarFreqCohort: .05 #default
    # VCF-BAM matching
    qcVcf: /data/drop/input/ref/qc_vcf_1000G_hg19.vcf.gz # path to common variant file e.g. qc_vcf_1000G.vcf.gz
    qcGroups: mae
tools:
    gatkCmd: gatk
    bcftoolsCmd: bcftools
    samtoolsCmd: samtools
default-resources:
  - mem_mb=10000
  - tmpdir=/data/tmp_drop

Since that didn't work, I next had a look at the Snakefile, which I found in a sub-directory of my project directory: /data/drop/Scripts/rnaVariantCalling/pipeline/Snakefile

I tried modifying the Snakefile manually for each step that uses a GATK tool, adding the --TMP_DIR flag and pointing it at a directory on my /data partition (see the attached Snakefile, and the example below). This didn't seem to work either; the log files still showed the default /tmp directory being used. I then tried deleting all occurrences of --java-options "-Djava.io.tmpdir={resources.tmpdir}" on top of using the --TMP_DIR flag. That didn't work either: the pipeline kept writing files to /tmp.

gatk MarkDuplicates \
    -I {input.bam} -O {output.bam} \
    -M {params.metrics} --CREATE_INDEX true \
    --TMP_DIR /data/tmp_drop \
    --SORTING_COLLECTION_SIZE_RATIO 0.15 \
    --VALIDATION_STRINGENCY SILENT 2> {log}
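For context, the upstream rules thread Snakemake's standard tmpdir resource into the JVM via --java-options, which is why deleting that part removes the redirection entirely rather than fixing it. A minimal sketch of the pattern (rule, file, and path names here are hypothetical, not DROP's actual rule):

```python
# Snakemake rule sketch; names are hypothetical, not DROP's actual rule.
# "tmpdir" is one of Snakemake's standard resources, so a CLI override like
#   --default-resources "tmpdir='/data/tmp_drop'"
# replaces the value substituted below without editing the Snakefile.
rule mark_duplicates:
    input:
        bam="aligned/{sample}.bam"
    output:
        bam="dedup/{sample}.bam"
    params:
        metrics="dedup/{sample}.metrics"
    log:
        "logs/{sample}.markdup.log"
    shell:
        """
        gatk --java-options "-Djava.io.tmpdir={resources.tmpdir}" MarkDuplicates \
            -I {input.bam} -O {output.bam} \
            -M {params.metrics} --CREATE_INDEX true \
            --VALIDATION_STRINGENCY SILENT 2> {log}
        """
```

Because {resources.tmpdir} is resolved by Snakemake at job submission, the redirection has to be set at the Snakemake level (CLI or profile); editing the GATK flags alone leaves other tools and the R steps pointing at the default /tmp.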

I feel a bit stuck as to what to do next. Maybe I am just missing something really obvious and simple! Any help would be very much appreciated.

Attachment: Snakefile.txt

nickhsmith commented 2 years ago

You should be able to use the original Snakefile and set the tmpdir resource on the command line. You have to quote the tmpdir value: double quotes around the expression, single quotes around the path.

Look at snakemake's resource section: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#standard-resources

snakemake rnaVariantCalling --cores 10 --default-resources "tmpdir='/data/tmp_drop'" --keep-going

EDIT: In my test run, the tmpdir had to be a full absolute path; otherwise the R processing sections fail.

Hopefully this works. If not, I would recommend changing the environment's $TMPDIR variable to match: https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir
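Python's tempfile module, which Snakemake uses to pick its default tmpdir, consults $TMPDIR before any platform defaults. A small sketch of that behaviour, using a throwaway directory as a stand-in for a real path like /data/tmp_drop:

```python
import os
import tempfile

# tempfile.gettempdir() checks $TMPDIR first (then TEMP, TMP, and
# platform defaults), but caches the result after the first call.
custom_tmp = tempfile.mkdtemp()   # stand-in for e.g. /data/tmp_drop
os.environ["TMPDIR"] = custom_tmp
tempfile.tempdir = None           # clear the cache so $TMPDIR is re-read
assert tempfile.gettempdir() == custom_tmp
```

Note that the directory must already exist and be writable, or gettempdir() silently falls back to the next candidate, which is exactly how runs end up back in /tmp.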

ChiaraF32 commented 2 years ago

Thanks for your help, it is now working so I will close this thread!