Clinical-Genomics / raredisease

CG's rare disease pipeline in Nextflow; see the main repo here 👇
https://github.com/nf-core/raredisease

Java memory issue on SLURM #2

Closed: jemten closed this issue 1 year ago

jemten commented 2 years ago

Description of the bug

Steps to reproduce

Steps to reproduce the behaviour:

  1. nextflow run nf-core/raredisease -profile test,singularity,hasta,dev_prio -r dev (-c customconf.conf )
  2. See error:

Without customconf.conf:

[dd/f36687] NOTE: Process `NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)` terminated with an error exit status (134) -- Execution is retried (1)
WARN: Input tuple does not match input set cardinality declared by process `NFCORE_RAREDISEASE:RAREDISEASE:DEEPVARIANT_CALLER:GLNEXUS` -- offending value: [id:caseydonkey]
Error executing process > 'NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)'

Caused by:
  Process `NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)` terminated with an error exit status (134)

Command executed:

  picard \
      -Xmx6g \
      MarkDuplicates \
      --CREATE_INDEX \
      -I 1234N.bam \
      -O 1234N_sorted.bam \
      -M 1234N_sorted.MarkDuplicates.metrics.txt

  cat <<-END_VERSIONS > versions.yml
  MARKDUPLICATES:
      markduplicates: $(echo $(picard MarkDuplicates --version 2>&1) | grep -o 'Version:.*' | cut -f2- -d:)
  END_VERSIONS

Command exit status:
  134

Command output:
  #
  # A fatal error has been detected by the Java Runtime Environment:
  #
  #  Internal Error (g1PageBasedVirtualSpace.cpp:43), pid=211157, tid=211219
  #  guarantee(rs.is_reserved()) failed: Given reserved space must have been reserved already.
  #
  # JRE version:  (11.0.9.1) (build )
  # Java VM: OpenJDK 64-Bit Server VM (11.0.9.1-internal+0-adhoc..src, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
  # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
  #
  # An error report file with more information is saved as:
  # hs_err_pid211157.log
  #
  #

Command error:
  /usr/local/bin/picard: line 5: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory
  /usr/local/bin/picard: line 66: 211157 Aborted                 /usr/local/bin/java -Xmx6g -jar /usr/local/share/picard-2.25.7-0/picard.jar MarkDuplicates "--CREATE_INDEX" "-I" "1234N.bam" "-O" "1234N_sorted.bam" "-M" "1234N_sorted.MarkDuplicates.metrics.txt"

With customconf.conf:

process {
    withName: PICARD_MARKDUPLICATES {
        memory = 5.GB
    }
}
Error executing process > 'NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)'

Caused by:
  Process `NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)` terminated with an error exit status (1)

Command executed:

  picard \
      -Xmx5g \
      MarkDuplicates \
      --CREATE_INDEX \
      -I 1234N.bam \
      -O 1234N_sorted.bam \
      -M 1234N_sorted.MarkDuplicates.metrics.txt

  cat <<-END_VERSIONS > versions.yml
  MARKDUPLICATES:
      markduplicates: $(echo $(picard MarkDuplicates --version 2>&1) | grep -o 'Version:.*' | cut -f2- -d:)
  END_VERSIONS

Command exit status:
  1

Command output:
  Error occurred during initialization of VM
  Could not reserve enough space for 5242880KB object heap

Command error:
  /usr/local/bin/picard: line 5: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory

Expected behaviour

Successful completion of the analysis

Log files

Quick fix that solves the problem until a more elegant solution is in place

In modules/nf-core/modules/picard/markduplicates/main.nf, set avail_mem = task.memory.giga - 2
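For context, this is roughly where that change lands in the module's script block (paraphrased from the nf-core picard/markduplicates module; the surrounding lines may differ between module versions):

script:
// default to 3 GB when the task has no memory directive; otherwise leave
// ~2 GB of the allocation for JVM and native overhead instead of handing
// the whole allocation to the Java heap via -Xmx
def avail_mem = 3
if (!task.memory) {
    log.info '[Picard MarkDuplicates] Available memory not known - defaulting to 3GB.'
} else {
    avail_mem = task.memory.giga - 2
}
"""
picard \
    -Xmx${avail_mem}g \
    MarkDuplicates \
    ...
"""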

Next related issue

A similar error occurs for bamqc.

Additional context

For the first error, markduplicates:

nextflow-customconf.log, nextflow-no-customconf.log

jemten commented 2 years ago

Transferred from https://github.com/nf-core/raredisease/issues/53

rannick commented 2 years ago

Adding the following to hasta.config gets picard through, but not qualimap/bamqc:

process { clusterOptions = { task.memory ? "-l h_vmem=${task.memory.bytes/task.cpus}+2" : null } }

Increasing memory to 40.GB for QUALIMAP_BAMQC lets qualimap/bamqc run through as well.

projectoriented commented 2 years ago

@rannick https://github.com/nf-core/configs/blob/master/conf/pipeline/raredisease/hasta.config 😸 - this one is specific to the raredisease pipeline on the server; the modifications should go here, scoped with the withName selector ☝️

https://github.com/nf-core/configs/blob/master/conf/hasta.config - this one applies at the server-wide level 😆
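Concretely, putting rannick's two tweaks into the pipeline-specific file with a withName selector might look roughly like this (a sketch, not the actual contents of conf/pipeline/raredisease/hasta.config):

process {
    // request scheduler memory derived from each task's memory directive
    clusterOptions = { task.memory ? "-l h_vmem=${task.memory.bytes/task.cpus}+2" : null }
    // the bump that got qualimap/bamqc through in rannick's test
    withName: QUALIMAP_BAMQC {
        memory = 40.GB
    }
}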

projectoriented commented 2 years ago

Used this command: nextflow run . -profile hasta,dev_prio --genome GRCh38 --input ../samplesheet_justhusky.csv --local_genomes ../references/ --igenomes_ignore

[6c/42be36] process > NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (earlycasualcaiman)                          [100%] 6 of 6, failed: 3, retries: 3 ✔
[dc/89b014] process > NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:SAMTOOLS_INDEX_MD (hugelymodelbat)                          [100%] 1 of 1
[e9/199494] process > NFCORE_RAREDISEASE:RAREDISEASE:QC_BAM:PICARD_COLLECTMULTIPLEMETRICS (hugelymodelbat)                     [ 50%] 1 of 2, failed: 1, retries: 1
[f2/bc2a13] process > NFCORE_RAREDISEASE:RAREDISEASE:QC_BAM:CAT_CAT_BAIT (1)                                                   [100%] 1 of 1 ✔
[02/195aab] process > NFCORE_RAREDISEASE:RAREDISEASE:QC_BAM:PICARD_COLLECTHSMETRICS (hugelymodelbat)                           [100%] 1 of 1, failed: 1 ✘
[78/efc0fc] process > NFCORE_RAREDISEASE:RAREDISEASE:QC_BAM:QUALIMAP_BAMQC (hugelymodelbat)                                    [100%] 1 of 1, failed: 1
[d3/c59b81] process > NFCORE_RAREDISEASE:RAREDISEASE:QC_BAM:TIDDIT_COV (hugelymodelbat)                                        [100%] 1 of 1
[-        ] process > NFCORE_RAREDISEASE:RAREDISEASE:QC_BAM:UCSC_WIGTOBIGWIG                                                   -
[-        ] process > NFCORE_RAREDISEASE:RAREDISEASE:QC_BAM:MOSDEPTH                                                           -
[-        ] process > NFCORE_RAREDISEASE:RAREDISEASE:CALL_REPEAT_EXPANSIONS:EXPANSIONHUNTER                                    -
[-        ] process > NFCORE_RAREDISEASE:RAREDISEASE:CALL_SNV_DEEPVARIANT:DEEPVARIANT                                          -
[-        ] process > NFCORE_RAREDISEASE:RAREDISEASE:CALL_SNV_DEEPVARIANT:GLNEXUS                                              -
[-        ] process > NFCORE_RAREDISEASE:RAREDISEASE:CALL_SNV_DEEPVARIANT:SPLIT_MULTIALLELICS_GL                               -
[-        ] process > NFCORE_RAREDISEASE:RAREDISEASE:CALL_SNV_DEEPVARIANT:REMOVE_DUPLICATES_GL                                 -
[-        ] process > NFCORE_RAREDISEASE:RAREDISEASE:CALL_SNV_DEEPVARIANT:TABIX_GL                                             -
[-        ] process > NFCORE_RAREDISEASE:RAREDISEASE:CALL_STRUCTURAL_VARIANTS:CALL_SV_MANTA:MANTA                              -
[ae/19f33f] process > NFCORE_RAREDISEASE:RAREDISEASE:CALL_STRUCTURAL_VARIANTS:CALL_SV_TIDDIT:TIDDIT_SV (hugelymodelbat)        [100%] 1 of 1
[-        ] process > NFCORE_RAREDISEASE:RAREDISEASE:CALL_STRUCTURAL_VARIANTS:CALL_SV_TIDDIT:SVDB_MERGE_TIDDIT                 -
[-        ] process > NFCORE_RAREDISEASE:RAREDISEASE:CALL_STRUCTURAL_VARIANTS:SVDB_MERGE                                       -
[-        ] process > NFCORE_RAREDISEASE:RAREDISEASE:ANNOTATE_VCFANNO:VCFANNO                                                  -
[-        ] process > NFCORE_RAREDISEASE:RAREDISEASE:CUSTOM_DUMPSOFTWAREVERSIONS                                               -
[-        ] process > NFCORE_RAREDISEASE:RAREDISEASE:MULTIQC                                                                   -
-[nf-core/raredisease] Pipeline completed with errors-
Error executing process > 'NFCORE_RAREDISEASE:RAREDISEASE:QC_BAM:QUALIMAP_BAMQC (hugelymodelbat)'

Caused by:
  Process `NFCORE_RAREDISEASE:RAREDISEASE:QC_BAM:QUALIMAP_BAMQC (hugelymodelbat)` terminated with an error exit status (127)

Command executed:

  unset DISPLAY
  mkdir tmp
  export _JAVA_OPTIONS=-Djava.io.tmpdir=./tmp
  qualimap \
      --java-mem-size=36G \
      bamqc \
       \
      -bam hugelymodelbat_sorted_md.bam \
       \
      -p non-strand-specific \
      --collect-overlap-pairs \
      -outdir hugelymodelbat \
      -nt 6

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RAREDISEASE:RAREDISEASE:QC_BAM:QUALIMAP_BAMQC":
      qualimap: $(echo $(qualimap 2>&1) | sed 's/^.*QualiMap v.//; s/Built.*$//')
  END_VERSIONS

Command exit status:
  127

Command output:
  Java memory size is set to 36G
  Launching application...

  QualiMap v.2.2.2-dev
  Built on 2019-11-11 14:05

  Selected tool: bamqc
  Available memory (Mb): 33
  Max memory (Mb): 38654
  Starting bam qc....
  Loading sam header...
  Loading locator...
  Loading reference...
  Number of windows: 400, effective number of windows: 594
  Chunk of reads size: 1000
  Number of threads: 6

Command error:
  Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=./tmp
  cannot allocate memory for thread-local data: ABORT
raysloks commented 2 years ago

Hi! Using this config I managed to successfully run the entire pipeline:

process {
    withName: QUALIMAP_BAMQC {
        memory = 65535.MB
        cpus = 2
    }
    clusterOptions = { task.memory ? "-l h_vmem=${task.memory.bytes/task.cpus}+2" : null }
}

Before that I used this line in the modules themselves:

avail_mem = task.memory.giga * 14 / 15 as long

Other than that I've only tested the following for bamqc, which did NOT work:

withName: QUALIMAP_BAMQC {
    memory = 30000.MB
}
raysloks commented 2 years ago

This config also works, and is a little bit less hacky than the previous approach.

process {
    withName: QUALIMAP_BAMQC {
        memory = 64.GB
        ext.args = "--java-mem-size=60G"
    }
    clusterOptions = { task.memory ? "-l h_vmem=${task.memory.bytes/task.cpus}+2" : null }
}
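Keeping --java-mem-size a few gigabytes below the memory directive presumably leaves headroom for the JVM's own non-heap overhead (thread stacks, native buffers), which would explain pairing 60G with a 64.GB allocation here.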

I assume adding it here in the current config would look something like this:

process {
    executor = 'slurm'
    clusterOptions = {
        // combine the account/priority options with the memory request; written
        // as two separate expressions, the closure would only return the last one
        def base = "-A $params.priority ${params.clusterOptions ?: ''}"
        task.memory ? "${base} -l h_vmem=${task.memory.bytes/task.cpus}+2" : base
    }
    withName: QUALIMAP_BAMQC {
        memory = 64.GB
        ext.args = "--java-mem-size=60G"
    }
}

I think it'd be prudent to not hard-code the memory values like that, but I haven't found a decent way to avoid it yet.
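One possible way to avoid hard-coding the heap size, assuming ext.args may be given as a closure (as nf-core module configs generally allow), would be to derive it from the task's memory directive; a sketch, not a tested configuration:

process {
    withName: QUALIMAP_BAMQC {
        memory   = 64.GB
        // derive the Java heap from the task allocation, leaving ~4 GB of
        // headroom for the JVM itself (assumed margin, not a tested value)
        ext.args = { "--java-mem-size=${task.memory.giga - 4}G" }
    }
}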

raysloks commented 2 years ago

Update: Further testing suggests setting the -l h_vmem flag via clusterOptions does not actually do anything.

Directly telling slurm to allocate more memory via clusterOptions seems to work:

process {
    withName:'MARKDUPLICATES' {
        clusterOptions = { task.memory ? "--mem ${task.memory.mega * 1.15 as long}M" : null }
    }
}

Will make a PR to the configs repo when I know how best to integrate it with the already existing clusterOptions.
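If the same failure mode shows up for the other Java-based processes in the run log above (qualimap/bamqc and the Picard metrics steps), the override could presumably be widened with a multi-name selector, something like this sketch:

process {
    // hypothetical extension of the MARKDUPLICATES override to the other
    // Java-heavy processes named in the run log; untested
    withName: 'MARKDUPLICATES|QUALIMAP_BAMQC|PICARD_COLLECTHSMETRICS|PICARD_COLLECTMULTIPLEMETRICS' {
        clusterOptions = { task.memory ? "--mem ${task.memory.mega * 1.15 as long}M" : null }
    }
}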