elixir-no-nels / rbFlow-Germline

A workflow engine with a germline calling pipeline running in a container

Time and file size optimizations #16

oskarvid opened this issue 6 years ago

oskarvid commented 6 years ago

In order to improve on the current execution time of 49 hours and 45 minutes with GATK 4.0.8.1 and the standard NA12878 fastq input files, and to reduce the file footprint, there are a couple of possible paths to take going forward.

1) HaplotypeCaller in version 4.0.2.0 took ~25 hours to run while version 4.0.8.1 took ~31 hours; using the --new-qual argument for HaplotypeCaller is supposed to save a lot of time.

2) Use vcf.gz output files to decrease the file footprint (a sketch combining points 1 and 2 follows after this list).

3) Max out the number of cores and the amount of RAM that one execute node on Colossus can offer, currently 20 cores and 60 GB RAM.

4) Use ReadsPipelineSpark: Takes unaligned or aligned reads and runs BWA (if specified), MarkDuplicates, BQSR, and HaplotypeCaller to generate a VCF file of variants. This tool is currently in beta.
This is presumably more optimized than running the individual tools in a pipeline.
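
As a rough illustration of points 1 and 2 combined, the per-sample HaplotypeCaller step could look something like the sketch below. This assumes GVCF mode as in the GATK best practices; --use-new-qual-calculator is the long form of -new-qual in GATK 4.0.x, and GATK writes block-compressed output when the output filename ends in .gz. The variable names and heap size are placeholders, not values taken from the pipeline.

# Placeholder variables, not taken from rbFlow: REFERENCE, INPUT_BAM, SAMPLE
gatk --java-options "-Xmx50G" HaplotypeCaller \
  -R ${REFERENCE} \
  -I ${INPUT_BAM} \
  -ERC GVCF \
  --use-new-qual-calculator \
  -O ${SAMPLE}.g.vcf.gz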

oskarvid commented 6 years ago

Comment on the 3rd point: I tried running rbFlowLite with GATK 4.0.8.1, 20 threads and 55 GB RAM for the JVM on a 60 GB node. For some reason this bug occurred: https://gatkforums.broadinstitute.org/gatk/discussion/7042/what-arrayindexoutofboundsexception-means-and-what-to-do-about-it It did not occur when I used 16 threads and 50 GB RAM for the JVM on a 60 GB node, so maybe there is some memory issue or something.

I think it's still worth exploring this point further; perhaps this isn't an issue when the --new-qual flag is used, for instance. But in case this bug pops up again, I have at least documented it now.
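
For reference, a minimal sketch of the two combinations described above, assuming the heap is set through --java-options and the thread count corresponds to HaplotypeCaller's --native-pair-hmm-threads argument (the actual rbFlowLite configuration may pass these differently; other arguments as in the sketch in the opening comment):

# 20 threads / 55 GB heap on a 60 GB node: hit the ArrayIndexOutOfBoundsException
gatk --java-options "-Xmx55G" HaplotypeCaller \
  --native-pair-hmm-threads 20 \
  -R ${REFERENCE} -I ${INPUT_BAM} -ERC GVCF -O ${SAMPLE}.g.vcf.gz

# 16 threads / 50 GB heap on a 60 GB node: completed without the error
gatk --java-options "-Xmx50G" HaplotypeCaller \
  --native-pair-hmm-threads 16 \
  -R ${REFERENCE} -I ${INPUT_BAM} -ERC GVCF -O ${SAMPLE}.g.vcf.gz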

GhisF commented 6 years ago

GATK ReadsPipelineSpark 4.0.1.2 tested on a local server

gatk ReadsPipelineSpark \
  --java-options "-Xmx40G" \
  --spark-runner LOCAL \
  --spark-master local[4] \
  -I ${INPUTS} \
  -R ${TWO_BITS_INDEX} \
  --known-sites ${DBSNIP} \
  --known-sites ${MILLSGOLDSTD} \
  --use-new-qual-calculator \
  -O ${OUTPUT}
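
To relate this to point 3, the same command could in principle be scaled up to a full Colossus node just by raising the local Spark thread count and the heap. A minimal sketch, assuming a 20-core / 60 GB node and a 55 GB heap as an untested guess that leaves some headroom for Spark overhead:

gatk ReadsPipelineSpark \
  --java-options "-Xmx55G" \
  --spark-runner LOCAL \
  --spark-master local[20] \
  -I ${INPUTS} \
  -R ${TWO_BITS_INDEX} \
  --known-sites ${DBSNIP} \
  --known-sites ${MILLSGOLDSTD} \
  --use-new-qual-calculator \
  -O ${OUTPUT}
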
oskarvid commented 5 years ago

I ran a benchmark with GATK 4.0.8.1 and the NA12878 input files on Colossus 2.0. In summary it took 43 hours and 2 minutes, and the complete output file size is 327 GB.

oskarvid commented 5 years ago

The execution time has so far varied between 43 hours, 45 hours and over 50 hours across three benchmarks, with no changes made between the benchmarks.

oskarvid commented 5 years ago

The current execution time with 16 cores and 60 GB RAM on Colossus is over 55 hours. The last run stopped due to hitting the time limit while running HaplotypeCaller, because HaplotypeCaller still hadn't finished after running for 33 hours. The previous execution time for HaplotypeCaller was roughly 19 h 30 min.
As a side note, bwa took 2 hours longer than before. This is strange because bwa saw a ~2 h speedup in GVCP, and the difference between the two workflows is that rbFlow uses bwa 0.7.17 and samtools 1.7 while GVCP uses 0.7.15 and 1.3.1 respectively. If anything, bwa in rbFlow was possibly faster than in GVCP with Colossus 2.0.
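
Since the bwa and samtools versions differ between the two workflow containers, the quickest sanity check is to print the versions inside whichever image is actually being run. A generic sketch (no particular container runtime assumed):

# bwa prints its version in the usage text on stderr
bwa 2>&1 | grep '^Version'
# samtools reports its version directly, e.g. "samtools 1.7"
samtools --version | head -n 1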