Issue

The WGS pipeline can be noticeably time-consuming because of deep sequencing over the entire genome (~2 billion reads). It would be great to parallelize the post-alignment and variant calling steps wherever possible.

Approach

Both the post-alignment and the variant calling steps can be parallelized:

Post-alignment step: multi-threading approach, using the Spark versions of the GATK tools (MarkDuplicatesSpark, BaseRecalibratorSpark, ApplyBQSRSpark).
Variant calling step: scatter-gather approach, splitting the reference into pieces. For example, Mutect2 can be run with interval lists that restrict it to a subset of genomic regions.
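
A rough sketch of the scatter-gather idea, assuming we scatter by contig (the simplest possible interval split) and that the per-contig results are gathered with `MergeVcfs` and `MergeMutectStats`. The function names, file names and the `DRY_RUN` switch are made up for illustration and are not part of fonda. By default the script only prints the commands; each printed Mutect2 command is independent of the others, so they could be dispatched as separate jobs.

```shell
#!/usr/bin/env bash
# Sketch: scatter Mutect2 by contig, then gather the per-contig outputs.
# File names and the contig list are placeholders, not fonda settings.
set -eo pipefail

# Print the command when DRY_RUN=1 (the default here); run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Scatter: one Mutect2 call per contig, restricted with -L.
scatter_mutect2() {
  local ref=$1 bam=$2 outdir=$3
  shift 3
  local contig
  for contig in "$@"; do
    run gatk Mutect2 -R "$ref" -I "$bam" -L "$contig" -O "$outdir/$contig.vcf.gz"
  done
}

# Gather: merge the per-contig VCFs and the Mutect2 stats files
# (Mutect2 writes the stats next to each VCF as <output>.stats).
gather_mutect2() {
  local outdir=$1 merged=$2
  shift 2
  local vcf_args=() stats_args=() contig
  for contig in "$@"; do
    vcf_args+=(-I "$outdir/$contig.vcf.gz")
    stats_args+=(-stats "$outdir/$contig.vcf.gz.stats")
  done
  run gatk MergeVcfs "${vcf_args[@]}" -O "$merged"
  run gatk MergeMutectStats "${stats_args[@]}" -O "$merged.stats"
}
```

Calling `scatter_mutect2 reference.fasta tumor.bam calls chr1 chr2` followed by `gather_mutect2 calls merged.vcf.gz chr1 chr2` prints the two Mutect2 commands and the two merge commands; setting `DRY_RUN=0` would execute them instead. Splitting by contig gives uneven pieces, so finer interval lists would probably balance the load better.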

Spark-enabled GATK tools

MarkDuplicatesSpark

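# -M (metrics file) is optional for MarkDuplicatesSpark.
# Spark arguments are passed after the "--" separator.
gatk MarkDuplicatesSpark \
-I sorted_with_readgroup.bam \
-O output_marked_duplicates.bam \
-M marked_dup_metrics.txt \
-- \
--spark-runner SPARK \
--spark-master MASTER_URL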

BaseRecalibratorSpark

gatk BaseRecalibratorSpark \
-I output_marked_duplicates.bam \
-R reference.fasta \
--known-sites sites_of_variation.vcf \
--known-sites setOfSitesToMask.vcf \
-O output_recal.table \
-- \
--spark-runner SPARK \
--spark-master MASTER_URL

ApplyBQSRSpark

gatk ApplyBQSRSpark \
-I output_marked_duplicates.bam \
-bqsr output_recal.table \
-O output_bqsr.bam \
-- \
--spark-runner SPARK \
--spark-master MASTER_URL
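
The three tools above could be chained roughly like this. This is only a sketch: it assumes a single machine with the local Spark runner (`--spark-runner LOCAL` with `--spark-master local[N]`) rather than a cluster, and the function name, file naming scheme and `DRY_RUN` switch are hypothetical. By default it prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: chain the three Spark tools on one machine using N local threads.
# Paths, the known-sites file and the thread count are placeholders.
set -eo pipefail

# Print the command when DRY_RUN=1 (the default here); run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

post_alignment_spark() {
  local bam=$1 ref=$2 known_sites=$3 prefix=$4 threads=$5
  # Spark arguments follow the "--" separator and are shared by all steps.
  local spark_args=(-- --spark-runner LOCAL --spark-master "local[$threads]")

  # 1. Mark duplicates.
  run gatk MarkDuplicatesSpark -I "$bam" -O "$prefix.markdup.bam" "${spark_args[@]}"
  # 2. Build the recalibration table from the duplicate-marked BAM.
  run gatk BaseRecalibratorSpark -I "$prefix.markdup.bam" -R "$ref" \
    --known-sites "$known_sites" -O "$prefix.recal.table" "${spark_args[@]}"
  # 3. Apply the recalibration to the same duplicate-marked BAM.
  run gatk ApplyBQSRSpark -I "$prefix.markdup.bam" -bqsr "$prefix.recal.table" \
    -O "$prefix.bqsr.bam" "${spark_args[@]}"
}
```

For example, `post_alignment_spark sorted_with_readgroup.bam reference.fasta sites_of_variation.vcf sample1 8` prints the three commands with `local[8]`; with `DRY_RUN=0` they would run one after another, each using up to 8 threads.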

epam / fonda

WGS pipeline parallelization #202
