genome / analysis-workflows

Open workflow definitions for genomic analysis from MGI at WUSM.
MIT License
102 stars 57 forks source link

Limit GATK HaplotypeCaller to provided interval_list with padding #924

Open jasonwalker80 opened 4 years ago

jasonwalker80 commented 4 years ago

The germline WES workflow specifically should be limited to regions of interest in the form of an input interval_list. I believe the interval_list is already an input, but: A) the interval_list should be the target_regions B) as implemented in somatic pipelines, we should have a "padding" input to include splice sites

apaul7 commented 3 years ago

I ran into a few problems which have caused delays in getting this issue resolved. I've written out some details for what problems I faced and what my solutions were. The biggest was that the current germline exome and wgs pipelines use the same detect variants subworkflows. This has caused the exome pipeline to call variants across the entire genome and then filter. To speed up the exome we can run gatk over just those target intervals.

just an overview the main steps that could be changed for the issue of wgs vs exome to provide better context on the issues:

current wgs input to detect variants: intervals: 1-22,X,Y limit_variant_intervals: variant_reporting_intervals interval list file

current exome input to detect variants: intervals: 1-22,X,Y limit_variant_intervals: target_intervals interval list file

The goal is for the exome workflow, is that the gatk step only calls the regions in the target_intervals file. The goal is for the wgs workflow, is that the gatk will continue calling over the entire genome, then produce a 'full' filtered vcf and a final vcf. The final vcf will be limited to the variant_reporting_intervals(unless this limited final vcf is no longer wanted)

Both the exome and the wgs call the same germline_detect_variants.cwl. This detect variants subworkflow then calls the same germline_filter_vcf.cwl subworkflow.

The initial idea I had was to have the tools/gatk_haplotype_caller.cwl step be provided an optional interval list file. This interval list file would not be provided when the wgs workflow calls subworkflows/germline_detect_variants.cwl. This would be provided when the exome workflow calls the subworkflows/germline_detect_variants.cwl.

Then set the --interval-set-rule to INTERSECTION. This would allow the haplotype_caller(subworkflow/gatk_haplotypecaller_iterator.cwl) step to still scatter jobs over the intervals(string[][]). This is because each scattered job would have at least two interval inputs. for exomes something like: --intervals chr1 --intervals target.intervals_list

I was hoping to actually change the intervals(string[][]) input to scatter_intervals to better define what the input actually does.

One downside for this approach is it would kind of duplicate work for the exome workflow. The subworkflow/germline_filter_vcf.cwl will still run the select variants step. This shouldn't actual remove any variants. The same interval list files would be used for both tools.

I tested this out on a small test set. Some of the scattered jobs kept failing during the haplotype_caller step. I figured out that the it was only for specific scatterd jobs. The issue was that the input interval_list file would not have any regions for some chromosomes. so when the scatter step 8 produced --intervals chr8 --intervals no_chr8_intervals.interval_list leading to error:

A USER ERROR has occurred: Argument -L, --interval-set-rule has a bad value: [chr8,expanded.interval_list],INTERSECTION. The specified intervals had an empty intersection

I always thought I would get back into creating separate exome/wgs detect variants subworkflows that ran the necessary steps. Just never got enough time to do it. After our discussions today I do believe we should take another approach. We can change the way germline_detect_variants.cwl runs. Provide 2 interval_list files. First being a scatter interval list, to run scatter jobs during the haplotype caller step. Second being the region_of_interest. The proposed updated inputs would be wgs input to detect variants: scatter_intervals_list: 1-22,X,Y(as a interval_list file) region_of_interest: variant_reporting_intervals interval list file

exome input to detect variants: scatter_intervals_list: target_intervals interval list file region_of_interest: target_intervals interval list file

We would have to add a step to split the input scatter_intervals_list file to scatter_count many files for actual job scattering. Also update or add a new subworkflows/gatk_haplotypecaller_iterator.cwl that scatters over an array of files instead of array of strings. This would not use the --interval-set-rule option. This is because each job should only get a single --intervals split-$n.interval_list input.

This will also allow us to make another pr to easily add a interval_list_padding step