broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

GATK should have a tool (and backing method) that correctly splits multiallelic VariantContexts #4976

Open LeeTL1220 opened 6 years ago

LeeTL1220 commented 6 years ago

Feature request

Tool(s) or class(es) involved

Many....

Description

Currently, it is very difficult to split a multiallelic VariantContext. Several components in GATK perform pieces of this operation, but there is no unified method or tool.

This issue is to create a method (callable from any GATK tool) that splits a multiallelic VariantContext into biallelic VariantContexts, as well as an example tool that we can use to create a completely biallelic VCF from a VCF containing multiallelic sites.


Example

Header:

INFO: AC_AMR, Number="A"
INFO: DP_CNT, Number="R"
INFO: NOTES, Type="String"
FORMAT: GT Genotype

SAMPLE1 SAMPLE2

Input: VariantContext

SAMPLE1 GT 0/1 SAMPLE2 GT 0/2

Output: a list of VariantContexts whose size equals the input's ALT allele count.

VariantContext1

VariantContext2
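A minimal Python sketch of the splitting described in the Example, not GATK code. The allele and annotation values are hypothetical, since the issue only names the fields, and `split_multiallelic` with its dictionary layout is an illustrative name. It assumes that an allele belonging to a different ALT becomes a no-call (`None`), which is only one possible answer to how genotypes should be remapped, and it does not trim alleles to a minimal representation.

```python
def split_multiallelic(ref, alts, info, info_number, genotypes):
    """Return one biallelic record (a plain dict) per ALT allele."""
    records = []
    for i, alt in enumerate(alts, start=1):  # i = allele index in the input record
        new_info = {}
        for key, value in info.items():
            number = info_number.get(key)
            if number == "A":    # one value per ALT allele
                new_info[key] = [value[i - 1]]
            elif number == "R":  # one value per allele, REF first
                new_info[key] = [value[0], value[i]]
            else:                # not per-allele: copy unchanged
                new_info[key] = value
        new_gts = {}
        for sample, gt in genotypes.items():
            # Assumption: REF stays 0, the kept ALT becomes 1, anything else is a no-call.
            new_gts[sample] = tuple(
                0 if a == 0 else 1 if a == i else None for a in gt
            )
        records.append({"ref": ref, "alt": alt, "info": new_info, "genotypes": new_gts})
    return records


# Hypothetical values for the fields named in the Example.
records = split_multiallelic(
    ref="A",
    alts=["C", "T"],
    info={"AC_AMR": [3, 5], "DP_CNT": [10, 4, 6], "NOTES": "x"},
    info_number={"AC_AMR": "A", "DP_CNT": "R"},
    genotypes={"SAMPLE1": (0, 1), "SAMPLE2": (0, 2)},
)
```

With these inputs the list has two records, one per ALT allele. SAMPLE2's 0/2 becomes 0/. in the first record and 0/1 in the second.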

Notes

See GATKVariantContextUtils and AlleleSubsettingUtils for pieces of this.

PL may be a special case, but there is code in the GATK for splitting this.
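For diploid samples, the VCF specification orders PL values so that genotype j/k (with j <= k) sits at index k(k+1)/2 + j. Below is a hedged sketch of one way to subset PLs for a single ALT: keep the three entries that involve only REF and that ALT, then shift them so the smallest is 0. The function names and PL values are hypothetical, and this is not necessarily what the existing GATK code does.

```python
def pl_index(j, k):
    """Index of diploid genotype j/k in the VCF PL ordering."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


def subset_pls(pls, alt_index):
    """Keep the PLs for 0/0, 0/alt and alt/alt, renormalized so the minimum is 0."""
    picked = [
        pls[pl_index(0, 0)],
        pls[pl_index(0, alt_index)],
        pls[pl_index(alt_index, alt_index)],
    ]
    floor = min(picked)
    return [p - floor for p in picked]


# Hypothetical PLs for two ALT alleles, in the order 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.
pls = [50, 0, 60, 40, 70, 90]
first = subset_pls(pls, 1)   # [50, 0, 60]
second = subset_pls(pls, 2)  # [10, 0, 50]
```

The 1/2 entry (70 here) has no place in either biallelic record, which is one reason het-non-ref genotypes are awkward to split.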

Open questions (so far): How should we split when a sample GT is not 0/*? In the example above, how should we split if the SAMPLE2 GT were 1/2?

ldgauthier commented 6 years ago

This would go a long way towards solving #4795.

ldgauthier commented 6 years ago

I just stumbled across a het-non-ref (e.g. 1/2) splitting approach. There's a caller that represents overlapping ALT alleles with two sites, one with 1|. and the other .|1. I don't love this, but there's not going to be an elegant solution. I also have no idea what to do with the PLs in that scenario.
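The half-call representation described above can be sketched as follows. This is an illustration only, not the caller's actual code: `half_call_gts` and `merge_half_calls` are hypothetical names, and the sketch assumes that the position within the GT string is what ties each allele back to its original haplotype slot. It handles GT only and says nothing about PLs.

```python
def half_call_gts(gt, n_alts):
    """Render one GT string per ALT: REF stays 0, the kept ALT is 1, others are '.'."""
    out = []
    for i in range(1, n_alts + 1):
        calls = ["0" if a == 0 else "1" if a == i else "." for a in gt]
        out.append("|".join(calls))
    return out


def merge_half_calls(split_gts):
    """Rebuild the original allele indices from the per-ALT GT strings."""
    merged = []
    # Each 'position' holds the calls from every split record for one haplotype slot.
    for position in zip(*(g.split("|") for g in split_gts)):
        if "1" in position:
            merged.append(position.index("1") + 1)  # record order gives the ALT index
        elif all(c == "0" for c in position):
            merged.append(0)
        else:
            merged.append(None)  # no-call in the original genotype
    return tuple(merged)


het_non_ref = half_call_gts((1, 2), n_alts=2)  # ["1|.", ".|1"]
round_trip = merge_half_calls(het_non_ref)     # (1, 2)
```

Under these assumptions the pair of records carries enough information to recover the original 1/2 genotype.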

droazen commented 6 years ago

Assigning to @jamesemery to implement. This tool should be a VariantWalker.

jamesemery commented 6 years ago

@LeeTL1220 Having started to implement this, I have a number of design questions that would be informed by your use cases.

Firstly, is there a reason to preserve symbolic alleles? It seems as though spanning deletions could/should be dropped, since in most cases another variant context elsewhere in your file represents that deletion. Should there be validation around dropping spanning-deletion symbolic alleles, to ensure we aren't dropping a spanning deletion that isn't represented anywhere else? What about no-calls?

Your example suggests that we rely on the header line counts for subsetting annotations. If there is a disagreement with the header, do you want any behavior more sophisticated than just throwing? My understanding is that we are lenient with splitting in htsjdk, and there have been some mislabeled header lines in the past, which would make this an expected state. Furthermore, most allele-specific annotations are of type String because there is no standard for "|" delimiters, which makes them hard to handle properly. @ldgauthier do you have any suggestions as to how to detect and handle allele-specific annotations?
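To make the header-disagreement question concrete, here is a small Python sketch of the two behaviors being weighed: throwing on a count mismatch, or leniently passing the annotation through unchanged. The function name, the `strict` flag, and the values are all hypothetical, and the sketch does not attempt to detect "|"-delimited allele-specific strings.

```python
def subset_annotation(key, values, number, n_alts, alt_index, strict=True):
    """Subset one annotation for the ALT at alt_index (1-based) using its header Number."""
    expected = {"A": n_alts, "R": n_alts + 1}.get(number)
    if expected is None:
        return values  # header does not declare a per-allele count: copy unchanged
    if len(values) != expected:
        if strict:
            raise ValueError(
                f"{key}: header Number={number} implies {expected} values, "
                f"found {len(values)}"
            )
        return values  # lenient mode: keep the mislabeled annotation as-is
    if number == "A":
        return [values[alt_index - 1]]
    return [values[0], values[alt_index]]  # Number=R: REF value plus the kept ALT


ok_a = subset_annotation("AC_AMR", [3, 5], "A", n_alts=2, alt_index=2)
ok_r = subset_annotation("DP_CNT", [10, 4, 6], "R", n_alts=2, alt_index=1)
lenient = subset_annotation("AC_AMR", [3], "A", n_alts=2, alt_index=2, strict=False)
```

Lenient mode keeps the output usable when a header line is mislabeled, at the cost of leaving a value in the biallelic record whose count no longer matches its header.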

LeeTL1220 commented 6 years ago

@jamesemery Set up a meeting and we can go into more detail.