Open LeeTL1220 opened 6 years ago
This would go a long way toward solving #4795.
I just stumbled across a het-non-ref (e.g. 1/2) splitting approach. There's a caller that represents overlapping ALT alleles with two sites, one with GT 1|. and the other with GT .|1. I don't love this, but there's not going to be an elegant solution. I also have no idea what to do with the PLs in that scenario.
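A minimal Python sketch of the two-site convention described above, for illustration only. It is not GATK code, and the function names are hypothetical. For the site that keeps one ALT allele, each allele index in a sample's GT is remapped: reference stays 0, the kept ALT becomes 1, and any other ALT becomes a no-call (.). PLs are deliberately not handled.

```python
def split_genotype(gt, alt_index):
    """Rewrite one sample's GT for the biallelic site that keeps only alt_index.

    gt is a VCF genotype string such as "0/1" or "1|2"; alt_index is 1-based.
    """
    sep = "|" if "|" in gt else "/"      # preserve phasing as written
    out = []
    for allele in gt.split(sep):
        if allele == ".":
            out.append(".")              # already a no-call
        elif int(allele) == 0:
            out.append("0")              # reference stays reference
        elif int(allele) == alt_index:
            out.append("1")              # the kept ALT becomes allele 1
        else:
            out.append(".")              # ALT that belongs to another split site
    return sep.join(out)


def split_genotypes(gt, n_alts):
    """Return the rewritten GT for each of the n_alts biallelic sites, in order."""
    return [split_genotype(gt, k) for k in range(1, n_alts + 1)]


if __name__ == "__main__":
    for gt in ("0/1", "0/2", "1|2"):
        print(gt, "->", split_genotypes(gt, 2))
```

Under this convention a 1|2 genotype over two ALT alleles yields 1|. at the first site and .|1 at the second. Whether a dropped ALT should become a no-call, a reference call, or something else is exactly the open design question, so treat the mapping as one assumption among several possible ones.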
Assigning to @jamesemery to implement. This tool should be a VariantWalker.
@LeeTL1220 Having started to implement this, I have a number of design questions that would be informed by your use cases.
Firstly, is there a reason to preserve symbolic alleles? It seems as though spanning deletions could or should be dropped, since in most cases another variant context elsewhere in your file represents that deletion. Should there be validation around dropping spanning-deletion symbolic alleles, to ensure we aren't dropping a spanning deletion that isn't represented anywhere else? What about no-calls?
Your example suggests that we rely on the header line counts for subsetting annotations. If there is a disagreement with the header, do you want any behavior more sophisticated than just throwing? My understanding is that we are lenient with splitting in htsjdk, and there have been some mislabeled header lines in the past, which would make this an expected state. Furthermore, most allele-specific annotations are of type String because there is no standard for "|" delimiters, which makes them hard to handle properly. @ldgauthier do you have any suggestions as to how to detect and handle allele-specific annotations?
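The header-count question above can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not htsjdk or GATK behavior: `expected_count` and `check_counts` are hypothetical names, the header is reduced to a mapping from field name to its VCF Number code, and values are assumed to be already split into lists. A `strict` switch chooses between throwing and merely reporting the mismatches.

```python
def expected_count(number, n_alts):
    """How many values a field should carry, or None when the count is not checkable here."""
    if number == "A":
        return n_alts            # one value per ALT allele
    if number == "R":
        return n_alts + 1        # one value per allele, REF included
    if number in (".", "G"):
        return None              # unbounded, or ploidy-dependent (not modelled in this sketch)
    return int(number)           # fixed count such as "1"


def check_counts(numbers, info, n_alts, strict=True):
    """Compare each INFO field's value count with its header Number.

    numbers -- field name -> Number code taken from the header
    info    -- field name -> list of values on the record
    Returns a list of problem descriptions; raises ValueError instead when strict.
    """
    problems = []
    for key, values in info.items():
        if key not in numbers:
            problems.append(f"{key}: no header line")
            continue
        want = expected_count(numbers[key], n_alts)
        if want is not None and len(values) != want:
            problems.append(
                f"{key}: header Number={numbers[key]} implies {want} values, "
                f"found {len(values)}"
            )
    if strict and problems:
        raise ValueError("; ".join(problems))
    return problems
```

Calling `check_counts(..., strict=False)` would let a tool warn about a mislabeled header line and copy the field through unsplit, while the default throws. Which of the two is appropriate is the design decision being asked about.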
@jamesemery Set up a meeting and we can go into more detail.
Feature request
Tool(s) or class(es) involved
Many....
Description
Currently, it is very difficult to split multiallelic VariantContexts. There are components in GATK that do pieces of this operation, but no unified method or tool.
This issue is to create a method (callable from any GATK tool) that will split the VariantContexts, as well as an example tool that we can use to create a completely biallelic VCF from a VCF with multiallelics.
Example
Header:
INFO: AC_AMR, Number="A"
INFO: DP_CNT, Number="R"
INFO: NOTES, Type="String"
FORMAT: GT Genotype
SAMPLE1 SAMPLE2
Input: VariantContext
Alleles: "ACCAGGCCCAGCTCATGCTTCTTTGCAGCCTCT*" "TCCAGGCCCAGCTCATGCTTCTTTGCAGCCTCT" "A"
Attributes: AC_AMR=10,50 DP_HIST=20,30,50 NOTES="Foo,Bar,Baz"
Genotypes: SAMPLE1 GT 0/1, SAMPLE2 GT 0/2
Output: a list whose size equals the input's ALT allele count (two in this example).
VariantContext1
VariantContext2
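As a rough illustration of the per-allele subsetting this example implies, here is a minimal Python sketch. It is not the proposed GATK method and not the issue author's intended output: `split_info` and `split_site` are hypothetical names, allele trimming and genotype handling are omitted, fields without a Number code are copied unchanged, and the Number="R" header line is assumed to describe the DP_HIST attribute.

```python
def split_info(info, numbers, n_alts):
    """Return one INFO dict per ALT allele.

    info    -- field name -> list of values from the multiallelic record
    numbers -- field name -> VCF Number code ("A", "R", ...) from the header
    """
    per_alt = []
    for k in range(n_alts):                            # k is the 0-based ALT index
        out = {}
        for key, values in info.items():
            number = numbers.get(key)
            if number == "A":
                out[key] = [values[k]]                 # one value per ALT
            elif number == "R":
                out[key] = [values[0], values[k + 1]]  # REF value plus this ALT's value
            else:
                out[key] = list(values)                # no per-allele rule: copy unchanged
        per_alt.append(out)
    return per_alt


def split_site(ref, alts, info, numbers):
    """Pair each ALT allele with the REF allele and its subset of INFO values."""
    infos = split_info(info, numbers, len(alts))
    return [{"ref": ref, "alt": alt, "info": i} for alt, i in zip(alts, infos)]


if __name__ == "__main__":
    ref = "ACCAGGCCCAGCTCATGCTTCTTTGCAGCCTCT"
    alts = ["TCCAGGCCCAGCTCATGCTTCTTTGCAGCCTCT", "A"]
    info = {"AC_AMR": [10, 50], "DP_HIST": [20, 30, 50], "NOTES": ["Foo", "Bar", "Baz"]}
    numbers = {"AC_AMR": "A", "DP_HIST": "R"}          # assumption: R applies to DP_HIST
    for site in split_site(ref, alts, info, numbers):
        print(site["alt"], site["info"])
```

With these assumptions, the first site keeps AC_AMR=10 and DP_HIST=20,30, and the second keeps AC_AMR=50 and DP_HIST=20,50. A real implementation would also need to trim the alleles (the first ALT differs from REF only in its first base) and decide what to do with the genotypes.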
Notes
See GATKVariantContextUtils and AlleleSubsettingUtils for pieces of this.
PL may be a special case, but there is code in the GATK for splitting this.
Open questions (so far):
How to split when a sample GT is not 0/*... In the above example, how to split if SAMPLE2 GT was 1/2?