Closed ryan-williams closed 7 years ago
`GenotypeAllele` currently supports `OtherAlt`, but it's not clear to me when this would ever get selected, as `Variant` looks like it only supports documenting a single alternate allele. An alternative method would be for `Variant` to store all possible alleles at a given locus, but maybe that would be worse?
@ryan-williams @arahuja @laserson Ah! Yes, this one! So we made a decision in https://github.com/bigdatagenomics/adam/pull/157 to move to a biallelic model for variants, unlike VCF. The final PR to make this work was https://github.com/bigdatagenomics/adam/pull/222. I spoke with Richard Durbin about this a few months ago; he dubbed this "unary variants", and really liked the model.
The goal here was that multi-allelic sites would be expressed by two different variant/genotype pairs. E.g., if I have a sample that is het A/T and the reference is C, I then emit two variants/genotypes, where one is variant: C/A, genotype: A/OtherAlt, and the other is variant: C/T, genotype: OtherAlt/T. In the germline case, this split clarifies a lot of the statistics that are collected (e.g., genotype likelihoods, alt depth, etc.).
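To make the split concrete, here is a minimal sketch in Python rather than ADAM's own code. The `Variant` dataclass, the `GenotypeAllele` enum, and `split_site` below are hypothetical stand-ins written for illustration; they are not ADAM's schema or API. The sketch only assumes that each genotype slot is labelled relative to the single alternate allele of its variant, with `OtherAlt` marking an allele that belongs to a different variant at the same locus.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class GenotypeAllele(Enum):
    # Hypothetical mirror of the allele states discussed in this thread.
    REF = "Ref"
    ALT = "Alt"
    OTHER_ALT = "OtherAlt"


@dataclass(frozen=True)
class Variant:
    # A biallelic variant: exactly one reference and one alternate allele.
    reference_allele: str
    alternate_allele: str


def split_site(
    reference: str, sample_alleles: List[str]
) -> List[Tuple[Variant, List[GenotypeAllele]]]:
    """Emit one (variant, genotype) pair per distinct non-reference allele."""
    # Collect distinct alternate alleles in the order the sample carries them.
    alts: List[str] = []
    for allele in sample_alleles:
        if allele != reference and allele not in alts:
            alts.append(allele)

    pairs = []
    for alt in alts:
        genotype = []
        for allele in sample_alleles:
            if allele == reference:
                genotype.append(GenotypeAllele.REF)
            elif allele == alt:
                genotype.append(GenotypeAllele.ALT)
            else:
                # A non-reference allele described by a sibling variant.
                genotype.append(GenotypeAllele.OTHER_ALT)
        pairs.append((Variant(reference, alt), genotype))
    return pairs


if __name__ == "__main__":
    # Reference is C and the sample is het A/T, as in the example above.
    for variant, genotype in split_site("C", ["A", "T"]):
        labels = "/".join(g.value for g in genotype)
        print(f"{variant.reference_allele}/{variant.alternate_allele}: {labels}")
```

With this labelling, a plain het C/T sample yields a single pair, so `OtherAlt` only ever appears at sites where the sample carries two different non-reference alleles.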
Either @mlinderm or @nealsid was the brains behind this move. CCing them for comments if they're free.
I'm not sure how this impacts somatic variant calling; it makes the statistic collection a lot less ambiguous for germline VC. @tdanford and I will be looking at somatic VC soon, so I'd be interested to hear your thoughts, @ryan-williams and @arahuja.
Ping for an update. Is there anything here to consider for #108?
I don't think there is. I'd be fine closing this.
Closing as WontFix
@arahuja implied there had been some discussion around this in the past.
AFAICT there is no good way right now to capture there being two non-reference alleles at one locus, having one `Variant` per `Genotype`. In the immediate term I will work around this by emitting two lines / `Variant`s in my VCFs, but I'm curious whether we should support the other way here. Thanks!