bigdatagenomics / bdg-formats

Open source formats for scalable genomic processing systems using Avro. Apache 2 licensed.

Should `Genotype` allow for multiple `Variant`s / alleles? #33

Closed ryan-williams closed 7 years ago

ryan-williams commented 10 years ago

@arahuja implied there had been some discussion around this in the past.

AFAICT, with one Variant per Genotype, there is currently no good way to represent two non-reference alleles at one locus.

In the immediate term I will work around this by emitting two lines / Variants in my VCFs, but I'm curious whether we should support the other way here. Thanks!

laserson commented 10 years ago

GenotypeAllele currently supports OtherAlt, but it's not clear to me when this would ever get selected, as Variant looks like it only supports documenting a single alternate allele. An alternative method would be for Variant to store all possible alleles at a given locus, but maybe that would be worse?

fnothaft commented 10 years ago

@ryan-williams @arahuja @laserson Ah! Yes, this one! So we made a decision in https://github.com/bigdatagenomics/adam/pull/157 to move to a biallelic model for variants, unlike VCF. The final PR to make this work was https://github.com/bigdatagenomics/adam/pull/222. I spoke with Richard Durbin about this a few months ago; he dubbed this "unary variants", and really liked the model.

The goal here was that multi-allelic sites would be expressed as two different variant/genotype pairs. E.g., if I have a sample that is het A/T and the reference is C, I then emit two variants/genotypes, where one is variant: C/A, genotype: A/OtherAlt and the other is variant: C/T, genotype: OtherAlt/T. In the germline case, this split clarifies a lot of the statistics that are collected (e.g., genotype likelihoods, alt depth, etc.).
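For illustration, here is a minimal Python sketch of that split. It is not the ADAM or bdg-formats API: the `Variant`, `Genotype`, and `split_site` names and their fields are hypothetical, and the `GenotypeAllele` values are assumed from the names used in this thread (`OtherAlt` is mentioned above; `Ref` and `Alt` are assumptions).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class GenotypeAllele(Enum):
    # Illustrative stand-in for the Avro enum; value names are assumed.
    REF = "Ref"
    ALT = "Alt"
    OTHER_ALT = "OtherAlt"


@dataclass(frozen=True)
class Variant:
    # Hypothetical biallelic variant: exactly one alternate allele.
    reference_allele: str
    alternate_allele: str


@dataclass(frozen=True)
class Genotype:
    # Hypothetical genotype: one allele state per chromosome copy.
    variant: Variant
    alleles: Tuple[GenotypeAllele, ...]


def split_site(reference: str, called: List[str]) -> List[Genotype]:
    """Emit one Genotype per distinct non-reference allele called in the sample."""
    # Collect distinct alternate alleles, preserving the order they were called in.
    alts: List[str] = []
    for base in called:
        if base != reference and base not in alts:
            alts.append(base)

    genotypes = []
    for alt in alts:
        # Relative to this variant, every called base is Ref, Alt, or some other alt.
        states = tuple(
            GenotypeAllele.REF if base == reference
            else GenotypeAllele.ALT if base == alt
            else GenotypeAllele.OTHER_ALT
            for base in called
        )
        genotypes.append(Genotype(Variant(reference, alt), states))
    return genotypes


# The het A/T sample over reference C from the example above.
for g in split_site("C", ["A", "T"]):
    print(g.variant.reference_allele, g.variant.alternate_allele,
          [a.value for a in g.alleles])
```

Under these assumptions, the het A/T sample yields two pairs: C/A with states Alt/OtherAlt, and C/T with states OtherAlt/Alt. A homozygous-reference call yields no pairs at all, which may or may not be what a real caller wants.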

Either @mlinderm or @nealsid was the brains behind this move. CCing them for comments if they're free.

I'm not sure how this impacts somatic variant calling; it makes the statistic collection a lot less ambiguous for germline VC. @tdanford and I will be looking at somatic VC soon, so I'd be interested to hear your thoughts, @ryan-williams and @arahuja.

heuermh commented 7 years ago

Ping for an update. Is there anything here to consider for #108?

fnothaft commented 7 years ago

I don't think there is. I'd be fine closing this.

heuermh commented 7 years ago

Closing as WontFix