ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Multiallelic, Binary, or Unary Variants for 1.0? #438

Open adamnovak opened 9 years ago

adamnovak commented 9 years ago

As per @richarddurbin's discussion today, which of these models do we want to use for representing variants?

Right now, we (and VCF) have a multiallelic model. Many databases, however, like Genomics England, seem to be using a binary model, where multiallelic variants are broken up into several ref/alt or ref/alt/other records, with incompatibility being implied by the reference intervals given, or perhaps specified by annotation. There have also been proposals from the graph end of things to adopt a unary model, where every allele is called with a count, and where alleles are all annotated with incompatibility information.

Which of these do we want to use for our 1.0 release?
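To make the three options concrete, here is a minimal sketch of how one site with reference `A` and alternates `C` and `G` might look under each model, for a sample carrying one copy of `C` and one of `G`. The class and field names (`MultiallelicVariant`, `BinaryVariant`, `UnaryAlleleCall`, `incompatible_with`, and so on) are hypothetical and chosen for illustration; they are not the GA4GH schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MultiallelicVariant:
    """One record per site; all alternates share the record (VCF-like)."""
    start: int
    reference_bases: str
    alternate_bases: List[str]
    genotype: Tuple[int, ...]  # 0 = reference, i = alternate_bases[i - 1]


@dataclass
class BinaryVariant:
    """One record per alternate allele; other alternates collapse to 'other'."""
    start: int
    reference_bases: str
    alternate_bases: str
    genotype: Tuple[str, ...]  # each entry is "ref", "alt", or "other"


@dataclass
class UnaryAlleleCall:
    """One record per allele, called with a count plus incompatibility info."""
    start: int
    bases: str
    count: int
    incompatible_with: List[str]  # hypothetical annotation field


# Multiallelic: a single record, genotype C/G.
site = MultiallelicVariant(100, "A", ["C", "G"], (1, 2))

# Binary: the same call split into one ref/alt/other record per alternate.
binary = [
    BinaryVariant(100, "A", "C", ("alt", "other")),
    BinaryVariant(100, "A", "G", ("other", "alt")),
]

# Unary: every allele (reference included) gets its own count.
unary = [
    UnaryAlleleCall(100, "A", 0, ["C", "G"]),
    UnaryAlleleCall(100, "C", 1, ["A", "G"]),
    UnaryAlleleCall(100, "G", 1, ["A", "C"]),
]
```

In this sketch the binary records rely on the shared `start` and `reference_bases` to imply incompatibility, while the unary records state it explicitly.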

richarddurbin commented 9 years ago

I see the key step as going from a Variant record per set of alternate alleles, which are implicitly mutually incompatible, to a Variant record for each alternate allele. Whether it is binary, carrying information about the genotypes with reference and other, or unary, just carrying information about the presence or absence of the alternate allele, is not so relevant.

I think that the top level things should be the alternate alleles not sites. So I want to move to the allelic representation, in the first instance binary.

Apologies that I did not articulate this well.

Richard
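A rough sketch of the step described above, assuming genotypes are stored as VCF-style allele indices: each alternate allele gets its own record, and calls for any other alternate become `"other"`. The function names and dictionary keys are hypothetical, not part of any GA4GH API.

```python
from typing import Dict, List, Sequence


def split_multiallelic(ref: str, alts: Sequence[str],
                       genotype: Sequence[int]) -> List[Dict]:
    """Turn one multiallelic record into one binary record per alternate."""
    records = []
    for alt_index, alt in enumerate(alts, start=1):
        calls = []
        for allele in genotype:
            if allele == 0:
                calls.append("ref")
            elif allele == alt_index:
                calls.append("alt")
            else:
                # A different alternate at the same site.
                calls.append("other")
        records.append({"ref": ref, "alt": alt, "calls": calls})
    return records


def to_unary(binary_record: Dict) -> Dict:
    """Keep only the count of the alternate allele (presence or absence)."""
    return {"alt": binary_record["alt"],
            "count": binary_record["calls"].count("alt")}


# A C/G genotype at a site with reference A and alternates C and G.
records = split_multiallelic("A", ["C", "G"], [1, 2])
```

Going from binary to unary only discards the ref/other distinction, which is one way to read the point that the choice between the two matters less than moving to per-allele records.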

diekhans commented 9 years ago

It's very important to prototype any change in the model before codifying it in the API.

haussler commented 9 years ago

This is an extremely important one. We can't let its scope totally scare us off. Richard, if you had somebody to help you on your end, I think we could walk through this carefully, without having it drop if you get pulled away for a time in the middle of the process.

I totally agree, the top level things should be the alternate alleles not sites. We had no dissent on that basic point at the NYGC meeting. -D

david4096 commented 7 years ago

Related https://github.com/ga4gh/schemas/issues/754