ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Variant data model should be conceptual, not defined in terms of VCF file format. #379

Open diekhans opened 9 years ago

diekhans commented 9 years ago

The variant data model is described in terms of VCF, as opposed to a clear conceptual model that is then related to VCF.

For instance, the statement 'The variant set is equivalent to a VCF file.' leads to the interpretation that a VCF split by chromosome should be multiple variant sets.

Transliterating the VCF format into JSON, as opposed to the VCF conceptual model, has led to a more complex and confusing API.

jacmarjorie commented 9 years ago

+1

Another confusing aspect that has come out of this is that a GAVariant has a String variantSetId, which the documentation describes as the "ID of the variant set that this variant belongs to". If we think of the GAVariantsets, GAVariants, etc. in terms of the abstract matrix representation, this implies that if two callsets had the exact same variant but originated from different variantSets, the variant would exist as a duplicate, with a different set of calls returned for each by the searchVariants function. If we want a variant object to represent a unique variant, this becomes troublesome and leads to the development of database workaround functions, much like the mergeVariants function (https://cloud.google.com/genomics/v1beta2/reference/variantsets/mergeVariants).

To us, the String variantSetId on a GAVariant seemed like a consequence of designing the schema around a file hierarchy. Regardless, it is then left to the database to track which variants are identical and merge their calls into one variant when returning a response, which undercuts some of the advantages of a columnar store.

This could potentially be its own issue, but it could also be fixed by a new conceptual design, so I'd suggest taking it into consideration when designing a more conceptual model.
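A minimal sketch of the duplication described above, in Python. The classes are hypothetical simplifications loosely modeled on GAVariant and its calls, not the actual schema, and `merge_by_site` is an illustrative stand-in for the kind of workaround a backend would need, not the mergeVariants implementation. It assumes two variants are "the same" when reference name, start, reference bases, and alternate bases all match.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Call:
    # Hypothetical, simplified stand-in for a call on one call set.
    call_set_id: str
    genotype: List[int]


@dataclass
class Variant:
    # Hypothetical, simplified stand-in for GAVariant.
    variant_set_id: str
    reference_name: str
    start: int
    reference_bases: str
    alternate_bases: Tuple[str, ...]
    calls: List[Call] = field(default_factory=list)


SiteKey = Tuple[str, int, str, Tuple[str, ...]]


def site_key(v: Variant) -> SiteKey:
    # Assumed identity of a variant, independent of its variant set.
    return (v.reference_name, v.start, v.reference_bases, v.alternate_bases)


def merge_by_site(variants: List[Variant]) -> Dict[SiteKey, List[Call]]:
    # Collect the calls of all variants that share the same site key.
    merged: Dict[SiteKey, List[Call]] = defaultdict(list)
    for v in variants:
        merged[site_key(v)].extend(v.calls)
    return dict(merged)


# The same site reported once per variant set, each with its own calls.
variants = [
    Variant("vs1", "1", 1000, "A", ("G",), [Call("sampleA", [0, 1])]),
    Variant("vs2", "1", 1000, "A", ("G",), [Call("sampleB", [1, 1])]),
]
merged = merge_by_site(variants)
```

Here the two input records collapse into a single entry whose calls come from both variant sets, which is the merge step the current model leaves to the database.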

dglazer commented 9 years ago

@diekhans , are you suggesting in this issue that we improve the documentation in https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/variants.avdl, that we change the schema, or both? (Personally I agree the doc can use some work; I don't know of reasons to change the schema, but I'm open to suggestions.) Re doc, we might be able to borrow words from the Variants section of http://ga4gh.org/#/documentation, which is much less file-centric.

@jacmarjorie , I think you're suggesting that the API should be able to support a world where all calls from all samples are retrievable by a single searchVariants request -- is that right? If so, I think today's API already supports that, by letting you load all of your data into a single variantset. But it also lets you have different populations / studies, that are separately searched, if you want. Or am I misunderstanding?

diekhans commented 9 years ago

Hi @dglazer,

This is a documentation ticket. It has the documentation label set, so the bioinformatician who is updating the documentation will work on it when we get more bandwidth. She found the variant API perplexing.

We really need to take a careful look at the variant API. It carries baggage from a difficult, compromise file format, which will be a burden going forward. In particular, the fact that a variant can contain thousands of supporting calls is not a forward-looking data model. It will not compose well or work well with caching, and it is mismatched to query languages.

A good rethinking when adding structural variation would be well advised, as we are going to have to live with this for a long time.
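As a rough illustration of the alternative hinted at here, the Python sketch below keeps calls as separate records that reference a variant by ID, so a variant record stays small and its calls can be fetched in pages. All names (`VariantSite`, `CallRecord`, `calls_for_variant`, `page_size`) are hypothetical and not part of the GA4GH schema; this is one possible shape, not a proposal anyone in this thread has made.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class VariantSite:
    # Hypothetical variant record that carries no nested calls.
    variant_id: str
    reference_name: str
    start: int
    reference_bases: str
    alternate_bases: Tuple[str, ...]


@dataclass(frozen=True)
class CallRecord:
    # Hypothetical call record that points back to its variant by ID.
    variant_id: str
    call_set_id: str
    genotype: Tuple[int, ...]


def calls_for_variant(
    calls: Iterable[CallRecord], variant_id: str, page_size: int
) -> Iterator[List[CallRecord]]:
    # Yield the calls for one variant in pages of at most page_size records.
    page: List[CallRecord] = []
    for call in calls:
        if call.variant_id != variant_id:
            continue
        page.append(call)
        if len(page) == page_size:
            yield page
            page = []
    if page:
        yield page


site = VariantSite("v1", "1", 1000, "A", ("G",))
calls = [CallRecord("v1", f"sample{i}", (0, 1)) for i in range(5)]
pages = list(calls_for_variant(calls, site.variant_id, page_size=2))
```

With calls stored this way, the variant record has a fixed size however many samples are called, and a client can page through or cache the calls independently of the variant.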


dglazer commented 9 years ago

Thanks @diekhans -- I missed the label. Happy to look at a doc pull request when ready; feel free to borrow from the ga4gh.org site, and/or from Google's documentation.

(And re rethinking the API itself -- sounds like that's a topic for another thread.)