ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0
214 stars 114 forks source link

Variant annotation update (transcript effects endpoint) #706

Closed david4096 closed 1 year ago

david4096 commented 8 years ago

The current variant annotation message structure faces an issue seen in other parts of the API (readgroupsets, variants). Messages returned from a search request may be modified during a query, causing the same identifier to point to two unique documents. This can cause problems in interchange. It also creates a complex nested document that can be difficult to work with in practice.

To solve this, another endpoint has been added for filtering transcript effects. This allows transcript effects to be filtered and operated on directly, as opposed to through the modification of Variant Annotation messages at query time. It also provides a more natural query interface by separating concerns from the SearchVariantAnnotationsRequest.

Another problem facing the Variant Annotation protocol was the difficulty in finding an annotation for a given variant. It isn't possible to directly get an annotation for a given variant, which has caused clients to have to search for similar ranges for a given variant in an annotation set. The addition of the variant_id to the SearchVariantAnnotationsRequest makes it possible to find annotations directly, when a variant_id is provided the range is ignored.

By request from the VATT, filtering annotations by allele frequency has been added. This was done by adding an allele_frequency field to the VariantAnnotation message and a frequency_threshold to the SearchVariantAnnotationsRequest. @reece @sarahhunt @kusdhill

sarahhunt commented 8 years ago

@david4096 - thanks for putting back the search by variant and feature options!

Directly querying TranscriptEffects makes sense for feature id and location, but I’m not sure about searching by variant_annotation_id. The only way to extract a variant_annotation_id is by looking up the VariantAnnotation record which contains the TranscriptEffects anyway. Searching by variant id would be more useful, though we don’t have variant id in TranscriptEffect.

Similarly, if you find all TranscriptEffects for a gene, you then need to look up all VariantAnnotation records to find the variant id, picking up all the TranscriptEffects again.

The desire to not modify the variant annotation record may be better met by not having the record to group results by variant and leaving the client to handle this. What do you think?

david4096 commented 8 years ago

Thanks @sarahhunt excellent points! I have removed the transcript_effects field from the VariantAnnotation message and added a variant_id to the TranscriptEffect message. This will allow one to easily move back and forth between the protocols.

david4096 commented 7 years ago

Helps with https://github.com/ga4gh/server/issues/308

sarahhunt commented 7 years ago

@david4096 - I've been meaning to get back to this, sorry for the delay. Looking at the proposed changes, I am doubting the value of the variantAnnotation record and thinking it would be cleaner to just query for TranscriptEffects directly based on variant id, transcript id or location. What do you think?

david4096 commented 7 years ago

Thanks @sarahhunt !

The main reason for keeping the VariantAnnotation record are for "variant level" annotations, which for the time being is Allele Frequency. Perhaps we could move those features to the Variants API instead?

Can you tell me if I'm mistaken, but in practice if you have a Features API that allows you to get transcript effect IDs, can't you use that to construct a range request? I have been thinking that feature ID filtering of an endpoint might be fulfilled already by that access pattern. Like, if I want to find transcript effects for some ID that I have, I request that from the features API with a GetFeatureRequest, then use the returned location data to construct a range request against /transcripteffects.

Thanks for taking a look! I'm open to deprecating the Variant Annotation message. I think that with something like https://github.com/ga4gh/schemas/pull/700 it might be possible to think about how to construct requests against typed annotations. For now, I think the most important point thing is to remove the modification at query time of transcript effect messages.

sarahhunt commented 7 years ago

Hi @david4096!

We are currently using the VariantAnnotation record for allele frequency and ClinVar clinical significance info, but the first is a property of the variant and the second fits better as a seperate search for phenotype associations. It would be interesting to know how anyone else is using it.

Searching by featureIds rather than just ranges is useful because many features overlap. If you are interested in a specific transcript, there could be a dozen in the same range, so only supporting range queries could mean more records than are required are returned and force the filtering onto the API user. There is also a nice consistency if you can throw a set of featureIds at the various endpoints and retrieve different types of data.

Tightening up the info structure as in #700 makes a lot of sense.

david4096 commented 7 years ago

Thanks for the input regarding feature search. I'll leave it as is then. If there is a way I can push this PR forward without having to make modifications to the variants model, I'll take it. I believe we are going to need to switch to a variant model with one alternate base entry per variant and my hope is that by separating out the messages now we will be better prepared.