Sequence Annotation search by external identifiers

macieksmuga commented 8 years ago

As seen in proposed usage patterns in the upcoming G2P endpoints (https://github.com/ohsu-computational-biology/schemas/blob/apichanges/doc/source/api/proposed_schema_changes.md) as well as the pending RNA API PR #630, there is a clear need for the features/search endpoint to support searching by feature names, or rather, by external identifiers (such as HUGO gene names, rs IDs, Ensembl IDs, etc.).

In the case of G2P, such an extension would remove the need for a separate genotypes/search endpoint that would effectively do the same thing as the proposal above. In the case of RNA, it would allow RNA data to be imported with the correct feature ID associated with each expression level.

Here's a proposed change to the Feature schema and associated search endpoint to address this issue: https://gist.github.com/david4096/8ade1883089bede01ed18168ec489ba8

diekhans commented 8 years ago

I don't understand In [53] in the proposal. What does "synthesis with other data sets to be progressive." mean? What is a "federation of shared knowledge"?

david4096 commented 8 years ago

This is an aspirational feature of the field. The idea is that as the same feature is resolved in another database, a server operator can append that identifier to the record allowing common search patterns across server instances.

By progressive I mean that, if one server operator would like to synthesize with a novel or existing dataset, they can do so by appending identifier objects to features as they are found. The same identifier schemes biologists are accustomed to (ensembl, havana, refseq) would be used throughout the API.

This leads to a loose federation model where data curators control how their data are accessed by providing familiar entry points. I can ask database 1 and database 2 about their entry for external identifier X. My database could then provide external identifiers to database 1 and 2, and optionally X. If my database includes the external identifier X then I have shared the knowledge that X exists in database 1 and 2.
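As a rough illustration of the lookup pattern described above, here is a minimal in-memory sketch. The class and field names (`ExternalIdentifier`, `Feature`, `FeatureServer`, `search`) are hypothetical and are not taken from the proposed schema or from the reference server; the two "servers" are plain Python objects standing in for separate features/search endpoints.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExternalIdentifier:
    database: str    # hypothetical field: naming authority, e.g. "ensembl"
    identifier: str  # hypothetical field: the ID within that authority


@dataclass
class Feature:
    local_id: str
    external_identifiers: list = field(default_factory=list)


class FeatureServer:
    """In-memory stand-in for one server's features/search endpoint."""

    def __init__(self, name, features):
        self.name = name
        self.features = features

    def search(self, external_id):
        # Return every local feature tagged with the requested external ID.
        return [f for f in self.features if external_id in f.external_identifiers]


# "X" from the comment above: one external identifier known to both servers.
x = ExternalIdentifier("ensembl", "ENSG00000139618")
db1 = FeatureServer("db1", [Feature("db1-f7", [x])])
db2 = FeatureServer("db2", [Feature("db2-f42", [x]), Feature("db2-f43")])

# The same query works against both servers even though local IDs differ.
hits = {s.name: [f.local_id for f in s.search(x)] for s in (db1, db2)}
print(hits)
```

A third server that records `x` on one of its own features would become reachable through the same query without any coordination on local IDs.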

diekhans commented 8 years ago

I still have no idea what synthesizing a data set means or why this is progressive or how this implements federation.

What does external identifier mean in this context? That is, external to what?

pgrosu commented 8 years ago

+1 I definitely endorse this idea! Mark, I think David and Maciek just want the flexibility to join datasets across multiple annotation schemes: identifiers would be translated between annotations so that they meet on common search terms such as gene names, and the pointer to the annotation source can be dynamic.

It's basically a way of performing a dynamic search: you have a translation table from one annotation type to another, and you run the query through it. You might remember that two years ago I wrote about a similar idea, associating inverted indices through hashes (digests), at the following location:

https://github.com/ga4gh/schemas/issues/142#issuecomment-55518571

With a dynamic query example here:

https://github.com/ga4gh/schemas/issues/212#issuecomment-104710330

Hope it helps, Paul
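To make the translation-table idea a little more concrete, here is a small sketch. It assumes a hypothetical table keyed by (scheme, identifier) and a feature index keyed by Ensembl ID; none of these names come from the schemas or from the linked comments.

```python
# Hypothetical translation table: (source scheme, ID) -> {target scheme: ID}.
TRANSLATION = {
    ("hgnc", "BRCA2"): {"ensembl": "ENSG00000139618"},
}

# Hypothetical feature index that only knows Ensembl identifiers.
FEATURES_BY_ENSEMBL = {
    "ENSG00000139618": {"local_id": "feature-001"},
}


def translate(scheme, identifier, target):
    """Map an identifier into the target scheme, or return None if unknown."""
    if scheme == target:
        return identifier
    return TRANSLATION.get((scheme, identifier), {}).get(target)


def search(scheme, identifier):
    """Look up a feature by any scheme the translation table understands."""
    ensembl_id = translate(scheme, identifier, "ensembl")
    return FEATURES_BY_ENSEMBL.get(ensembl_id)


print(search("hgnc", "BRCA2"))
print(search("ensembl", "ENSG00000139618"))
```

Both calls reach the same record, which is the point of the sketch: the caller picks whichever identifier scheme is familiar, and the table does the joining.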

andrewjesaitis commented 7 years ago

+1

I really like this idea as well. It will definitely move us toward a more federated model without having to implement/solve the data-dependent id problem.

Even in a world where we generate unique ids based on the record, I can still imagine the practical problem of people modifying their version of the feature database. Then, when someone attempts to look up that feature in the canonical db (rather than the author's), one of two things happens. Looked up by id, it cannot be found, since the data supporting the id changed and so the id changed. Looked up by a commonly used id/name (provided by Ensembl or RefSeq), it isn't really what the author of the data was referring to. At least by providing external ids as tuples of db, version, and id, we can maintain referential integrity at the time of data creation.
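A minimal sketch of the (db, version, id) tuple, assuming a hypothetical `ExternalId` type and a toy canonical index; the release numbers and coordinates are placeholder values, not real annotation data.

```python
from typing import NamedTuple


class ExternalId(NamedTuple):
    db: str       # naming authority, e.g. "ensembl"
    version: str  # release of that database the author used
    id: str       # identifier within that release


# Toy canonical index: the same id resolves differently in two releases.
# Coordinates are placeholders for illustration only.
canonical = {
    ExternalId("ensembl", "84", "ENSG00000139618"): {"start": 100, "end": 200},
    ExternalId("ensembl", "85", "ENSG00000139618"): {"start": 100, "end": 250},
}


def resolve(ref):
    """Return the record for exactly the cited release, or None."""
    return canonical.get(ref)


cited = ExternalId("ensembl", "84", "ENSG00000139618")
print(resolve(cited))
```

Because the version is part of the key, a later release that changes the record does not silently change what the author's reference resolves to.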

david4096 commented 7 years ago

We might consider adding an external identifier message to RNA expression levels so that a similar pattern of access is available there. The use case we would like to support is hosting an RNA expression table without having the Feature Set used to generate it available locally. Often one simply cares about the Ensembl identifier, which is enough to construct a useful query against any Feature Set. @saupchurch @ejacox
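A sketch of that use case, assuming hypothetical record shapes (plain dicts with an `external_id` pair) rather than the actual RNA or Feature messages; the expression values are made up.

```python
# Expression levels carry only an external identifier, not a local feature ID.
expression_levels = [
    {"expression": 12.5, "external_id": ("ensembl", "ENSG00000139618")},
    {"expression": 0.8, "external_id": ("ensembl", "ENSG00000141510")},
]

# Any feature set, possibly hosted elsewhere, indexed by external identifier.
feature_set = {
    ("ensembl", "ENSG00000139618"): "feature-001",
}


def annotate(levels, features):
    """Pair each expression level with a feature ID where one can be resolved."""
    return [
        (lvl["external_id"][1], lvl["expression"], features.get(lvl["external_id"]))
        for lvl in levels
    ]


for row in annotate(expression_levels, feature_set):
    print(row)
```

Levels whose identifier is missing from a given feature set still come back, with `None` in place of the feature ID, so the expression table stays usable on its own.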

kozbo commented 7 years ago

Aridhia followed this usage model in their GA4GH federation demonstration. When it came time to look up feature information on the variants found from the 5 different GA4GH servers, they used the 1kgenomes server's feature set. They knew that the feature names would resolve in the feature set because of the way they constructed the test. But the knowledge of how to resolve the identifiers resided in the client code and could not be discovered from the data representation in the servers.

Another point to be made is that genomic feature information doesn't change very frequently. It makes sense to have Feature Set services available on the net so that instances of GA4GH servers wouldn't need to ingest their own copy. The current setup instructions we have posted for our reference server suggest loading the GENCODE annotations with each server instance. I don't think that is the correct model for commonly used, infrequently changing data.

diekhans commented 7 years ago

GENCODE human is updated once every 6 months, mouse once every 3 months. RefSeq full releases happen 3-4 times a year.

Analysis using a given version of a gene annotation set needs to be able to access that specific version, or the result will not be fully interpretable.

Having a federated system where annotations, genomes and other common resources could be accessed remotely would make setting up small servers easier.

Unfortunately, the GA4GH API was not designed to be a federated API. This would be a good thing to bring up as part of the requirements for the refactoring effort.

ga4gh / ga4gh-schemas

Sequence Annotation search by external identifiers #633