Search by external identifiers

david4096 commented 7 years ago

This makes our approach to naming more representative of the underlying data, providing "light" referential integrity for many message types. For example, a variant might be named rs123445 in dbSNP, so we provide the external identifier database: dbsnp, id: rs123445. This allows one to then query for variants using an external naming scheme. Today this use case is handled by cleverly naming items.

Many types now have external identifier fields available, though they are optional. An external identifier query returns any items that match all of the provided fields. If only database: dbsnp is provided, the results are the variants tagged with an external identifier whose database field matches exactly.
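To make the matching rule concrete, here is a rough sketch in Python of how such a query could behave. It is not the server implementation: the class names, the field names (database, identifier), the search helper, and the sample records are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ExternalIdentifier:
    # Field names are assumed for illustration; the schema may differ.
    database: str
    identifier: str


@dataclass
class Variant:
    name: str
    external_identifiers: List[ExternalIdentifier] = field(default_factory=list)


def matches(ext_id: ExternalIdentifier,
            database: Optional[str] = None,
            identifier: Optional[str] = None) -> bool:
    """Return True if every provided query field equals the tag's field exactly.

    Fields left as None are treated as "not provided" and are ignored.
    """
    if database is not None and ext_id.database != database:
        return False
    if identifier is not None and ext_id.identifier != identifier:
        return False
    return True


def search(items: Iterable[Variant],
           database: Optional[str] = None,
           identifier: Optional[str] = None) -> List[Variant]:
    """Keep items that carry at least one external identifier matching the query."""
    return [
        item for item in items
        if any(matches(e, database, identifier) for e in item.external_identifiers)
    ]


# Made-up sample data, not taken from any real dataset.
variants = [
    Variant("v1", [ExternalIdentifier("dbsnp", "rs123445")]),
    Variant("v2", [ExternalIdentifier("dbsnp", "rs999")]),
    Variant("v3", [ExternalIdentifier("otherdb", "X1")]),
    Variant("v4"),  # no external identifiers at all
]

by_database = search(variants, database="dbsnp")
by_both = search(variants, database="dbsnp", identifier="rs123445")
```

In this sketch, querying with only the database returns every variant tagged for that database, while adding the identifier narrows the result to the single tagged variant. Untagged items never match, which is consistent with the fields being optional.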

The same pattern is available for RNA expression levels, which this protocol change is meant to support. Expression levels are often given gene names or Ensembl identifiers, and expecting data preparers to generate data sets that uniquely identify items in a GA4GH feature set is prohibitive for common usage patterns.

With this PR, instead of naming expression levels by their feature name and trying to find them in a feature set, it will be possible to return all items that match an external identifier query.

Discovering which external identifier databases are being provided by a dataset will be a helpful future addition. It would also be helpful to add examples specific to the message type where available.

Close #633

sarahhunt commented 7 years ago

Being able to search by external id will be very useful, but the ExternalIdentifier message is ambiguous. How can you tell if the version refers to the database or the entry? Database versions tend to be higher than entry versions, but this cannot be guaranteed. Separate attributes would be clearer.

david4096 commented 7 years ago

Thanks @sarahhunt, I've included entry_version in the message!
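As a rough picture of the separation discussed above, the message with distinct version attributes might look something like the Python sketch below. Only entry_version is named in this thread; database, identifier, and database_version are assumed names, and the version values are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExternalIdentifier:
    database: str                            # assumed field name
    identifier: str                          # assumed field name
    database_version: Optional[str] = None   # hypothetical: release of the database
    entry_version: Optional[str] = None      # version of the individual entry


# Invented values: the two versions can no longer be confused with each other.
rs = ExternalIdentifier(
    database="dbsnp",
    identifier="rs123445",
    database_version="150",
    entry_version="2",
)
```

Keeping both attributes optional means existing records that carry no version information stay valid.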

david4096 commented 7 years ago

@andrewjesaitis Thanks! Left off the field from PhenotypeInstance since G2P has its own use of the feature that I didn't want to complicate this PR with. We can circle back if we like the way it works.

You are absolutely correct that the Ontology Term fields would become redundant with this! Those source fields are slated to be removed since the CURIE itself should be enough to provide it (https://github.com/ga4gh/schemas/pull/694/files#diff-25e5013485a6b83f4f09fd8bb3e8693aL20). I think that it's a good idea to keep these use cases separate for now, since Ontology Terms have a good restricted vocabulary and we are bootstrapping one with the External Identifier scheme.

ga4gh / ga4gh-schemas

Search by external identifiers #761