ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

ReadAlignment, Variant, and Feature link to Reference by name rather than id #593

Open diekhans opened 8 years ago

diekhans commented 8 years ago

The design pattern in the API is that object linkage is done by id, not by name.

The Position object uses referenceName rather than referenceId.

This should be corrected, or the rationale for this inconsistency should be clearly documented.

david4096 commented 7 years ago

@mbaudis @ejacox @kozbo It would be great to land on one side or the other here before release. Just using reference names seems like it would make people the happiest at this point. Ideas?

Having the way a position is specified differ from one protocol to the next is really confusing. To close this, we might also remove the alternative reference_id pattern from SearchVariantAnnotations.

ejacox commented 7 years ago

Does this fit within our external ids discussion? The reference_name is the identifier used within a particular reference (hg38?). In that view, reference_name is fine if we indicate somewhere what the reference is.

david4096 commented 7 years ago

@ejacox if there is a way to elegantly solve it using that mechanism, I'm not against it. My hope is that we can treat this problem simply as using the same position searches throughout the API, without needing to do much more modeling. Reference ID and name are used inconsistently. Either one might still turn out to be the wrong choice, but let's choose one approach.

@dcolligan @delagoya, opinions? We can close this by removing reference ID from SearchReadsRequest and SearchVariantAnnotationsRequest, which seems to align with @richarddurbin's comments here: https://github.com/ga4gh/schemas/pull/616. We would then also merge language like https://github.com/ga4gh/schemas/pull/732 to enforce that references are uniquely named within a reference set.
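
To make the uniqueness constraint concrete, here is a minimal sketch in Python. The `Reference` dataclass and the `index_by_name` helper are hypothetical names used only for illustration; they are not part of the schemas or of either pull request. The sketch assumes nothing more than that each reference carries a server-generated id and a name.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Reference:
    # Hypothetical minimal record: a server-generated id plus a name such as "1" or "chrX".
    id: str
    name: str


def index_by_name(references: Iterable[Reference]) -> Dict[str, Reference]:
    """Build a name -> Reference index, rejecting duplicate names in one reference set."""
    index: Dict[str, Reference] = {}
    for ref in references:
        if ref.name in index:
            raise ValueError(f"duplicate reference name in reference set: {ref.name!r}")
        index[ref.name] = ref
    return index
```

If names are guaranteed unique like this, a name is as unambiguous as an id within a given reference set, which is what would make dropping reference ID from the search requests safe.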

I believe it makes it easiest to work with some genomics data if only the reference name is required for search, since you technically don't have to have a reference set local to your instance. Names like 1, 2, 3 are fairly portable.
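
As a rough illustration of that point, the sketch below runs a range search using only a reference name. The `search_by_name` function and the dict-based records are hypothetical, and the half-open `[start, end)` interval convention is an assumption of this example rather than something stated in this thread.

```python
from typing import Dict, Iterable, List


def search_by_name(records: Iterable[Dict], reference_name: str,
                   start: int, end: int) -> List[Dict]:
    """Return records on `reference_name` overlapping the half-open range [start, end)."""
    return [
        r for r in records
        if r["reference_name"] == reference_name
        and r["start"] < end and r["end"] > start
    ]


# Hypothetical in-memory records; no reference set or server-generated id is involved.
variants = [
    {"reference_name": "1", "start": 100, "end": 101},
    {"reference_name": "1", "start": 500, "end": 501},
    {"reference_name": "2", "start": 100, "end": 101},
]
hits = search_by_name(variants, "1", 0, 200)
```

Nothing here needs to look up a reference object first, which is the portability argument for names.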

If we choose IDs, we should implement searching references by name (https://github.com/ga4gh/schemas/pull/665). It is nice to have in either case.
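
For comparison, this is roughly what a client would have to do under the ID-based option: resolve the name to an id first, then issue the range search. It is only a sketch. Both helper functions are hypothetical, the references are plain dicts with assumed `id` and `name` keys, and the request body is an illustrative shape, not the actual message definition.

```python
from typing import Dict, List


def reference_ids_for_name(references: List[Dict[str, str]], name: str) -> List[str]:
    """Return the ids of all references whose name matches exactly."""
    return [ref["id"] for ref in references if ref["name"] == name]


def build_range_request(references: List[Dict[str, str]], name: str,
                        start: int, end: int) -> Dict[str, object]:
    """Translate a name-based query into an id-based request body."""
    ids = reference_ids_for_name(references, name)
    if len(ids) != 1:
        # Zero or several matches means the name cannot be resolved to a single id.
        raise LookupError(f"expected exactly one reference named {name!r}, found {len(ids)}")
    return {"reference_id": ids[0], "start": start, "end": end}
```

The extra lookup step is the cost of choosing IDs, and it only works cleanly when names are unique within the reference set being searched.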

Note that 23andMe opted to use accession_id when specifying a reference for range searches (https://api.23andme.com/docs/reference/), which falls somewhere between using reference names and server-generated identifiers.