The big name | ID | UUID | version ID discussion

mbaudis commented 9 years ago

In many PRs we're now running into very differing views on when and how to use object identifiers. We, for all different areas/task teams, have to settle on one general way to use object identifiers (OID).

This discussion here does not address each object's need to implement a given type of OID, just the relationships between OID attribute name and value, and the general OID usage.

This discussion also starts from the assumption that OID usage will not depend on a GA4GH-provided arbitration system. However, this does not mean that such services won't be implemented.

The issues on which the proposed OIDs differ can be separated into:

- uniqueness (global/local/none)
- content dependency (partial/full content hashing)
- generation (machine vs. readable/legacy)

This opens the following attributes for discussion:

id
- required (exception to the "object's need for OID type not discussed here")
- locally unique
- usually something "readable" (GSM2492834, BRCA-2015-DCIS0012 ...)
- main type for (local) references
guid
- UUIDv4
- assumed to be globally unique without central arbitration, due to its structure
- computer generated at creation
- global references (though there is no guarantee that the object is not a duplicate of another object)
name
- not necessarily "clean"; possibly descriptive ("patient_25, PB")
- important for legacy data & "human parsing"
- not recommended for object references
versionID
- (partial/complete) object hash
- version management (e.g. combined with time stamps)
- global references with absolute uniqueness
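
To make the four attributes concrete, here is a minimal Python sketch of how they might sit side by side on one record. It is not part of any GA4GH schema: the `make_record` and `content_hash` helpers, the field layout, and the choice of SHA-256 over canonical JSON for `versionID` are all assumptions made for illustration.

```python
import hashlib
import json
import uuid


def content_hash(content: dict) -> str:
    """Full-content hash: SHA-256 over a canonical JSON serialisation (assumed scheme)."""
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_record(local_id: str, name: str, content: dict) -> dict:
    """Build a record carrying all four identifier attributes (hypothetical layout)."""
    return {
        "id": local_id,                      # locally unique, follows local systematics
        "guid": str(uuid.uuid4()),           # UUIDv4, generated once at creation
        "name": name,                        # descriptive, not meant for references
        "versionID": content_hash(content),  # changes whenever the content changes
        "content": content,
    }


record = make_record("GSM2492834", "patient_25, PB", {"chr": "17", "pos": 41197695})
same_content = make_record("local-0002", "copy", {"pos": 41197695, "chr": "17"})
```

In this sketch, `record` and `same_content` share a `versionID` because their content is identical, yet they receive different `guid` values. That is the caveat noted under guid: a UUID identifies the object it was minted for, but cannot tell you that another object holds the same data.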

The main points that I predict will be discussed here are abstaining from using name for referencing (which differs from some PRs) and the guid/UUID types.

[ In principle, this could be collapsed by using guid type identifiers for id. However, this would cause many headaches for both legacy data & privacy (e.g. you may want to pass on datasets with your locally unique identifiers, without allowing easy global identification of data references with respect to other resources). ]

jeromekelleher commented 9 years ago

Thanks for kicking this off @mbaudis, this is a key topic. I think you've summarised things very well here.

Collapsing the GUID and ID would be problematic for implementers, I think, as it is useful to have control of the ID structure. Certainly it would give us major headaches in the reference server. For example, we use variant IDs of the form base64encode(datasetName:variantSetName:variantName), where variantName is something like chr:pos:alt. When we get a GET variant request for a given ID, we then use this structure to find the right dataset, variantSet and variant. If we didn't have this structure, we would have to generate a GUID for every variant and read, and maintain a global index of all GUID -> variant/read mappings, which would be rather large.

I don't think IDs should necessarily be seen as human readable, since they are server generated. In the reference server we've base64 encoded the values to discourage clients from using the ID structure outlined above, since it's not (and shouldn't be) part of the API.
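
The compound-ID idea described above can be sketched in a few lines of Python. This is an illustration only, not the reference server's code: the function names are hypothetical, the URL-safe base64 variant is an assumption, and the sketch assumes dataset and variant set names contain no colon.

```python
import base64


def encode_variant_id(dataset: str, variant_set: str, variant_name: str) -> str:
    """Pack the three components into one opaque-looking ID string."""
    compound = ":".join([dataset, variant_set, variant_name])
    return base64.urlsafe_b64encode(compound.encode("utf-8")).decode("ascii")


def decode_variant_id(variant_id: str) -> tuple:
    """Recover (dataset, variant_set, variant_name) from an encoded ID."""
    raw = base64.urlsafe_b64decode(variant_id.encode("ascii")).decode("utf-8")
    # variant_name itself contains colons (chr:pos:alt), so split at most twice.
    dataset, variant_set, variant_name = raw.split(":", 2)
    return dataset, variant_set, variant_name


example_id = encode_variant_id("dataset1", "variantSetA", "chr1:12345:A")
```

Because the server can decode the ID back into its parts, no GUID-to-variant index is needed. Clients should still treat the string as opaque.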

lh3 commented 9 years ago

My view:

- ID. Unique within a repository. Repo dependent. Not necessarily readable. Not stable. Example: an auto-incremental primary key in SQL.
- Query-able name, e.g. reference name and sample name. Users are interested in putting them in an API query.
- Display name, e.g. read group name, library name and read name. These names don't appear in any query. Just for display purpose.
- Accession number plus version number. Stable identifier. See this paper. Accession numbers are widely used among bio databases and working well in practice.
- Content digest. Computed from content in a pre-defined way.
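
As a rough illustration of the "accession number plus version number" item, the Python sketch below parses a dotted `accession.version` string. The dotted form, the regular expression, and the `ACC000123` value are assumptions for the example, not something this thread specifies.

```python
import re
from typing import NamedTuple

# Assumed syntax: an alphanumeric accession, a dot, then an integer version.
ACCESSION_RE = re.compile(r"^(?P<accession>[A-Za-z0-9_]+)\.(?P<version>[0-9]+)$")


class VersionedAccession(NamedTuple):
    accession: str  # stable across revisions of the record
    version: int    # incremented when the record changes


def parse_accession(text: str) -> VersionedAccession:
    """Split 'ACCESSION.VERSION' into its stable and versioned parts."""
    match = ACCESSION_RE.match(text)
    if match is None:
        raise ValueError(f"not a versioned accession: {text!r}")
    return VersionedAccession(match.group("accession"), int(match.group("version")))


older = parse_accession("ACC000123.2")
newer = parse_accession("ACC000123.10")
```

Storing the version as an integer means `ACC000123.10` sorts after `ACC000123.2`, which a plain string comparison would get wrong.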

mbaudis commented 9 years ago

@jeromekelleher I'll change the "readable" to something like "representing some local systematics". The emphasis for human readability should be put on name and description (not mentioned yet, since not strictly "OID").

@lh3 We had already envisioned separate attributes for accessions and description in metadata. I'll extend the proposal to accommodate those.

As per @diekhans, I'll move this discussion ASAP to the documentation branch.

richarddurbin commented 9 years ago

Hello all,

I had dinner with David Glazer and other Google people on Tuesday night, and this topic came up with respect to columns/callsets in VariantSets. I came up with a new approach, which, after some refinement in discussion, I think has guarded approval, at least as an approach worth discussing.

Currently for a CallSet in a VariantSet there is something like

    ID columnId ;                    // system generated unique id
    string name ;                    // user defined text
    ID sampleId ;                    // should be system generated unique id
    map{string,string} attributes ;  // other metadata

Note that I have used the type ID for ids over which the user should have no control, which arguably should be opaque to end users. These are strings now. You can imagine ID as a typedef to string.

The revised proposal is

    Label columnLabel ;  // required, user-defined, unique within the variant set
    ID sampleId ;        // optional
    ID methodId ;        // an id for a "method" object that can hold metadata about how the calls were made, optional

The new concept is Label, which is a string of printable characters, possibly with restricted syntax e.g. [a-zA-Z][a-zA-Z0-9_]* like programming language identifiers. The aim is that this should be a reasonable thing for users to look at and use, e.g. to display as column headings, or use in queries. This replaces both columnId, which is server-unique and not under user control, and name, which is user-defined and can’t be relied on to be unique.

We are OK to require the labels to be unique, because the scope is limited to the CallSet, so long range clashes are not a problem. The semantics of merging VariantSets can be specified as part of the merge operation - both options of maintaining uniqueness by adding suffixes if necessary or forcing merge when the labels coincide have merit in different circumstances. Because the labels must be unique, they can be used as a key for retrieval of CallSet columns in the VariantSet.

Having them separate from the sampleId means that you can have two CallSet columns for the same sample in one VariantSet, which is nice to be able to compare call sets made in different ways from the same sample (e.g. Illumina versus CG, or different callers).

Having a methodId that references an external Method object, which contains a metadata map, allows columns to share the same information about how they were called. In fact, we think it would make sense for the VariantSet itself to have an optional methodId, which would then act as a default for all its callsets, covering the standard case where all callsets are made the same way.

It is good for both sampleId and methodId to be optional, so that lightweight VariantSets can be made by importing VCF files or the equivalent, without having to create lots of other empty objects first. We should not underestimate the importance of lightweight use of the object representation and API - a large amount of sequence data handling is done in small labs with LIMS that will autopopulate Sample objects, or in exploratory analysis, and we would like users not to have to manage unnecessary appendages.
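
To illustrate the Label idea, here is a small Python sketch covering the restricted syntax and the "add suffixes if necessary" merge option. The function names and the `_2`, `_3`, ... suffix scheme are assumptions for the example; the proposal itself leaves the merge semantics to the merge operation.

```python
import re

# The restricted syntax suggested above: programming-language-style identifiers.
LABEL_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")


def is_valid_label(label: str) -> bool:
    """True if the whole string matches the restricted label syntax."""
    return LABEL_RE.fullmatch(label) is not None


def merge_labels(left: list, right: list) -> list:
    """Merge two label lists, keeping labels unique by appending numeric suffixes."""
    merged = list(left)
    seen = set(left)
    for label in right:
        candidate = label
        suffix = 2
        while candidate in seen:
            # Clash: try label_2, label_3, ... until the label is unused.
            candidate = f"{label}_{suffix}"
            suffix += 1
        seen.add(candidate)
        merged.append(candidate)
    return merged


merged = merge_labels(["NA12878", "NA12891"], ["NA12878", "NA12892"])
```

Here the second `NA12878` column becomes `NA12878_2`, so both call sets for the same sample survive the merge and each can still be retrieved by its label. The alternative, forcing a merge when labels coincide, would instead fold the two columns into one.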

I guess I should make a pull request.

I think the same approach using Label with uniqueness within a limited scope could be used in other places, e.g. for References in a ReferenceSet.

Richard

On 23 Sep 2015, at 07:57, Michael Baudis notifications@github.com wrote:

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/418#issuecomment-142514069.

mlin commented 9 years ago

(program note: @richarddurbin opened PR #427 indeed!)

@mbaudis I enter this late so please disregard with my apologies if this is well-trodden ground: Was consideration given to a URI scheme for guids, as a way of preserving some degree of human readability? That is, a global domain prefix to the locally unique id. Once upon a time there was a whole consortium project just about this, LSID.

mbaudis commented 9 years ago

@mlin The guid type hasn't been discussed in this direction. Since we assume no need for a central authority for ID management or even repository tracking, anonymous collision-free UUIDs seem like the best way to guarantee object identity and allow object retrieval from a variety of alternative resources.

However, we should also add the option to expose a fully qualified URI per object, which would then depend on the local resource.
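
A fully qualified URI of this kind could be built from a resource prefix plus the locally unique id, along the lines of this Python sketch. The `object_uri` helper, the path layout, and the `example.org` base URL are hypothetical, chosen only to show the shape of the idea.

```python
from urllib.parse import quote


def object_uri(base_url: str, object_type: str, local_id: str) -> str:
    """Combine a resource-specific prefix with a locally unique id (assumed layout)."""
    # Percent-encode each path segment so ids containing '/' or spaces stay intact.
    return "/".join([
        base_url.rstrip("/"),
        quote(object_type, safe=""),
        quote(local_id, safe=""),
    ])


uri = object_uri("https://example.org/ga4gh/", "variants", "BRCA-2015-DCIS0012")
```

The resulting URI stays readable and is globally unique as long as the prefix is, but unlike a UUID it is tied to the resource that serves it.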

diekhans commented 9 years ago

We don't yet have a conceptual model for global object identification, so settling the actual format of the IDs is a bit premature.

Any method for global identification should be linked with provenance information. Identifying the origin of data is an important use case for GUIDs.


ga4gh / ga4gh-schemas

The big name | ID | UUID | version ID discussion #418