uniqueg commented 5 years ago

Synopsis

A strong case can be made for providing a means to uniquely identify data objects even in DRS v1.0. Assigning a hash sum-based "universally unique identifier" (UUID; quoted because it will very likely not conform to RFC 4122) to each data object (probably in addition to a less unwieldy "locally unique" identifier) may provide such a means.

Use cases

Identify data: One is sometimes confronted with a piece of data that has no header/description/metadata attached. If the data had previously been deposited in a DRS instance that the user has access to, a call to the discovery service could quickly identify the data via the UUID query endpoints of known DRS instances and return its metadata.
Ensure/avoid redundancy: For data safety and even political reasons, it is desirable to ensure that important data exists in multiple copies. On the other hand, in order to save resources or for reasons of data privacy, it is desirable to minimize the level of redundancy according to the policies of a specific data provider/vendor. UUIDs would help to enforce such data redundancy policies by implementing a mechanism by which copies of data objects can be found and annotated automatically, either within individual DRS instances (via queries to the discovery service) or within the discovery service itself (centralized database).
Efficient use of caches: Data consumers may wish to analyze a given data object (e.g. when discovering a DRS ID associated with a publication) that is already available in a local cache. In most cases, one would wish to analyze the local copy of the data object as it would likely speed up the analysis and minimize costs. UUIDs could be easily leveraged by WES implementations to check whether the data a user wants to analyze is available in a local cache.
Optimize distributed computing: Being able to quickly and unambiguously identify all accessible copies of a given data object (given its DRS identifier) would enable WES/TES to make more informed decisions on distributing (parts of) analysis workflows according to considerations such as data location, access rights/use restrictions, load balancing and costs.
Federation of access rights: If a data consumer requests a data object that they have access rights to on at least one mirror, data providers/vendors may choose to propagate these permissions for the same data object under their own DRS instance, even though the user does not have direct access rights to that data object on that platform. UUIDs could provide a mechanism to ensure that data consumers have universal access to all copies of a data object they have access to.
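As a rough illustration of the cache use case above, here is a minimal sketch of a content-addressed lookup. It assumes a SHA-256 checksum is used as the cache key; `resolve_input`, `fetch_remote` and the cache layout are hypothetical names chosen for this example and are not part of DRS or WES.

```python
import hashlib
import tempfile
from pathlib import Path


def resolve_input(checksum, cache_dir, fetch_remote):
    """Return (path, was_cached) for the data object with the given SHA-256 checksum.

    `fetch_remote` is a hypothetical callable that downloads the object's bytes,
    e.g. from whichever DRS instance advertises the checksum.
    """
    cached = Path(cache_dir) / checksum
    if cached.is_file():
        # A local copy with the same content already exists: no download needed.
        return cached, True
    data = fetch_remote()
    # Verify the download before trusting it as a copy of the requested object.
    if hashlib.sha256(data).hexdigest() != checksum:
        raise ValueError("fetched data does not match the requested checksum")
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(data)
    return cached, False


payload = b"example payload"
payload_checksum = hashlib.sha256(payload).hexdigest()
with tempfile.TemporaryDirectory() as cache_dir:
    _, first_hit = resolve_input(payload_checksum, cache_dir, lambda: payload)
    _, second_hit = resolve_input(payload_checksum, cache_dir, lambda: payload)
```

The first call downloads and stores the object; the second finds it in the cache, regardless of which DRS identifier the user started from.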

Implementation

While the assignment of random, data-independent UUIDs would be computationally cheap, this mechanism is rather unfeasible, as UUID assignment would not be consistent for multiple copies of the same data object without resorting to centralized or distributed (e.g., blockchain) data registries.

Data-dependent generation of unique identifiers via the application of one or more standardized cryptographic hash functions computed for a given data object would forgo the need for such registries. And since libraries for computing common hash functions such as MD5, SHA-1, SHA-256 & SHA-512 are widely available for all major programming languages, realizing a hash function-based mechanism for assigning quasi-unique identifiers would put only a minimal burden on DRS implementers. A downside of this approach is that calculating hash sums is relatively expensive, and the cost generally rises as the likelihood of two non-identical data objects producing the same hash sum falls.
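To make the data-dependent approach concrete, here is a minimal sketch using Python's standard `hashlib`. The choice of SHA-256 as the default and the name `content_checksum` are assumptions for illustration only; the hash function would have to be agreed on by the community.

```python
import hashlib
import io


def content_checksum(stream, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a binary stream, read in fixed-size chunks.

    Chunked reading keeps memory use constant even for very large data objects.
    """
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Two independent copies of the same bytes yield the same identifier,
# without consulting any central registry.
copy_a = io.BytesIO(b"ACGT" * 1000)
copy_b = io.BytesIO(b"ACGT" * 1000)
same_identifier = content_checksum(copy_a) == content_checksum(copy_b)
```

The same function works for files opened in binary mode, e.g. `open(path, "rb")`.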

Discussion points

Critical

Is a unique checksum desirable? If not, what could be possible alternatives for the outlined use cases? If so, would it be desirable for v1.0? And is a query endpoint desirable, too, for the first version?
In order to guarantee that data objects can be uniquely identified via their checksums, the community would have to agree on a hash function (or possibly more, to be computed sequentially and merged in some way) that is mandated by the specification to be computed for every data object (and possibly collection). A consensus on the hash function/s to be employed needs to balance the probability (and potential consequences!) of assigning the same checksum to non-identical data objects, the "future proofness" of the function in light of the exponential increase in data objects being created, and the feasibility/costs of calculating checksums for, potentially, exabytes of already existing and future data.
In this light, it might further be advisable to agree on a strategy for changing the hash function used in future versions of the specification. Should the DRS version be added to the metadata of a data object? Should the value be a dictionary with checksum version keys?
Are there strong objections to having two unique identifiers, (1) a user-friendly one of the form drs://host/id assigned to each individual copy of a data object, and (2) a hash sum unique for all copies of a data object that is likely used only internally/programmatically for the above-mentioned use cases? If so, are there suggestions how this could be avoided while maintaining both ease-of-use and the power of uniquely identifying data objects across services?
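The two-identifier idea and the "dictionary with checksum version keys" question could look roughly like the sketch below. The field names `id` and `checksums`, the algorithm-name keys and the helper names are all assumptions for illustration, not a proposed schema.

```python
import hashlib


def build_metadata(host, local_id, content, algorithms=("sha256", "sha512")):
    """Build illustrative metadata: a per-copy DRS ID plus content checksums.

    Keying the checksums by algorithm name lets the mandated hash function
    change in a later specification version without breaking older records.
    """
    checksums = {name: hashlib.new(name, content).hexdigest() for name in algorithms}
    return {"id": f"drs://{host}/{local_id}", "checksums": checksums}


def same_content(meta_a, meta_b):
    """Compare two records on the algorithms they share.

    Returns None when no algorithm is shared, i.e. the question is undecidable
    from the metadata alone.
    """
    shared = set(meta_a["checksums"]) & set(meta_b["checksums"])
    if not shared:
        return None
    return all(meta_a["checksums"][n] == meta_b["checksums"][n] for n in shared)


original = build_metadata("a.example.org", "obj-1", b"payload")
mirror = build_metadata("b.example.org", "copy-7", b"payload", algorithms=("sha256",))
```

Here `original` and `mirror` carry different `drs://host/id` identifiers but match on their shared SHA-256 checksum.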

Minor/syntax

How would a data object's metadata be affected in the specs? Should there be a separate metadata field (e.g., "checksum") or should it be a required key in the (already proposed) hash sums field? How about collections?
What would the endpoint look like? Should it return a list of DRS metadata for all matching data objects similar to that discussed for the GET collections endpoint? Should the specs enforce that there be only one data object with a given checksum per instance (in which case the return type would not need to be a list)? Alternatively, should it return just a simple list of DRS identifiers pointing to all known copies of the data object? Or even just the DRS identifier for the one (or multiple) copies of the data object on this particular instance (with finding of mirrors being delegated to a discovery service querying all known DRS instances)?
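One of the options above, returning the DRS identifiers of all copies known to a single instance, can be sketched with a simple in-memory index. `ChecksumIndex` and its methods are hypothetical names; a real implementation would sit behind whatever endpoint the specification ends up defining.

```python
from collections import defaultdict


class ChecksumIndex:
    """Toy per-instance index mapping a checksum to the DRS IDs that carry it."""

    def __init__(self):
        self._by_checksum = defaultdict(list)

    def register(self, drs_id, checksum):
        # Several objects on one instance may share a checksum unless the
        # specification forbids it, hence a list rather than a single value.
        self._by_checksum[checksum].append(drs_id)

    def lookup(self, checksum):
        # Always return a list; it is empty when nothing matches.
        return list(self._by_checksum.get(checksum, []))


index = ChecksumIndex()
index.register("drs://a.example.org/obj-1", "abc123")
index.register("drs://a.example.org/obj-9", "abc123")
matches = index.lookup("abc123")
```

If the specification enforced one object per checksum per instance, `lookup` could return a single identifier (or nothing) instead of a list.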

Related issues

#78 suggests an endpoint to search data objects by aliases.

uniqueg commented 5 years ago

This point was basically already raised by Susheel during last week's hackathon, but I felt that it fell somewhat under the radar and didn't receive the consideration which I think it deserves given that the food was threatening to get cold... ;-)

I also felt that Susheel's use of hash sums as object IDs in his demonstration may have distracted from the usefulness of a checksum as an additional metadata field (paired with a dedicated query endpoint), as I think that those two serve quite different purposes (as outlined in the issue description above). Of course, individual DRS implementers might still choose to assign the checksum as the object ID as well, and might do so especially if the checksum, but not the corresponding endpoint, makes it into DRS v1.0.

sarpera commented 5 years ago

@uniqueg I strongly agree that checksum values are very useful for the use cases you mentioned, especially for findability and DRS-of-DRSes cases. Though I feel like the added value of using the checksum value as a unique identifier among DRS implementations would be lost if there is no consensus on the hash function, as you stated, which could be very challenging to achieve. Even if there was a consensus, I could only guess it would take quite a lot of time and resources for DRS implementors to create checksums for their existing/future data, at least for DRS v1.0. Also, although collections/bundles are still very vague, creating a new bundle (POST request) would become an async operation in order to calculate the checksum of checksums of its children, just to return its ID as a response.
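For illustration, one conceivable way to compute a "checksum of checksums" is sketched below. Sorting the child digests and hashing their concatenation is an assumption made here so that the result does not depend on child order; nothing in this thread fixes that rule, and `bundle_checksum` is a hypothetical name.

```python
import hashlib


def bundle_checksum(child_checksums, algorithm="sha256"):
    """Hash the sorted, concatenated hex digests of a bundle's children.

    Assumes all child digests are fixed-length hex strings from the same
    algorithm, so plain concatenation is unambiguous.
    """
    digest = hashlib.new(algorithm)
    for child in sorted(child_checksums):
        digest.update(child.encode("ascii"))
    return digest.hexdigest()


children = [hashlib.sha256(data).hexdigest() for data in (b"file-1", b"file-2")]
forward = bundle_checksum(children)
backward = bundle_checksum(list(reversed(children)))
```

The expensive part is not this final step but obtaining the child checksums in the first place, which is why the POST would likely need to be asynchronous.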

An alternative would be to allow lookup by checksum value, e.g., /objects/?checksum_value=<string> or a similar endpoint, or to use the checksum value as an alias property, which should guarantee that the response for /objects/?alias=<checksum_value> would be a single object similar to /objects/<id>, if it exists.

ga4gh / data-repository-service-schemas

"UUID" and corresponding query endpoint #215