ga4gh / data-repository-service-schemas

A repository for the schemas used for the Data Repository Service.
Apache License 2.0

"UUID" and corresponding query endpoint #215

Open uniqueg opened 5 years ago

uniqueg commented 5 years ago

Synopsis

A strong case can be made for the usefulness of providing a means to uniquely identify data objects even in DRS v1.0. Assigning a hash sum-based "universally unique identifier" (UUID; quoted because it will very likely not conform to RFC 4122) to each data object (probably in addition to a less unwieldy "locally unique" identifier) may provide such a means.

Use cases

Implementation

While the assignment of random, data-independent UUIDs would be computationally cheap, this mechanism is rather unfeasible, as UUID assignment would not be consistent for multiple copies of the same data object without resorting to centralized or distributed (e.g., blockchain) data registries.

Data-dependent generation of unique identifiers, by applying one or more standardized cryptographic hash functions to a given data object, would forgo the need for such registries. And since libraries for computing common hash functions such as MD5, SHA-1, SHA-256 and SHA-512 are widely available for all major programming languages, realizing a hash function-based mechanism for assigning quasi-unique identifiers would put only a minimal burden on DRS implementers. A downside of this approach is that calculating hash sums is relatively expensive, and the resources required to compute them generally increase as the likelihood of a hash function producing the same hash sum for non-identical data objects decreases.
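As a rough illustration of the data-dependent approach described above, the following Python sketch computes several hash sums for one data object in a single pass. The function name compute_checksums, its parameters and the default choice of algorithms are assumptions made for this example only; nothing here is prescribed by DRS.

```python
import hashlib
import io


def compute_checksums(stream, algorithms=("sha256", "sha512"), chunk_size=1024 * 1024):
    """Return {algorithm: hex digest} for a binary stream, read once in chunks.

    Hypothetical helper: the name, signature and defaults are illustrative.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    # Read in fixed-size chunks so large objects need not fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# Two independent copies of the same bytes yield the same identifiers,
# which randomly assigned UUIDs could not guarantee without a shared registry.
copy_a = io.BytesIO(b"example data object")
copy_b = io.BytesIO(b"example data object")
assert compute_checksums(copy_a) == compute_checksums(copy_b)
```

Feeding every chunk to all hashers at once means the object is read only one time, which matters because I/O tends to dominate the cost for large files.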

Discussion points

Critical

Minor/syntax

Related issues

#78 suggests an endpoint to search data objects by aliases.

uniqueg commented 5 years ago

This point was basically already raised by Susheel during last week's hackathon, but I felt that it fell somewhat under the radar and, given that the food was threatening to get cold, didn't receive the consideration which I think it deserves... ;-)

I also felt that Susheel's use of hash sums as object IDs in his demonstration may have distracted from the usefulness of a checksum as an additional metadata field (paired with a dedicated query endpoint), as I think that those two serve quite different purposes (as outlined in the issue description above). Of course, individual DRS implementers might still choose to assign the checksum as the object ID as well, and might do so especially if the checksum, but not the corresponding endpoint, made it into DRS v1.0.

sarpera commented 5 years ago

@uniqueg I strongly agree that checksum values are very useful for the use cases you mentioned, especially for findability and DRS-of-DRSes cases. Though I feel like the added value of using a checksum value as a unique identifier among DRS implementations would be lost if there is no consensus on the hash function, as you stated, and such a consensus could be very challenging to achieve. Even if there were a consensus, I could only guess it would take quite a lot of time and resources for DRS implementors to create checksums for their existing/future data, at least for DRS v1.0. Also, although collections/bundles are still very vague, creating a new bundle (POST request) would become an async operation in order to calculate the checksum of checksums of its children, just to return its ID as a response.

An alternative would be to allow a lookup by checksum value, e.g., /objects/?checksum_value=<string> or a similar endpoint, or to use the checksum value as an alias property, which should guarantee that the response for /objects/?alias=<checksum_value> would be a single object, similar to /objects/<id>, if it exists.
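To make the lookup idea above a bit more concrete, here is a minimal in-memory sketch in Python of how a server might resolve a checksum to at most one object. The class ObjectStore and its methods are hypothetical and exist only for this illustration; an actual DRS implementation would sit behind an HTTP endpoint and a real database, and might choose to handle duplicate checksums differently.

```python
from typing import Dict, Optional


class ObjectStore:
    """Hypothetical in-memory store keeping a local ID and a checksum per object."""

    def __init__(self) -> None:
        self._by_id: Dict[str, dict] = {}
        self._id_by_checksum: Dict[str, str] = {}

    def add(self, object_id: str, checksum: str) -> None:
        # Assumption for this sketch: a checksum maps to exactly one object,
        # so a lookup by checksum behaves like a lookup by ID.
        existing = self._id_by_checksum.get(checksum)
        if existing is not None and existing != object_id:
            raise ValueError(f"checksum already registered for object {existing}")
        self._by_id[object_id] = {"id": object_id, "checksum": checksum}
        self._id_by_checksum[checksum] = object_id

    def get_by_id(self, object_id: str) -> Optional[dict]:
        return self._by_id.get(object_id)

    def get_by_checksum(self, checksum: str) -> Optional[dict]:
        # Returns a single object or None, never a list of candidates.
        object_id = self._id_by_checksum.get(checksum)
        return None if object_id is None else self._by_id[object_id]


store = ObjectStore()
store.add("obj-1", "ba7816bf")  # shortened, made-up checksum value
assert store.get_by_checksum("ba7816bf") == store.get_by_id("obj-1")
```

Keeping the checksum as a secondary index rather than as the object ID leaves implementers free to keep short local identifiers while still supporting cross-service lookups.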