Open uniqueg opened 5 years ago
This point was basically already raised by Susheel during last week's hackathon, but I felt that it fell somewhat under the radar and didn't receive the consideration I think it deserves, given that the food was threatening to get cold... ;-)
I also felt that Susheel's use of hash sums as object IDs in his demonstration may have distracted from the usefulness of a checksum as an additional metadata field (paired with a dedicated query endpoint), as I think those two serve quite different purposes (as outlined in the issue description above). Of course, individual DRS implementers might still choose to assign the checksum as the object ID as well, and might do so especially if the checksum, but not the corresponding endpoint, makes it into DRS v1.0.
@uniqueg I strongly agree that checksum values are very useful for the use cases you mentioned, especially for findability and DRS-of-DRSes cases. However, I feel the added value of using the checksum value as a unique identifier across DRS implementations would be lost if there is no consensus on the hash function, as you stated, which could be very challenging to achieve. Even if there were a consensus, I can only guess that it would take quite a lot of time and resources for DRS implementers to create checksums for their existing/future data, at least for DRS v1.0. Also, although collections/bundles are still very vague, creating a new bundle (POST request) would become an async operation in order to calculate the checksum of the checksums of its children, just to return its ID as a response.
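The checksum-of-checksums concern can be sketched as follows. This is a minimal illustration of one possible scheme (hashing the sorted, concatenated child digests); DRS does not mandate this exact construction, and the function name is hypothetical:

```python
import hashlib

def bundle_checksum(child_checksums):
    """Derive a deterministic checksum for a bundle from its children's
    checksums: sort the child digests, concatenate them, and hash the
    result. Sorting makes the bundle checksum independent of child order.
    (Illustrative only; not a scheme prescribed by the DRS spec.)"""
    concatenated = "".join(sorted(child_checksums)).encode("ascii")
    return hashlib.md5(concatenated).hexdigest()
```

Note that even with this scheme, the children's checksums must already be known, which is why creating a deeply nested bundle could turn into a long-running (async) operation.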
An alternative would be to allow lookup by checksum value, e.g. `/objects/?checksum_value=<string>` or a similar endpoint, or to use the checksum value as an `alias` property, which should guarantee that the response for `/objects/?alias=<checksum_value>` would be a single object, similar to `/objects/<id>`, if it exists.
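To make the difference between the two variants concrete, here is a small in-memory sketch of the server-side lookup. The class and method names (`DRSIndex`, `lookup_by_checksum`) are illustrative, not part of the DRS spec:

```python
from collections import defaultdict

class DRSIndex:
    """Toy index illustrating a /objects/?checksum_value=<v> lookup.

    A plain checksum query may legitimately match several objects (multiple
    copies of the same bytes); a server that instead enforces
    checksum-as-alias uniqueness would return at most one record."""

    def __init__(self):
        self._by_checksum = defaultdict(list)
        self._objects = {}

    def add(self, object_id, checksum):
        # Register an object under its service-local ID and its checksum.
        self._objects[object_id] = {"id": object_id, "checksum": checksum}
        self._by_checksum[checksum].append(object_id)

    def lookup_by_checksum(self, checksum):
        # Return every object record whose checksum matches.
        return [self._objects[i] for i in self._by_checksum[checksum]]
```

Whether duplicate matches are allowed (query endpoint) or rejected at registration time (alias semantics) is exactly the design choice the two proposals differ on.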
Synopsis
A strong case can be made for the usefulness of providing means to uniquely identify data objects even in DRS v1.0. Assigning a hash sum-based "universally unique identifier" (UUID; quoted because it will very likely not conform to RFC 4122) to each data object (probably in addition to a less unwieldy "locally unique" identifier) may provide such a means.

Use cases
Implementation
While the assignment of random, data-independent UUIDs would be computationally cheap, this mechanism is infeasible here, as UUID assignment would not be consistent across multiple copies of the same data object without resorting to centralized or distributed (e.g., blockchain) data registries.
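The distinction can be shown in a few lines: random identifiers differ between copies of the same bytes, whereas content-derived identifiers coincide. A minimal illustration:

```python
import hashlib
import uuid

data = b"the same bytes stored at two different services"

# Random, data-independent identifiers differ between copies of the object:
print(uuid.uuid4() == uuid.uuid4())  # False

# A content-derived identifier is identical for identical bytes,
# with no registry needed to coordinate the two services:
id_copy_a = hashlib.sha256(data).hexdigest()
id_copy_b = hashlib.sha256(data).hexdigest()
print(id_copy_a == id_copy_b)  # True
```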
Data-dependent generation of unique identifiers, via the application of one or more standardized cryptographic hash functions to a given data object, would forego the need for such registries. And since libraries for computing common hash functions such as MD5, SHA-1, SHA-256 and SHA-512 are widely available for all major programming languages, realizing a hash function-based mechanism for assigning quasi-unique identifiers would place only a minimal burden on DRS implementers. A downside of this approach is that calculating hash sums is relatively expensive, and the resources required to compute them are generally inversely proportional to the likelihood of a hash function producing the same hash sum for non-identical data objects.
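The cost concern can be mitigated somewhat by streaming: a file only needs to be read once even when several digests are computed. A sketch using Python's standard `hashlib` (function name and chunk size are illustrative):

```python
import hashlib

def file_checksums(path, algorithms=("md5", "sha256"), chunk_size=1 << 20):
    """Stream a file once, in 1 MiB chunks, and feed every chunk to all
    requested hashers, so large objects are neither read repeatedly nor
    loaded into memory whole. Returns {algorithm: hex digest}."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

Even so, the single pass over the data is unavoidable, which is the real cost for implementers retrofitting checksums onto large existing holdings.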
Discussion points
Critical
Is a hash sum-based unique identifier desirable for DRS v1.0? And is a query endpoint desirable, too, for the first version?

Is there a danger of confusion between (1) the drs://host/id assigned to each individual copy of a data object, and (2) a hash sum unique for all copies of a data object that is likely used only internally/programmatically for the above-mentioned use cases? If so, are there suggestions how this could be avoided while maintaining both ease-of-use and the power of uniquely identifying data objects across services?

Minor/syntax
Related issues
#78 suggests an endpoint to search data objects by aliases.