As a dataset user I want to ensure the identity of my dataset with a SHA calculation in my metadata.

mjy commented 3 years ago

Looking at https://bioschemas.org/profiles/Dataset/0.4-DRAFT/.

When I provide a dataset and its metadata, I want to give the people who will use that dataset a guaranteed test, one that proves the data in their possession is the data I provided the metadata for. One way to do this is to calculate SHA hashes of the components of the dataset and include those values in the metadata.
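To make the idea concrete, here is a minimal sketch of the calculation, assuming the dataset is a directory of files and that SHA-256 is the chosen algorithm. The function names and the shape of the manifest are hypothetical, not part of any Bioschemas profile.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(dataset_dir):
    """Map each file's path (relative to the dataset root) to its SHA-256 digest."""
    root = Path(dataset_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

A user who receives the files can rebuild the manifest locally and compare it with the values published in the metadata; any difference means the bytes are not the ones that were described.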

I am not an expert in these issues, but I would argue this concept is increasingly important, and will perhaps ultimately supplant the general concept of generated identifiers (e.g. #310) for datasets or data in general. SHAs are key parts of distributed data/filesharing networks etc.

It is important to distinguish assertions of identity that are computed from the content of the data from those that are classic identifiers (e.g. DOIs). The latter class suffers from all sorts of problems and shortcomings, including resolution, governance, rendering, etc.

To start educating users (including myself) about the need for SHA/fingerprinting of digital files, I (and others, this is not my idea) would argue that we should make these fingerprints critical components of the metadata. We should ensure they are not lumped into a general class of identifiers.

alaninmcr commented 3 years ago

I agree. Some checksums have been proposed as parkeon attributes https://schema.parkeon.com/search_view/checksum . BagIt, for example, also includes checksums for objects as part of the manifest of the bag. I think some wider community discussion would be useful so that similar approaches could be adopted across domains.

stain commented 3 years ago

Have you considered also using something like BagIt to add checksums?

Here's how we combine RO-Crate with BagIt: https://www.researchobject.org/ro-crate/1.1/appendix/implementation-notes.html#adding-ro-crate-to-bagit

You could also combine this with a "Naming Things with Hashes" (RFC 6920) nih: URI, which you can add to Bioschemas using identifier, e.g.:

"identifier": "nih:sha-256-120;5326-9057-e12f-e2b7-4ba0-7c89-2560-a2;f"
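For illustration, a rough sketch of how such a value could be produced. It assumes the RFC 6920 conventions as I understand them: the SHA-256 digest is truncated to its leading 120 bits, written as hex in groups of four, and followed by a Luhn mod 16 check digit. The helper names `luhn_mod16` and `nih_uri` are hypothetical.

```python
import hashlib

HEX_DIGITS = "0123456789abcdef"


def luhn_mod16(hex_digits):
    """Compute a Luhn mod 16 check digit over a string of lowercase hex digits."""
    total = 0
    factor = 2  # the rightmost digit is doubled, then factors alternate 2, 1, 2, ...
    for char in reversed(hex_digits):
        addend = factor * HEX_DIGITS.index(char)
        # Sum the two base-16 digits of the product.
        total += addend // 16 + addend % 16
        factor = 1 if factor == 2 else 2
    return HEX_DIGITS[(16 - total % 16) % 16]


def nih_uri(data, bits=120):
    """Build an nih: URI from the leading `bits` bits of the SHA-256 of `data`.

    Assumes `bits` is a multiple of 4 so the truncation falls on a hex digit.
    """
    truncated = hashlib.sha256(data).hexdigest()[: bits // 4]
    groups = [truncated[i:i + 4] for i in range(0, len(truncated), 4)]
    return f"nih:sha-256-{bits};{'-'.join(groups)};{luhn_mod16(truncated)}"
```

The check digit lets a person who typed or copied the value catch simple transcription mistakes before comparing it with the hash of the file.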

Note that a hash should apply to a particular distribution (a DataDownload) rather than to a Dataset, which is not necessarily directly downloadable. For instance, the download may be a .zip or a .csv.gz, so the checksum should be recorded at that level.
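A small sketch of what that could look like, with the checksum attached to each DataDownload rather than to the Dataset. The `sha256` key, the example URL, and the helper names are assumptions for illustration; which property a given Bioschemas profile would actually accept is exactly what this issue is asking.

```python
import hashlib


def describe_download(content_url, payload, encoding_format):
    """Describe one downloadable file, with the digest of its exact bytes."""
    return {
        "@type": "DataDownload",
        "contentUrl": content_url,
        "encodingFormat": encoding_format,
        # Assumed property name; the hash covers the compressed file as served.
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def matches_metadata(download, payload):
    """Check that the bytes a user holds are the bytes the metadata describes."""
    return hashlib.sha256(payload).hexdigest() == download["sha256"]


# Placeholder bytes standing in for the real file contents.
example_payload = b"placeholder bytes for data.csv.gz"

dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example dataset",
    "distribution": [
        describe_download(
            "https://example.org/data.csv.gz", example_payload, "application/gzip"
        )
    ],
}
```

Because the digest is tied to the .csv.gz file itself, the same Dataset could list a second distribution (say a .zip) with its own, different checksum.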

BioSchemas / specifications

As a dataset user I want to ensure the identity of my dataset with a SHA calculation in my metadata. #493