NCEAS / fairdataone

DataONE FAIR manuscript

decide mechanism for identifying and pinning data #2

Open mbjones opened 3 years ago

mbjones commented 3 years ago

Data should be loaded from a public archive or a reproducible API source. However, for efficiency, we don't want to download the data every time. So we should do some of the following:

1) reference data by identifier and not via file path
2) use the contentid package to load data for robust data location based on checksum (which also caches data files locally)
3) use the dataone resolve service to find the current location of data files
4) use the pins package to pin a locally cached version of the downloaded data file

Let's discuss the exact approach we want to use for this analysis.

mbjones commented 3 years ago

I added an example in the load_data() function of how to use the contentid package to discover and load a data file based on its hash, which enables it to be loaded from any registered location for that hash. The data come from this package:

Matthew Jones, Peter Slaughter, and Ted Habermann. 2019. Quantifying FAIR: metadata improvement and guidance in the DataONE repository network. Knowledge Network for Biocomplexity. doi:10.5063/F14T6GP0.

This basically boils down to:

library(vroom)

# resolve the file by its content hash and cache a local copy (store = TRUE)
fair_data_hash <- "hash://sha256/77eaa2aa2037f2bd43ad5185d204ad12fba68f315a46c4b0d59bb303512288a5"
fair_data_file <- contentid::resolve(fair_data_hash, store = TRUE)
fair_data <- vroom(fair_data_file)

The fair_data_hash is tightly tied to the contents of the data file, so versioning is built in. The only issue I see with this is that it does not cite the data by its DOI identifier, and therefore doesn't also pull in all of the related metadata for the data file. If we loaded the package with dataone::getPackage() then we would have a more complete set of metadata. Let's discuss. We should also discuss how to track and record provenance relationships in this analysis.
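A hedged sketch of that alternative, for comparison -- the member node ID and the exact identifier to pass to getPackage() (dataset DOI vs. resource map PID) are assumptions drawn from the citation above, not tested here:

library(dataone)

# connect to the production environment and the KNB member node (assumed)
d1c <- D1Client("PROD", "urn:node:KNB")

# download the full data package (data + metadata) as a BagIt zip archive
bagit_zip <- getPackage(d1c, identifier = "doi:10.5063/F14T6GP0")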

jeanetteclark commented 3 years ago

This workflow looks good. Do we need pins, since contentid::resolve caches for us?

I just added a script that generates a summary table of all of the check information; it deliberately does not use this workflow. In these initial stages, I plan to use this script to generate tables that help me review check information. As we update checks, I want the table to reflect those changes, so I will probably change the URL to point to the tip of the branch targeting the new metadig-checks release that will likely come out of the check review.

mbjones commented 3 years ago

Agreed, contentid does caching so pins isn't needed there. pins would come into play for resources we load purely by URL. I'm not sure if we want to rely on contentid yet, as it is not on CRAN and it relies on the external hash-archive service, which does not seem to have a longevity plan. We could discuss plans with @cboettig.
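For that URL-only case, a minimal illustration with the (legacy) pins API -- the URL here is a placeholder, not a real project resource:

library(pins)

# pin() downloads the file once and returns the locally cached path on later calls
cached_csv <- pin("https://example.org/checks/summary.csv")
checks <- read.csv(cached_csv)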

jeanetteclark commented 3 years ago

contentid is on CRAN; I installed it from there this morning (it requires R 4.0, though).

cboettig commented 3 years ago

Thanks for the ping! Yup, contentid has been on CRAN since August, though the >= 4.0 requirement was introduced at CRAN's request last week: in R 4.0, base R's tools package now provides functionality similar to rappdirs for persistent storage, but plays nicer with CRAN's expectations about using tempdirs for tests, examples, etc.

Speaking of storage, contentid doesn't require any interaction with hash-archive.org. By default it searches several content services, including hash-archive.org and Software Heritage, but it can also use one or more purely local registries (either a simple tab-separated text file or an LMDB database; the latter is obviously much faster if you have a few million entries). Most contentid functions take a registries argument, which can be a list of known service URLs (e.g. hash-archive and Software Heritage; I've been meaning to add DataONE) or the location of a .tsv file or LMDB database, e.g. contentid::register("https://github.com/NCEAS/fairdataone/issues/2", "test.tsv").
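As a concrete sketch of the purely local option (registry file name and URL follow the example above; untested):

library(contentid)

# register a URL in a local tab-separated registry; returns its content hash
id <- register("https://github.com/NCEAS/fairdataone/issues/2",
               registries = "test.tsv")

# resolve against only that local registry -- no remote service is consulted
path <- resolve(id, registries = "test.tsv", store = TRUE)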

cboettig commented 3 years ago

The fair_data_hash is tightly tied to the contents of the data file, so versioning is built in. The only issue I see with this is that it does not cite the data by its DOI identifier, and therefore doesn't also pull in all of the related metadata for the data file. If we loaded the package with dataone::getPackage() then we would have a more complete set of metadata. Let's discuss. We should also discuss how to track and record provenance relationships in this analysis.

This is a really excellent point. Why not use the fair_data_hash of the metadata file as the entry point instead? Parsing that metadata, the user could presumably extract the fair_data_hash of the raw data file, as well as the format (e.g. tsv), the DOI, provenance, etc. To me, a script would only start right in with the hash of the data itself if the metadata/provenance were already communicated through some other mechanism -- i.e. the metadata isn't machine readable in the first place but is trapped inside a PDF or printed paper that still states the content hash of the raw data file.
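A rough sketch of that metadata-first entry point -- the hash is a placeholder, and parsing the record with the EML package is an assumption about the metadata format:

library(contentid)
library(EML)

metadata_hash <- "hash://sha256/<hash-of-the-metadata-file>"   # placeholder
metadata_file <- contentid::resolve(metadata_hash, store = TRUE)
eml <- EML::read_eml(metadata_file)

# the raw data file's hash, format, DOI, and provenance would then be read
# out of `eml` rather than hard-coded in the analysis script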

mbjones commented 3 years ago

Thanks, @cboettig, for all of your corrections and input. I am pretty excited about contentid in general, and this paper is a way for us to play with it in a sandbox. I am contemplating several possible changes to help support it in our systems. What would it take to get DataONE listed as a contentid registry? In particular, would we need to support multiple hash types for each object, or could we just support the ones we have for now?

cboettig commented 3 years ago

Thanks, I've been trying to give that a bit of thought as well. To get started I was thinking of just building around what you already have. As you've pointed out to me, it's easy enough to construct a solr query to see if DataONE has a given hash. To my mind, that would be 'good enough' for starters -- i.e. something like: https://github.com/cboettig/contentid/blob/master/inst/examples/dataone_registry.R
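Roughly along these lines -- the Solr field names (checksum, checksumAlgorithm, dataUrl) and the production CN endpoint are assumptions about DataONE's index, in the spirit of the linked example rather than copied from it:

library(httr)
library(jsonlite)

# look for any object whose registered checksum matches the given sha256
sha256 <- "77eaa2aa2037f2bd43ad5185d204ad12fba68f315a46c4b0d59bb303512288a5"
resp <- GET("https://cn.dataone.org/cn/v2/query/solr/",
            query = list(q  = paste0("checksum:", sha256),
                         fl = "identifier,checksum,checksumAlgorithm,dataUrl",
                         wt = "json"))
hits <- fromJSON(content(resp, as = "text"))$response$docs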

Obviously this is less than ideal in that DataONE might have the content but merely be using a different hash for it. But it would still enable most workflows:

Sure, it would be nice if DataONE had, say, a record of each of the five hash types that hash-archive.org computes, or at least had sha256 for each of its objects (like SoftwareHeritage does), but the use cases I can think of that would require that are not as compelling. The hash identifier only helps access; it's useless for discovery. So the main reason I'd want DataONE to know the sha256 of all its blobs is if I already had a copy of the data I wanted (say, my favourite example with that Vostok data), but I didn't know the data was in DataONE. Then I could do resolve("hash://sha256/xxxx") and go "wow, that data is on DataONE already!" But that kind of workflow is almost never going to come up. Typically I'm either the producer of the data, so I have the data and can choose what dataone hash it gets, or I discovered the data by some search method that brought me to DataONE and so I already know it's there. So to me, I think DataONE is already essentially contentid compatible. Am I missing something?

There are really two contentid functions in question here. I think we're mostly talking about resolve, which under the hood would just need the solr query by checksum like I linked above. register is more complicated, since in the DataONE case it means 'publish' the data (since DataONE obviously won't resolve a hash of some data that doesn't live in its own repository network). My instinct here is that, for now at least, contentid::resolve would gain the ability to search DataONE for the hash, but contentid::register() would not interact with DataONE -- users would be directed to the dataone package for publishing new data.

(Aside: internally contentid can compute the same five hashes as hash-archive, i.e. if you register with a local registry. By default it does only sha256, configurable via the environmental variable CONTENTID_ALGOS. A related issue to understanding different hashes is understanding the different serializations/representations of hash ids, like named information, subresource integrity, etc. I did start adding some code to handle these, but so far that's not supported in the local registry, only by hash-archive. This one is obviously a more minor issue, since toggling to base64 encoding or something is relatively trivial compared with switching to a different hash algorithm.)
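For illustration only, per that aside -- the list of algorithm names accepted by CONTENTID_ALGOS is an assumption matching hash-archive's five:

# ask contentid to compute all five hash-archive algorithms when registering locally
Sys.setenv(CONTENTID_ALGOS = "md5,sha1,sha256,sha384,sha512")
id <- contentid::register("https://github.com/NCEAS/fairdataone/issues/2",
                          registries = "test.tsv")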

jhpoelen commented 3 years ago

Very neat to see that DataONE is committing to being queried by content hashes! Is Zenodo next?

jhpoelen commented 3 years ago

I just noticed Carl's related Zenodo feature request in https://github.com/zenodo/zenodo/issues/1985. Perhaps @slint Alex et al. can reconsider and have a look at what it would take to query Zenodo by content hash, especially now that DataONE has that same ability.

cboettig commented 3 years ago

In particular, it looks like the checksum field is already part of the file metadata record (https://github.com/zenodo/zenodo/blob/master/zenodo/modules/records/jsonschemas/records/file-v1.0.0.json), so it would be just brilliant if you would expose that field in your Elasticsearch API (https://help.zenodo.org/guides/search/)?