mbjones opened 3 years ago
I added an example in the `load_data()` function of how to use the `contentid` package to discover and load a data file based on its hash, which enables it to be loaded from any registered location for that hash. The data come from this package:

Matthew Jones, Peter Slaughter, and Ted Habermann. 2019. Quantifying FAIR: metadata improvement and guidance in the DataONE repository network. Knowledge Network for Biocomplexity. doi:10.5063/F14T6GP0.
This basically boils down to:

```r
library(contentid)
library(vroom)

fair_data_hash <- "hash://sha256/77eaa2aa2037f2bd43ad5185d204ad12fba68f315a46c4b0d59bb303512288a5"
fair_data_file <- contentid::resolve(fair_data_hash, store = TRUE)
fair_data <- vroom(fair_data_file)
```
The `fair_data_hash` is tightly tied to the contents of the data file, so versioning is built in. The only issue I see with this is that it does not cite the data by its DOI identifier, and therefore doesn't also pull in all of the related metadata for the data file. If we loaded the package with `dataone::getPackage()`, then we would have a more complete set of metadata. Let's discuss. We should also discuss how to track and record provenance relationships in this analysis.
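For comparison, a minimal sketch of the `dataone`-based alternative mentioned above. The environment, member node, and download directory are placeholder assumptions; `getPackage()` retrieves the whole data package (data files plus their metadata) rather than a single file:

```r
# Hedged sketch, assuming the standard dataone client setup; the node and
# dirPath arguments here are illustrative placeholders.
library(dataone)

d1c <- D1Client("PROD", "urn:node:KNB")

# Download the full package (data + EML metadata) identified by the DOI,
# as a BagIt archive, into a local directory.
pkg_path <- getPackage(d1c, identifier = "doi:10.5063/F14T6GP0",
                       dirPath = tempdir())
```

The trade-off is exactly the one raised above: this pulls in the DOI-linked metadata, but the entry point is a mutable identifier rather than a content hash.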
This workflow looks good. Do we need `pins`, since `contentid::resolve` caches for us?
I just added a script that generates a summary table of all of the check information that deliberately does not use this workflow. In these initial stages, I plan on using this script to generate tables that help me review check information. As we update checks, I want to make sure the table reflects the changes that are made, so I will probably change the URL to point to the tip of the branch aimed at the new release of metadig-checks that will likely come from the check review.
Agreed, `contentid` does caching, so `pins` isn't needed there. `pins` would come into play for resources we load purely by URL. I'm not sure if we want to rely on contentid yet, as it is not on CRAN and it relies on the external hash-archive service, which does not seem to have a longevity plan. We could discuss plans with @cboettig.
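For the URL-only case, a hedged sketch using the classic `pins` API (circa v0.4, current at the time of this thread); the URL is a placeholder:

```r
# Sketch, assuming the legacy pins API: pin() downloads the URL once,
# caches it locally, and returns the local path on subsequent calls.
library(pins)

local_path <- pin("https://example.org/some-resource.csv")
data <- read.csv(local_path)
```

Unlike `contentid`, this caches by URL rather than by content hash, so it does not verify that the bytes behind the URL haven't changed.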
`contentid` is on CRAN; I installed it from there this morning (requires R 4.0, though).
Thanks for the ping! Yup, `contentid` has been on CRAN since August, though the >= 4.0 requirement was introduced at CRAN's request last week, since base R's 4.0 version of the `tools` package now provides functionality similar to `rappdirs` for persistent storage, but plays nicer with CRAN's expectations about using tempdirs for tests and examples, etc.
Speaking of storage, `contentid` doesn't require any interaction with hash-archive.org. `contentid` by default searches several content services, including hash-archive.org and Software Heritage, but can have one or more purely local registries (which can use either a simple tab-separated text file or an LMDB database; obviously the latter is much faster if you have a few million entries). Most `contentid` functions take a `registries` argument, which can be a list of known URLs (e.g. hash-archive and Software Heritage; though I've been meaning to add DataONE), or the location of a `.tsv` file or LMDB database, e.g. `contentid::register("https://github.com/NCEAS/fairdataone/issues/2", "test.tsv")`.
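Putting the two pieces together, a short sketch of a fully local workflow that never touches hash-archive.org (the registered URL here is just the example from above):

```r
# Sketch: register a resource in a purely local TSV registry, then resolve
# it against that same registry, bypassing the remote content services.
library(contentid)

id <- register("https://github.com/NCEAS/fairdataone/issues/2",
               registries = "test.tsv")

path <- resolve(id, registries = "test.tsv", store = TRUE)
```

Swapping `"test.tsv"` for an LMDB path should, per the description above, behave the same but scale to millions of entries.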
> The fair_data_hash is tightly tied to the contents of the data file, so versioning is built in. The only issue I see with this is that it does not cite the data by its DOI identifier, and therefore doesn't also pull in all of the related metadata for the data file. If we loaded the package with dataone::getPackage() then we would have a more complete set of metadata. Let's discuss. We should also discuss how to track and record provenance relationships in this analysis.
This is a really excellent point. Why not use the `fair_data_hash` of the metadata file as the entry point instead? Parsing that metadata, the user could presumably then extract the `fair_data_hash` of the raw data file, as well as the format (e.g. tsv), the DOI, provenance, etc. To me, a script would only start right in with the hash of the data itself if the metadata/provenance etc. was already communicated by some other mechanism -- i.e. maybe that metadata isn't machine-readable in the first place, but is trapped inside a PDF or printed paper that nonetheless states the content hash of the raw data file.
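A hedged sketch of that metadata-first entry point. Everything here is hypothetical: the hash is a placeholder for the EML record's (not the data's) content hash, and the exact EML path to the checksum depends on how the metadata was authored:

```r
# Sketch: resolve the metadata record by its content hash, then mine it
# for the data file's own hash, format, DOI, and provenance.
library(contentid)
library(emld)   # for reading EML metadata as a list structure

metadata_hash <- "hash://sha256/..."   # placeholder: hash of the EML record
eml <- emld::as_emld(contentid::resolve(metadata_hash, store = TRUE))

# e.g. walk eml$dataset$dataTable$physical to find the data file's
# checksum, format (tsv), and distribution URL, then resolve that
# hash in turn with contentid::resolve().
```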
Thanks, @cboettig, for all of your corrections and input. I am pretty excited about contentid in general, and this paper is a way for us to play with it in a sandbox. I am contemplating several possible changes to help support it in our systems. What would it take to get DataONE listed as a contentid registry? In particular, would we need to support multiple hash types for each object, or could we just support the ones we have for now?
Thanks, I've been trying to give that a bit of thought as well. To get started I was thinking of just building around what you already have. As you've pointed out to me, it's easy enough to construct a Solr query to see if DataONE has a given hash. To my mind, I think that would be 'good enough' for starters -- i.e. something like: https://github.com/cboettig/contentid/blob/master/inst/examples/dataone_registry.R
Obviously this is less than ideal in that DataONE might have the content but merely be using a different hash for it. But it would still enable most workflows: e.g. an author publishing data to DataONE (say, via the `dataone` package) and then using content identifiers. The author would then know to use `hash://sha256`, since that's now the default the package uses (or if they manually chose a different hash, they'd know that too).

Sure, it would be nice if DataONE had, say, a record of each of the five hash types that hash-archive.org computes, or at least had sha256 for each of its objects (like Software Heritage does), but the use cases I can think of that would require that are not as compelling. The hash identifier only helps access; it's useless for discovery. So the main reason I'd want DataONE to know the sha256 of all its blobs is if I already had a copy of the data I wanted (say, my favourite example with that Vostok data), but I didn't know the data was in DataONE. Then I could do `resolve("hash://sha256/xxxx")` and go "wow, that data is on DataONE already!" But that kind of workflow is almost never going to come up. Typically I'm either the producer of the data, so I have the data and can choose what DataONE hash it gets, or I discovered the data by some search method that brought me to DataONE, and so I already know it's there. So to me, I think DataONE is already essentially `contentid` compatible. Am I missing something?
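In the spirit of the `dataone_registry.R` example linked above, a hedged sketch of that Solr-by-checksum lookup. The coordinating-node query endpoint and the `checksum`/`checksumAlgorithm` field names are assumptions based on the public DataONE Solr index:

```r
# Sketch: ask the DataONE CN Solr index whether any object has this checksum.
library(httr)
library(jsonlite)

dataone_has_hash <- function(sha256_hex) {
  base <- "https://cn.dataone.org/cn/v2/query/solr/"
  q <- paste0("checksum:", sha256_hex, " AND checksumAlgorithm:SHA-256")
  resp <- GET(base, query = list(q = q, fl = "identifier,checksum", wt = "json"))
  hits <- fromJSON(content(resp, as = "text"))$response
  hits$numFound > 0   # TRUE if DataONE already holds this content
}
```

As noted above, a miss here only means DataONE isn't indexing that particular hash, not that the content is absent.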
There are really two `contentid` functions in question here. I think we're mostly talking about `resolve`, which under the hood would just need the Solr query by checksum like I linked above. `register` is more complicated, since in the DataONE case it means 'publish' the data (since DataONE obviously won't resolve a hash of some data that doesn't live in its own repository network). My instinct here is that, for now at least, `contentid::resolve` would gain the ability to search DataONE for the hash, but `contentid::register()` would not interact with DataONE -- users would be directed to the `dataone` package for publishing new data.
(Aside, but internally `contentid` can compute the same five hashes as hash-archive, i.e. if you do `register` with a local registry. By default it does only sha256, configurable using the environment variable `CONTENTID_ALGOS`. A related issue to understanding different hashes is understanding the different serializations/representations of hash IDs, like named information, subresource integrity, etc. I did start adding some code to handle these, but so far that's not supported in the local registry, only by hash-archive. This one's obviously a more minor issue, since toggling to base64 encoding or something is relatively trivial compared with switching to a different hash algorithm.)
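To make the `CONTENTID_ALGOS` configuration concrete, a hedged sketch; the comma-separated algorithm names follow the five hash-archive algorithms mentioned above, and the URL/registry path are placeholders:

```r
# Sketch: ask contentid to record all five hash-archive algorithms
# (instead of only the sha256 default) when registering locally.
Sys.setenv(CONTENTID_ALGOS = "md5,sha1,sha256,sha384,sha512")

contentid::register("https://example.org/data.csv",
                    registries = "local.tsv")
```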
Very neat to see that DataONE is committing to being queried by content hashes! Is Zenodo next?
I just noticed Carl's related Zenodo feature request in https://github.com/zenodo/zenodo/issues/1985 . Perhaps @slint Alex et al. can reconsider and have a look at what it would take to query Zenodo by content hash, especially now that DataONE has that same ability.
In particular, it looks like the field `checksum` is already part of the file metadata record (https://github.com/zenodo/zenodo/blob/master/zenodo/modules/records/jsonschemas/records/file-v1.0.0.json), so it would be just brilliant if you could expose that field in your Elastic API: https://help.zenodo.org/guides/search/
Data should be loaded from a public archive or a reproducible API source. However, for efficiency, we don't want to download the data every time. So we should do some of the following:

1) reference data by identifier and not via file path
2) use the `contentid` package to load data for robust data location based on checksum (which also caches data files locally)
3) use the `dataone` resolve service to find the current location of data files
4) use the `pins` package to pin a locally cached version of the downloaded data file

Let's discuss the exact approach we want to use for this analysis.
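As a starting point for that discussion, a hedged sketch combining options 1, 2, and 4: prefer the checksum-based `contentid` lookup (which caches locally), and fall back to a plain URL if resolution fails. The function name and fallback URL are illustrative, not part of the repo:

```r
# Sketch, assuming contentid and vroom are available; load_fair_data()
# is a hypothetical helper, not an existing function in this repo.
library(contentid)
library(vroom)

load_fair_data <- function(hash, fallback_url) {
  path <- tryCatch(
    contentid::resolve(hash, store = TRUE),  # checksum-verified, cached locally
    error = function(e) fallback_url         # last resort: plain URL
  )
  vroom(path)
}
```

Option 3 (the DataONE resolve service) could slot in as an additional registry for `contentid::resolve`, per the discussion above.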