cboettig / contentid

:package: R package for working with Content Identifiers
http://cboettig.github.io/contentid

Adds DataONE to the list of resolvable registries #65

Closed cboettig closed 3 years ago

cboettig commented 3 years ago

This adds support for querying the DataONE SOLR API for a hash and returning the corresponding dataUrl if found. For example:

tmp <- contentid::resolve("hash://md5/2ac33190eab5a5c739bad29754532d76")
df <- read.csv(tmp)

This adds the DataONE central node as a default registry, which means that any resolve() request (or query_sources() request) will ping the DataONE SOLR API for the hash in question.

@mbjones please let me know if that's not a good default. Also let me know if you'd prefer we do this a different way; as you can see in the source code, we currently query for a matching checksum value and then extract the corresponding data URL.
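
For reference, here's roughly what that lookup looks like: a minimal sketch using httr against the public SOLR endpoint (dataone_lookup is a hypothetical helper name; the actual implementation in this PR lives in R/dataone_registry.R):

library(httr)

dataone_lookup <- function(hash, host = "https://cn.dataone.org") {
  checksum <- gsub("^hash://\\w+/", "", hash)  # strip the hash-URI prefix
  resp <- GET(paste0(host, "/cn/v2/query/solr/"),
              query = list(q  = paste0("checksum:", checksum),
                           fl = "identifier,checksum,checksumAlgorithm,dataUrl",
                           wt = "json"))
  stop_for_status(resp)
  content(resp)$response$docs  # matching records, each carrying a dataUrl
}

dataone_lookup("hash://md5/2ac33190eab5a5c739bad29754532d76")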

Note that the registry can be set explicitly to DataONE by providing it as the only registry in the list:

tmp <- contentid::resolve("hash://md5/2ac33190eab5a5c739bad29754532d76", 
                                         registries = "https://cn.dataone.org")
df <- read.csv(tmp)

If we don't want to hit DataONE by default, this would provide an opt-in way to query it. (Alternatively, users can configure their own preferred defaults via the env var CONTENTID_REGISTRIES; see ?default_registries() for details.)
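
For example, a user could opt in for every session by listing DataONE among their defaults; a small sketch, assuming the comma-separated format described in ?default_registries():

Sys.setenv(CONTENTID_REGISTRIES =
  "https://hash-archive.org, https://cn.dataone.org")
contentid::default_registries()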

Addresses https://github.com/NCEAS/fairdataone/issues/2

Note: currently the DataONE search landing pages do not display content-hash information, nor is it typically included in the EML. Consequently, it is clunkier than it needs to be to determine which hash DataONE is using for any given object in its collection. It would be convenient if the web interface made this easier to discover without going through the SOLR API.
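
In the meantime, the SOLR query to discover the hash for a known object looks something like this sketch (the identifier here is a placeholder, not a real object):

library(httr)
resp <- GET("https://cn.dataone.org/cn/v2/query/solr/",
            query = list(q  = 'identifier:"doi:10.xxxx/placeholder"',
                         fl = "identifier,checksum,checksumAlgorithm",
                         wt = "json"))
content(resp)$response$docs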

codecov-io commented 3 years ago

Codecov Report

Merging #65 (22584d9) into master (041c7d1) will decrease coverage by 4.84%. The diff coverage is 92.64%.

@@            Coverage Diff             @@
##           master      #65      +/-   ##
==========================================
- Coverage   82.06%   77.22%   -4.85%     
==========================================
  Files          18       20       +2     
  Lines         513      562      +49     
==========================================
+ Hits          421      434      +13     
- Misses         92      128      +36     
| Impacted Files | Coverage Δ |
|---|---|
| R/bagit.R | 0.00% <0.00%> (-94.29%) :arrow_down: |
| R/retrieve.R | 50.00% <0.00%> (ø) |
| R/software-heritage.R | 84.84% <50.00%> (-1.36%) :arrow_down: |
| R/query_sources.R | 85.29% <95.23%> (+4.52%) :arrow_up: |
| R/dataone_registry.R | 85.71% <100.00%> (ø) |
| R/default_registries.R | 100.00% <100.00%> (ø) |
| R/query_history.R | 82.35% <100.00%> (+1.10%) :arrow_up: |
| R/register.R | 85.71% <100.00%> (-3.76%) :arrow_down: |
| R/resolve.R | 78.57% <100.00%> (+1.64%) :arrow_up: |
| R/store.R | 100.00% <100.00%> (ø) |

... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 041c7d1...22584d9.

jhpoelen commented 3 years ago

@cboettig great to see that DataONE can be queried by MD5 hash via the contentid package! Is that a recent feature of the DataONE SOLR API?

cboettig commented 3 years ago

@jhpoelen nope, it's always been there, though I didn't realize it until Matt told me about it. Just a technical note: while every DataONE object includes a checksum in the SOLR record, it's up to the individual data depositor to choose which checksum algorithm to use[*]. Much of the existing catalogue uses MD5, followed by SHA-1; the dataone R package's default for uploads has recently moved to SHA-256. Not being able to assume everything has an MD5 sum obviously isn't ideal, but as I argue in the linked issue, I think it still covers a lot of use cases.

[*]: In April I computed MD5, SHA-1, and SHA-256 hashes for over 4 million objects in DataONE (over 95% of the catalogue at the time). A .tsv file of these is available, e.g., via:

tmp <- contentid::resolve("hash://sha256/cf1a63d7cf42df825ffb035f83adc2006a9d6da66a8b896339c04e3dc3865f8a")
df <- vroom::vroom(tmp)

(Most of those objects were also submitted to hash-archive.org, so it should have these available too, but to my mind that's a much more fragile approach than direct support for query-by-hash from the repository!)
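
For reference, contentid can also compute several checksums for a local file; a quick sketch, assuming content_id()'s algos argument behaves as documented in ?content_id:

f <- system.file("extdata", "vostok.icecore.co2", package = "contentid")
contentid::content_id(f, algos = c("md5", "sha1", "sha256"))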

mbjones commented 3 years ago

Yeah, that type of query has been in place since the start of DataONE, but, as Carl said, it's restricted to the checksum provided by the contributing repository. I would like to change that, and will work towards it, but the issue is getting all DataONE member repositories to start reporting multiple checksums for their content. The datasets in DataONE metadata records represent many petabytes of data files, so calculating multiple checksums for all of them is something we can neither centralize nor require. And different repositories have settled on different checksums for all of this content. But I think we can start asking repositories to report an array of checksums, and add that to our discovery services.

mbjones commented 3 years ago

@cboettig this all looks good. Works great from cboettig/contentid@041c7d1 with:

> tmp <- contentid::resolve("hash://md5/e27c99a7f701dab97b7d09c467acf468", 
                             registries = "https://cn.dataone.org")

But I am getting an error when I try with store=TRUE, which seems not to have initialized the storage directory:

> tmp <- contentid::resolve("hash://md5/e27c99a7f701dab97b7d09c467acf468", 
                             registries = "https://cn.dataone.org", 
                             store=TRUE)
Error in registries[dir.exists(registries)][[1]] : 
  subscript out of bounds
Called from: path_tidy(path)

Given that the content_dir() defaults to a reasonable location, I'm not sure why I am getting this error.

What's your naming convention for registries? The URL seems a bit fragile and detailed, and could be simplified with a named vector of registry_name: url pairs with sensible defaults, so users can ask for registries by name (e.g., contentid::resolve("hash://md5/e27c99a7f701dab97b7d09c467acf468", registries = c("dataone", "hash-archive"))). This would simplify use for most users, who don't have these registry URIs memorized. And, like options("repos"), it would make for a more extensible system (albeit one still depending on API adapters).
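
Something along these lines (a hypothetical sketch, not contentid code; registry_aliases and expand_registries are made-up names):

registry_aliases <- c(
  "dataone"      = "https://cn.dataone.org",
  "hash-archive" = "https://hash-archive.org"
)

expand_registries <- function(registries) {
  known <- registries %in% names(registry_aliases)
  registries[known] <- unname(registry_aliases[registries[known]])
  registries  # unrecognized entries pass through as literal URLs
}

expand_registries(c("dataone", "https://hash-archive.carlboettiger.info"))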

mbjones commented 3 years ago

@cboettig one more thought. You wrote:

query for a matching checksum value and then extract the corresponding data url

Because any object in DataONE might be replicated to multiple repositories, the dataUrl might actually be a list of URLs with alternate locations corresponding to the replicaMN locations. Should one of those be down, the others could be used to retrieve the object. What do you think about adding some error handling, so that if the first data URI fails with a 404 we move on to the others in the list? Sometimes nodes are down temporarily, and sometimes for longer periods; guarding against that seems helpful.
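
Something like this sketch, where download_first_available is a hypothetical helper and urls is the vector of alternate locations:

download_first_available <- function(urls, dest = tempfile()) {
  for (u in urls) {
    ok <- tryCatch({
      curl::curl_download(u, dest, quiet = TRUE)
      TRUE
    }, error = function(e) FALSE)
    if (ok) return(dest)  # first location that responds wins
  }
  stop("no working location among: ", paste(urls, collapse = ", "))
}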

cboettig commented 3 years ago

Thanks @mbjones! Hmm, good catch: the store functionality hasn't actually been generalized to non-sha256 hashes, so we'll definitely need to fix that.

Currently, store saves files using a naming convention of {content-dir}/data/aa/bb/aabbcc... where aabbcc... is the sha256 hash. In this way, we can treat the store itself as a registry, but it's a registry that only understands sha256 hashes, so contentid doesn't know how to retrieve things by the other hashes. We could fix this a few different ways, and I'm not sure which is best.
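
For concreteness, that layout looks like this sketch (store_path is a hypothetical helper, not the package's actual function):

store_path <- function(id, dir = contentid::content_dir()) {
  hash <- gsub("^hash://sha256/", "", id)
  file.path(dir, "data", substr(hash, 1, 2), substr(hash, 3, 4), hash)
}

store_path("hash://sha256/cf1a63d7cf42df825ffb035f83adc2006a9d6da66a8b896339c04e3dc3865f8a")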

Also, great point about the naming conventions for registries. Part of the trade-off here is in supporting multiple registries that share an API. For example, contentid::register("https://www.carlboettiger.info", registries = c("https://hash-archive.carlboettiger.info", "https://hash-archive.org")) registers to my self-hosted instance of hash-archive as well. Similarly, one can register to multiple local tsv files, though maybe that use case is unmotivated. For DataONE, maybe the user wants to use a test server? In general I feel weird about hardwiring base URLs to APIs, and we probably need a better model for extending the backends, since I agree the current approach is a bit hacky. Anyway, we can definitely support simple names instead of URIs for now. Meanwhile, I imagine most users will just stick with the defaults, and won't remember what list of services resolve is searching anyhow.

Re replica nodes: yeah, I thought about that. But it looks like we are just getting back resolve URLs, and not the download URLs of the replica nodes:

contentid::query_sources("hash://md5/e27c99a7f701dab97b7d09c467acf468",  registries = "https://cn.dataone.org")
                                                                               source
1 https://cn.dataone.org/cn/v2/resolve/ess-dive-0462dff585f94f8-20180716T160600643874
                 date
1 2021-01-21 04:56:04

IIRC, that resolve URL already automatically handles selecting which replica node actually serves the data. I know we can get the actual contentURL list for each member node, but I also recall that requires some additional requests to your servers? Do you think we should do that, or stick with the resolve URL we currently get back?

In general, contentid loves getting back a vector of URLs for the content, and the resolve function will already iterate over those if for some reason the first request fails to return content with the matching hash.

cboettig commented 3 years ago

@mbjones oh, one (hopefully) small thing: do you think you could make the checksums display somehow on the search.dataone.org landing pages? To me that would be easier and probably more valuable than getting DataONE nodes to always report all hashes. Currently I have to query the SOLR API to figure out what hash was used for a given product.

mbjones commented 3 years ago

@cboettig I put in a ticket to display checksums in our landing pages in https://github.com/NCEAS/metacatui/issues/1651

Regarding the resolve URI versus the object URI in DataONE: that makes sense. In DataONE, we support three types of URIs:

1) Global, service-independent URIs for datasets and other objects
2) Resolve URIs, which look up an object's current locations
3) Object location URIs, which point at a copy on a specific member node

So, right now you are getting (2), a resolve URI, back from the SOLR query. To get the list of current location URIs for that object, you would request the resolve URI with a text/xml Accept header, which returns a list of object location URIs like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:objectLocationList xmlns:ns2="http://ns.dataone.org/service/types/v1">
    <identifier>ess-dive-457358fdc81d3a5-20180726T203952542</identifier>
    <objectLocation>
        <nodeIdentifier>urn:node:ESS_DIVE</nodeIdentifier>
        <baseURL>https://data.ess-dive.lbl.gov/catalog/d1/mn</baseURL>
        <version>v1</version>
        <version>v2</version>
        <url>https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542</url>
    </objectLocation>
    <objectLocation>
        <nodeIdentifier>urn:node:KNB</nodeIdentifier>
        <baseURL>https://knb.ecoinformatics.org/knb/d1/mn</baseURL>
        <version>v1</version>
        <version>v2</version>
        <url>https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542</url>
    </objectLocation>
    <objectLocation>
        <nodeIdentifier>urn:node:UIC</nodeIdentifier>
        <baseURL>https://dataone.lib.uic.edu/metacat/d1/mn</baseURL>
        <version>v1</version>
        <version>v2</version>
        <url>https://dataone.lib.uic.edu/metacat/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542</url>
    </objectLocation>
    <objectLocation>
        <nodeIdentifier>urn:node:mnORC1</nodeIdentifier>
        <baseURL>https://mn-orc-1.dataone.org/knb/d1/mn</baseURL>
        <version>v1</version>
        <version>v2</version>
        <url>https://mn-orc-1.dataone.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542</url>
    </objectLocation>
</ns2:objectLocationList>

The idea is that the specific object URIs at which an object might be found are transient over time, and the resolution service keeps track of where they are currently located. So it's fine to cache the resolve URI (2), as it will always give you a current list, but don't expect the object location URIs to be persistent. And if contentid followed the resolve URI this way, it would get back a whole list of locations it could try.
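
From R, fetching that location list looks something like this sketch (httr + xml2, using the identifier from the example above):

library(httr)
library(xml2)

resp <- GET("https://cn.dataone.org/cn/v2/resolve/ess-dive-457358fdc81d3a5-20180726T203952542",
            accept("text/xml"))
doc <- content(resp, as = "parsed", type = "text/xml")
xml_text(xml_find_all(doc, "//objectLocation/url"))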

cboettig commented 3 years ago

Thanks @mbjones. I've just implemented those two suggestions, so:

 x <- resolve("hash://md5/2ac33190eab5a5c739bad29754532d76",  registries = "dataone", store = TRUE)

should work as expected now.

Note: resolve (and similar functions) will now automatically expand the exact string "dataone" to the corresponding URI, but still accepts the old format of giving the URI directly, so this won't break old code and remains flexible for alternate URLs (e.g., for alternate hash-archive hosts).

You should be able to retrieve the cached copy using the md5 hash, e.g.

 x <- resolve("hash://md5/2ac33190eab5a5c739bad29754532d76", store = TRUE)

Note: keep in mind that dataone is already a default registry, so users don't need to pass registries = "dataone". Also, setting registries explicitly excludes all other registries from the list (in particular, the default content_dir() used for the store), so you'd need to include the store path, resolve("hash://md5/2ac33190eab5a5c739bad29754532d76", registries = c("dataone", content_dir()), store = TRUE), or just leave the default.