Merging #65 (22584d9) into master (041c7d1) will decrease coverage by 4.84%. The diff coverage is 92.64%.
@@            Coverage Diff             @@
##           master      #65      +/-   ##
==========================================
- Coverage   82.06%   77.22%   -4.85%
==========================================
  Files          18       20       +2
  Lines         513      562      +49
==========================================
+ Hits          421      434      +13
- Misses         92      128      +36
Impacted Files | Coverage Δ | |
---|---|---|
R/bagit.R | 0.00% <0.00%> (-94.29%) | ↓ |
R/retrieve.R | 50.00% <0.00%> (ø) | |
R/software-heritage.R | 84.84% <50.00%> (-1.36%) | ↓ |
R/query_sources.R | 85.29% <95.23%> (+4.52%) | ↑ |
R/dataone_registry.R | 85.71% <100.00%> (ø) | |
R/default_registries.R | 100.00% <100.00%> (ø) | |
R/query_history.R | 82.35% <100.00%> (+1.10%) | ↑ |
R/register.R | 85.71% <100.00%> (-3.76%) | ↓ |
R/resolve.R | 78.57% <100.00%> (+1.64%) | ↑ |
R/store.R | 100.00% <100.00%> (ø) | |
... and 5 more | | |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 041c7d1...22584d9.
@cboettig great to see that DataONE can be queried by md5 hash via the contentid package! Is that a recent feature of the DataONE SOLR API?
@jhpoelen nope, it's always been there, though I didn't realize until Matt told me about it. Just a technical note: while every DataONE object includes a checksum in the SOLR record, it's up to the individual data depositor to choose which checksum algorithm to use[*]. Much of the existing catalogue is MD5, followed by SHA-1. The current default of the dataone R package has recently moved to SHA-256 for uploads. Obviously not being able to assume everything has an MD5 sum isn't ideal, but as I argue in the linked issue, I think it still covers a lot of use cases.
[*]: In April I computed md5, sha-1, sha-256 hashes for over 4 million objects in DataONE, over 95% of the catalogue at the time. A .tsv
file of these is available e.g. via:
tmp <- contentid::resolve("hash://sha256/cf1a63d7cf42df825ffb035f83adc2006a9d6da66a8b896339c04e3dc3865f8a")
df <- vroom::vroom(tmp)
(Most of those objects were also submitted to hash-archive.org, so it should have these available too, but to my mind that's a much more fragile approach than having direct support for query-by-hash from the repository!)
Yeah, that type of query has been in place since the start of DataONE but, as Carl said, it's restricted to the checksum provided by the contributing repository. I would like to change that, and will work towards it, but the issue is getting all member DataONE repositories to start reporting multiple checksums for their content. The datasets in DataONE metadata records represent many petabytes of data files, so calculating multiple checksums for them all is neither something we can centralize nor require. And different repositories have settled on different checksums for all of this content. But I think we can start asking repositories to report an array of checksums, and add that to our discovery services.
@cboettig this all looks good. Works great from cboettig/contentid@041c7d1
with:
> tmp <- contentid::resolve("hash://md5/e27c99a7f701dab97b7d09c467acf468",
+                           registries = "https://cn.dataone.org")
But I am getting an error when I try with store=TRUE
, which seems to not have initialized the storage directory:
> tmp <- contentid::resolve("hash://md5/e27c99a7f701dab97b7d09c467acf468",
+                           registries = "https://cn.dataone.org",
+                           store = TRUE)
Error in registries[dir.exists(registries)][[1]] :
  subscript out of bounds
Called from: path_tidy(path)
Given that content_dir() defaults to a reasonable location, I'm not sure why I am getting this error.
What's your naming convention for registries? It seems like the URL is a bit fragile and detailed, and could be simplified with a named vector of registry_name: url pairs with sensible defaults, so users can just ask for registries by name (e.g., contentid::resolve("hash://md5/e27c99a7f701dab97b7d09c467acf468", registries = c("dataone", "hash-archive"))). This would simplify use for most users, who don't have these registry URIs memorized. And like options("repos"), it would make for a more extensible system (albeit still depending on API adapters).
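A sketch of the kind of lookup I have in mind (the alias table and helper here are illustrative, not an existing contentid feature):

registry_aliases <- c(
  "dataone"      = "https://cn.dataone.org",
  "hash-archive" = "https://hash-archive.org"
)

as_registry_url <- function(x) {
  # expand known names to their URLs; pass anything else through unchanged
  known <- x %in% names(registry_aliases)
  x[known] <- registry_aliases[x[known]]
  x
}

as_registry_url(c("dataone", "https://hash-archive.carlboettiger.info"))
#> [1] "https://cn.dataone.org"  "https://hash-archive.carlboettiger.info"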
@cboettig one more thought. You wrote:
query for a matching checksum value and then extract the corresponding data url
Because any object in DataONE might be on multiple repositories, the dataUrl might actually be a list of URLs with alternate locations that correspond to the replicaMN locations. Should one of those be down, the others could be used to retrieve the object. What do you think about adding some error handling in case the first data URI fails with a 404, moving on to the others in the list? Sometimes nodes are down temporarily, and sometimes for longer periods. Guarding against that seems helpful.
Thanks @mbjones! hmm, good catch, the store
functionality hasn't actually been generalized to non-sha256 hashes, so we'll definitely need to fix that.
Currently, store saves files using a naming convention of {content-dir}/data/aa/bb/aabbccxxxxx, where aabbccxxxxx is the sha256 hash. In this way, we can treat the store itself as a registry, but it's a registry that only understands sha256 hashes, so contentid doesn't know how to retrieve things by the other hashes.
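Illustratively, the layout is (a sketch only, not the package's actual internal function):

# `hash` is the full sha256 hex digest, e.g. "cf1a63d7..."
store_path <- function(hash, dir = contentid::content_dir()) {
  file.path(dir, "data", substr(hash, 1, 2), substr(hash, 3, 4), hash)
}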
We could fix this a few different ways and I'm not sure what is best: […] resolve call. This would still be pretty hokey, obviously.

Also great point about the naming conventions for registries. Part of the trade-off here is in supporting multiple registries that share an API. For example,

contentid::register("https://www.carlboettiger.info",
                    registries = c("https://hash-archive.carlboettiger.info",
                                   "https://hash-archive.org"))
uses my self-hosted version of hash-archive as well. Similarly, one can register to multiple local tsv files, but maybe that use case is unmotivated. For DataONE, maybe the user wants to use a test server? In general I feel weird about hardwiring base URLs to APIs, but we probably need to think of a better model for extending the backends, since I agree the current thing is a bit hacky. Anyway, we can definitely make it just support simple names instead of URIs for now. Meanwhile, I imagine most users will just stick with the defaults, and not remember what list of services resolve
is searching anyhow.
Re replica nodes, yeah, I thought about that. But it looks like we are just getting back resolve
URLs, and not the download URLs of the replica nodes:
contentid::query_sources("hash://md5/e27c99a7f701dab97b7d09c467acf468", registries = "https://cn.dataone.org")
source
1 https://cn.dataone.org/cn/v2/resolve/ess-dive-0462dff585f94f8-20180716T160600643874
date
1 2021-01-21 04:56:04
iirc, that resolve URL is already automatically handling the task of selecting which replica node to actually serve the data from. I know we can get the actual contentURL list for each member node, but I also recall that requires some additional requests to your servers? Do you think we should do that, or stick with the resolve URL we currently get back?
In general, contentid
loves getting back a vector of URLs for the content, and the resolve
function will already iterate over those if for some reason the first request fails to return content with the matching hash.
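That logic is roughly the following (an illustrative sketch, not contentid's actual internals; it assumes the openssl package for hashing):

retrieve_any <- function(urls, expected_sha256) {
  # try each candidate URL; keep the first download whose sha256 matches
  for (u in urls) {
    tmp <- tempfile()
    ok <- tryCatch({
      utils::download.file(u, tmp, quiet = TRUE, mode = "wb")
      as.character(openssl::sha256(file(tmp, "rb"))) == expected_sha256
    }, error = function(e) FALSE)
    if (ok) return(tmp)
  }
  stop("no source returned content matching the requested hash")
}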
@mbjones oh, one (hopefully) small thing: do you think you could make the checksums display somehow on the search.dataone.org landing pages? To me that would be easier and probably more valuable than getting DataONE nodes to always report all hashes. Currently I have to query the SOLR API to figure out what hash was used on a given product.
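i.e., something along these lines just to learn which checksum one object carries (a sketch; the query endpoint and field names here follow my reading of the DataONE SOLR schema, so adjust if they differ):

id   <- "ess-dive-0462dff585f94f8-20180716T160600643874"
solr <- paste0("https://cn.dataone.org/cn/v2/query/solr/",
               "?q=identifier:%22", id, "%22",
               "&fl=checksum,checksumAlgorithm&wt=json")
jsonlite::fromJSON(solr)$response$docs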
@cboettig I put in a ticket to display checksums in our landing pages in https://github.com/NCEAS/metacatui/issues/1651
Regarding the resolve URI versus object URI in DataONE: that makes sense. In DataONE, we support three types of URIs:
1) Global, service-independent URIs for datasets and other objects
2) Resolve URIs, which can be opened with an Accept: text/xml header to list the current object locations, e.g.:
curl -H "Accept: text/xml" https://cn.dataone.org/cn/v2/resolve/ess-dive-457358fdc81d3a5-20180726T203952542
3) Object location URIs: the specific location on a repository from which the bytes of an object can be retrieved
So, right now you are getting (2), a resolve URI, back from the SOLR query. To get a list of the current location URIs for that, you would need to open that resolve URI with the text/xml Accept header, which will return a list of object URIs like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:objectLocationList xmlns:ns2="http://ns.dataone.org/service/types/v1">
<identifier>ess-dive-457358fdc81d3a5-20180726T203952542</identifier>
<objectLocation>
<nodeIdentifier>urn:node:ESS_DIVE</nodeIdentifier>
<baseURL>https://data.ess-dive.lbl.gov/catalog/d1/mn</baseURL>
<version>v1</version>
<version>v2</version>
<url>https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542</url>
</objectLocation>
<objectLocation>
<nodeIdentifier>urn:node:KNB</nodeIdentifier>
<baseURL>https://knb.ecoinformatics.org/knb/d1/mn</baseURL>
<version>v1</version>
<version>v2</version>
<url>https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542</url>
</objectLocation>
<objectLocation>
<nodeIdentifier>urn:node:UIC</nodeIdentifier>
<baseURL>https://dataone.lib.uic.edu/metacat/d1/mn</baseURL>
<version>v1</version>
<version>v2</version>
<url>https://dataone.lib.uic.edu/metacat/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542</url>
</objectLocation>
<objectLocation>
<nodeIdentifier>urn:node:mnORC1</nodeIdentifier>
<baseURL>https://mn-orc-1.dataone.org/knb/d1/mn</baseURL>
<version>v1</version>
<version>v2</version>
<url>https://mn-orc-1.dataone.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542</url>
</objectLocation>
</ns2:objectLocationList>
The idea is that the specific object URIs at which an object might be found are transient over time, and the resolution service keeps track of where they are currently located. So, it's fine to cache the resolve URI (2), as it will always give you a current list, but don't expect the object location URIs to be persistent. So, if contentid did (2), then it would get back a whole list of locations it could try.
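For instance, that location list could be fetched and parsed along these lines (a sketch using httr and xml2, outside of anything contentid currently does):

resp <- httr::GET(
  "https://cn.dataone.org/cn/v2/resolve/ess-dive-457358fdc81d3a5-20180726T203952542",
  httr::add_headers(Accept = "text/xml")
)
doc  <- xml2::read_xml(httr::content(resp, as = "text", encoding = "UTF-8"))
# pull out the <url> element of each <objectLocation>:
urls <- xml2::xml_text(xml2::xml_find_all(doc, "//objectLocation/url"))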
Thanks @mbjones. Just did those two suggestions you mentioned, so:
x <- resolve("hash://md5/2ac33190eab5a5c739bad29754532d76", registries = "dataone", store = TRUE)
should work as expected now.
Note: resolve (and similar functions) will now automatically expand the precise string dataone to the URI, but still accepts the old format of giving the URI, so this won't break old code and remains flexible to alternate URLs (e.g. for alternate hash-archive hosts).
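So, for example, both of these calls now resolve against the same registry:

x <- resolve("hash://md5/2ac33190eab5a5c739bad29754532d76", registries = "dataone")
x <- resolve("hash://md5/2ac33190eab5a5c739bad29754532d76", registries = "https://cn.dataone.org")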
You should be able to retrieve the cached copy using the md5 hash, e.g.
x <- resolve("hash://md5/2ac33190eab5a5c739bad29754532d76", store = TRUE)
Note: keep in mind that dataone is already a default registry, so users don't need to pass registries = "dataone". Also, doing so excludes all other registries from the list, in particular the default content_dir() used for the store, so you'd need to include the store path explicitly, resolve("hash://md5/2ac33190eab5a5c739bad29754532d76", registries = c("dataone", content_dir()), store = TRUE), or just leave it with the default.
This adds support to query the DataONE SOLR API for a hash and return the corresponding dataUrl if found.
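For example, querying by an md5 hash (this call and its output are taken from the testing later in this thread):

contentid::query_sources("hash://md5/e27c99a7f701dab97b7d09c467acf468",
                         registries = "https://cn.dataone.org")
#> source: https://cn.dataone.org/cn/v2/resolve/ess-dive-0462dff585f94f8-20180716T160600643874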
This adds the DataONE central node as a default registry, which means that any resolve() request (or query_sources() request) will ping the DataONE SOLR API for the hash in question. @mbjones please let me know if that's not a good default. Also let me know if you'd prefer we do this a different way; as you can see in the source code, we currently do this as a query for a matching checksum value and then extract the corresponding data URL.

Note that the registry can be set explicitly to dataone by providing that as the only registry in the list:
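# e.g., restricting resolution to the DataONE CN alone:
tmp <- contentid::resolve("hash://md5/e27c99a7f701dab97b7d09c467acf468",
                          registries = "https://cn.dataone.org")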
If we don't want to always hit DataONE by default, then this would provide an opt-in way to query it. (Or the user can configure all of their preferred defaults using the env var CONTENTID_REGISTRIES; see ?default_registries() for details.)

Addresses https://github.com/NCEAS/fairdataone/issues/2
Note: currently the DataONE search landing pages do not display content hash information, nor is it typically included in the EML. Consequently, it is clunkier than it needs to be to determine which hash DataONE is using for any given object in its collection. It would be convenient if the web interface could make this easier to discover without the SOLR API.