cboettig / contentid

:package: R package for working with Content Identifiers
http://cboettig.github.io/contentid

suggest to add reference to Elliott et al. 2020 #84

Open jhpoelen opened 2 years ago

jhpoelen commented 2 years ago

Great to see that you and Matt are working on a contentid paper, and thanks for mentioning Preston in:

https://github.com/cboettig/contentid/blob/665f0e9e8fb240d2629b21efc7c74ac8e83a11eb/paper/paper.Rmd#L768

I suggest citing:

MJ Elliott, JH Poelen, JAB Fortes (2020). Toward Reliable Biodiversity Dataset References. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2020.101132

by @mielliott for context and related work.

mbjones commented 2 years ago

Thanks @jhpoelen for the pointer. I hadn't caught that reference earlier. Just read it, nice paper. And I agree with the stance the paper takes and think we should cite it. Nevertheless, it also raised some points that would be nice to clarify. See below if you have interest.

TL;DR: I agree contentid approaches are great, and would improve identifier systems a lot. I'm not sure the paper accounted for the DataONE PID (Persistent Identifier) system and its use of a resolver service to find the ephemeral URI locations for persistently identified, immutable objects in the DataONE network.

Long ramble about DataONE Persistent Identifiers and Resolver service

I'd love to discuss some of the nuances of the assertions and conclusions in that paper. In DataONE, we worked hard to provide a means for location-independent identifiers, and to ensure that the identifier system in DataONE accommodates versioning and provenance relationships, even though many of the contributing repositories don't provide that information. As an aggregator, there is only a limited degree of influence over what the data providers do with the content they hold. Our recommended stance is that, regardless of the identifier system used, repositories should 1) mint a location-agnostic identifier (PID) for their content, which we treat as an opaque reference to a checksum-immutable object; 2) provide versioning and provenance relationships linking these immutable objects, and 3) replicate the objects across multiple repositories in the network for both backup and high availability. The DataONE resolver service then can provide the current locations for any given object in the network identified by a PID -- at no point should people rely upon historically cached URIs for those locations, as those are subject to rapid change. In other words, the service URIs for accessing DataONE registered objects are ephemeral, but the object identifiers are persistent. In your tests, did you use the resolver service, and did you try the multiple replica locations for a given object PID when determining whether it is available?

Another point concerns the concept of a "Dataset", in the "dcat:Dataset" sense. As in DCAT, in DataONE we use "Dataset" as synonymous with "Data Package", which represents an aggregation (in the ORE sense) of individually identified digital objects that together form a scientifically useful (and citable) collection. Most repositories provide DOI identifiers for these data packages, rather than at the individual object level. So, when you were looking at persistence in your study, were you looking at "Dataset" persistence? And how did you account for the idea that any given dataset might be composed of dozens (or hundreds of thousands) of individually identifiable digital objects, each with its own unique hash checksum and varying levels of persistence? In your paper, I did not understand what the contentid would be for a "Dataset" such as the one that is viewable on DataONE here: https://search.dataone.org/view/doi%3A10.15485%2F1842334 That dataset consists of many digital files, each with its own persistent identifier and checksum. The metadata file in that Dataset has the PID ess-dive-0d52dba18c3904f-20220125T193640771, and the DataONE resolver service shows that it is accessible in four locations in the network:

$ curl https://cn.dataone.org/cn/v2/resolve/ess-dive-0d52dba18c3904f-20220125T193640771
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:objectLocationList xmlns:ns2="http://ns.dataone.org/service/types/v1">
    <identifier>ess-dive-0d52dba18c3904f-20220125T193640771</identifier>
    <objectLocation>
        <nodeIdentifier>urn:node:CN</nodeIdentifier>
        <baseURL>https://cn.dataone.org/cn</baseURL>
        <version>v1</version>
        <version>v2</version>
        <url>https://cn.dataone.org/cn/v2/object/ess-dive-0d52dba18c3904f-20220125T193640771</url>
    </objectLocation>
    <objectLocation>
        <nodeIdentifier>urn:node:KNB</nodeIdentifier>
        <baseURL>https://knb.ecoinformatics.org/knb/d1/mn</baseURL>
        <version>v1</version>
        <version>v2</version>
        <url>https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-0d52dba18c3904f-20220125T193640771</url>
    </objectLocation>
    <objectLocation>
        <nodeIdentifier>urn:node:ESS_DIVE</nodeIdentifier>
        <baseURL>https://data.ess-dive.lbl.gov/catalog/d1/mn</baseURL>
        <version>v1</version>
        <version>v2</version>
        <url>https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-0d52dba18c3904f-20220125T193640771</url>
    </objectLocation>
    <objectLocation>
        <nodeIdentifier>urn:node:UIC</nodeIdentifier>
        <baseURL>https://dataone.lib.uic.edu/metacat/d1/mn</baseURL>
        <version>v1</version>
        <version>v2</version>
        <url>https://dataone.lib.uic.edu/metacat/d1/mn/v2/object/ess-dive-0d52dba18c3904f-20220125T193640771</url>
    </objectLocation>
</ns2:objectLocationList>

In contrast, the CSV data object in that data package has the PID ess-dive-9ffb26bf9b0a2a7-20220112T000129646599 and is accessible in two locations in the network:

$ curl https://cn.dataone.org/cn/v2/resolve/ess-dive-9ffb26bf9b0a2a7-20220112T000129646599
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:objectLocationList xmlns:ns2="http://ns.dataone.org/service/types/v1">
    <identifier>ess-dive-9ffb26bf9b0a2a7-20220112T000129646599</identifier>
    <objectLocation>
        <nodeIdentifier>urn:node:KNB</nodeIdentifier>
        <baseURL>https://knb.ecoinformatics.org/knb/d1/mn</baseURL>
        <version>v1</version>
        <version>v2</version>
        <url>https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-9ffb26bf9b0a2a7-20220112T000129646599</url>
    </objectLocation>
    <objectLocation>
        <nodeIdentifier>urn:node:ESS_DIVE</nodeIdentifier>
        <baseURL>https://data.ess-dive.lbl.gov/catalog/d1/mn</baseURL>
        <version>v1</version>
        <version>v2</version>
        <url>https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-9ffb26bf9b0a2a7-20220112T000129646599</url>
    </objectLocation>
</ns2:objectLocationList>

Did you check all of these locations when you were assessing reliability of the PIDs? And did you account for the fact that, from day-to-day, any of the service URLs are ephemeral in DataONE, but the PIDs are persistent? And how does this affect the stability percentages you reported?
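The availability check described above could be sketched in R against the resolver output shown earlier. This is an illustrative sketch, not part of the study's actual methodology; it assumes the object is publicly readable and uses the PID from the resolver example in this thread.

```r
library(httr)   # HTTP client
library(xml2)   # XML parsing

# Resolve a PID to all of its current replica locations via the CN
# resolver service, then count the object as available if *any* replica
# responds with HTTP 200 -- rather than testing a single cached URL.
pid  <- "ess-dive-0d52dba18c3904f-20220125T193640771"
resp <- GET(paste0("https://cn.dataone.org/cn/v2/resolve/",
                   URLencode(pid, reserved = TRUE)))
locs <- xml_text(xml_find_all(read_xml(content(resp, "text")), "//url"))

# Try each replica location in turn with a HEAD request.
codes     <- vapply(locs, function(u) status_code(HEAD(u)), integer(1))
available <- any(codes == 200)
```

Because the service URLs are ephemeral but the PID is not, an availability study that re-resolves on each check should see higher persistence than one that replays historically cached URLs.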

I would argue that the best form for these PIDs would be as contentid hashes (rather than UUIDs or any of the other formats typically in use in the network). In that case, I think the DataONE network closely matches the design of hash-archive and similar systems. I'm fully aligned with your conclusions about the utility of contentids as one of the best formats for persistent identifiers of digital objects, but I think there are still barriers to deploying them at scale across a highly heterogeneous network of repositories that have widely varying views on the utility of immutability. But I'm looking forward to trying.
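For reference, the hash-as-PID workflow being discussed is essentially what the contentid package exposes: hash the bytes, register one or more locations for that hash, and later resolve by hash alone. A minimal sketch (the registered URL below is hypothetical, used only to illustrate the call):

```r
library(contentid)

# Compute a content identifier (a hash URI) for a file; the package ships
# a small example dataset we can use here.
f  <- system.file("extdata", "vostok.icecore.co2", package = "contentid")
id <- content_id(f)   # "hash://sha256/9412325831..."

# Register a URL currently serving those bytes; multiple locations can be
# registered for the same hash over time, much like DataONE replicas.
register("https://example.org/data/vostok.icecore.co2")  # hypothetical URL

# Later, resolve by hash to whichever registered source still serves
# bytes that verify against the identifier.
path <- resolve(id)
```

The key property is that the identifier never changes when locations do, so a DataONE-style resolver and a contentid registry play the same role.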

cboettig commented 2 years ago

Thanks @jhpoelen for pointing us to this! Really nice paper.

As I think you both know, I ran a similar experiment on 4,047,485 of the DataONE PIDs returned by the DataONE API (I didn't get around to the largest objects) in this R script. Here's the resulting dataone.tsv. From this I see:

library(readr)
library(dplyr)

d <- read_csv("https://minio.thelio.carlboettiger.info/shared-data/dataone-hashes.tsv")
d %>% count(status)
#> # A tibble: 2 × 2
#>   status       n
#>    <int>   <int>
#> 1    200 3644732
#> 2    404  402753
d %>% summarise(missing = mean(is.na(sha256)))
#> # A tibble: 1 × 1
#>  missing
#>    <dbl>
#> 1  0.0995
d2 <- d %>% mutate(domain = urltools::domain(source))
d2 %>% filter(is.na(sha256)) %>% count(domain, sort=TRUE)
#> # A tibble: 28 × 2
#>    domain                        n
#>    <chr>                     <int>
#>  1 datadryad.org            289817
#>  2 dataone.tdar.org          69145
#>  3 usgs.ornl.gov             21082
#>  4 cn.dataone.org             8009
#>  5 mn-unm-1.dataone.org       3278
#>  6 dataone-prod.pop.umn.edu   3054
#>  7 arcticdata.io              1803
#>  8 mn-orc-1.dataone.org       1641
#>  9 gstore.unm.edu             1065
#> 10 mn-ucsb-1.dataone.org       965
#> # … with 18 more rows

Created on 2022-03-12 by the reprex package (v2.0.1)

As you can see, just shy of 10% of the 4 million PIDs couldn't be retrieved to compute a hash. The majority of those come from datadryad.org, which, as @mbjones already told me, was a known issue at the time.

I recall a variety of issues accounting for the other PIDs for which I could not resolve content, ranging from outdated HTTPS certs to some old servers (e.g. running HTTP/0.9, which recent versions of the curl libraries will only talk to with a special opt-in flag); I reported most of the ones I could identify to DataONE at the time. I had a few restarts due to server/bandwidth issues, but I wasn't systematic about retrying failed resolutions that might have been due to stochastic network timeouts etc.

I think the URLs recorded in my tsv are all of the format metacat/d1/mn/v2/object/<PID>. Some of my hashes may not correspond to the intended data: e.g. some 358 records correspond to a 0-byte csv, such as https://gmn.lternet.edu/mn/v2/object/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fdata%2Feml%2Fknb-lter-ntl%2F4%2F8%2Fdaily_raft. Only about 2.25 million of the 3.6 million identifiers in my table are unique. I computed md5, sha1, and sha256 in my table, but I didn't attempt to compare them to the checksums on file.
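That last comparison could in principle be done against the checksum DataONE records in each object's system metadata. A hedged sketch (assumes public objects, the CN `/v2/meta/{pid}` route, and that the node recorded one of the three algorithms handled below; the PID is reused from the resolver example earlier in this thread):

```r
library(httr)
library(xml2)
library(openssl)

# Fetch the system metadata for a PID and extract the recorded checksum
# and its algorithm.
pid  <- "ess-dive-9ffb26bf9b0a2a7-20220112T000129646599"
meta <- read_xml(content(GET(paste0("https://cn.dataone.org/cn/v2/meta/",
                                    URLencode(pid, reserved = TRUE))), "text"))
node     <- xml_find_first(meta, "//checksum")
algo     <- xml_attr(node, "algorithm")   # e.g. "MD5" or "SHA-256"
recorded <- tolower(xml_text(node))

# Download the object bytes and hash them with the same algorithm.
bytes    <- content(GET(paste0("https://cn.dataone.org/cn/v2/object/",
                               URLencode(pid, reserved = TRUE))), "raw")
computed <- switch(tolower(gsub("-", "", algo)),
                   md5    = md5(bytes),
                   sha1   = sha1(bytes),
                   sha256 = sha256(bytes))

# TRUE if the bytes we retrieved match what the repository registered.
identical(as.character(computed), recorded)
```

A mismatch here would distinguish "URL serves wrong/empty bytes" (like the 0-byte csv cases above) from a genuine retrieval failure.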