cboettig / contentid

:package: R package for working with Content Identifiers
http://cboettig.github.io/contentid

Add `cite()` function? #69

Open cboettig opened 3 years ago

cboettig commented 3 years ago

Now that we can resolve content identifiers to copies in data repositories such as the DataONE network or Zenodo, it might make sense to also be able to report citation information from those repositories.

With Zenodo, a hash query always matches the record containing the file, and the DOI is readily available (the doi_url field), as is the remaining citation metadata. With DataONE, not all objects have a DOI, and tracking from a PID to the related DOI is less obvious, but all of the information is still present there somewhere.
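A rough sketch of the Zenodo case (a minimal sketch, not a tested client: the files.checksum query field and the shape of the response are assumptions based on how contentid resolves Zenodo sources):

library(httr)

# Hypothetical helper: search the Zenodo records API for a file checksum and
# read the doi_url field off each matching record. Field names are assumed.
zenodo_dois <- function(md5) {
  resp <- GET("https://zenodo.org/api/records",
              query = list(q = sprintf('files.checksum:"md5:%s"', md5)))
  hits <- content(resp, as = "parsed")$hits$hits
  vapply(hits, function(h) h$doi_url, character(1))
}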

Because the same file can be part of multiple packages (very common in 'versioned series', but also in other cases), an object may have more than one DOI. In a versioned series, it would make sense to cite the most recent version in the series containing the exact match. In other cases, it may make more sense to merely return all possible citations and leave resolution to the user.

Unclear what citation should be returned for content that only resolves to other sources (URLs from hash-archive records, Software Heritage archives, etc.). Some sources (e.g. Software Heritage) are arguably still 'citable', though ideally we would resolve to the parent DOI if the content is also found on an archive. For content resolvable by hash-archive alone, we could arguably check the other algorithm ids to see if the content is also on DataONE; i.e. the sha256 id resolves only to a URL from hash-archive, but from that record we get the md5 and discover that it resolves on DataONE. (Also, many URLs on hash-archive will already point to download URLs from DataONE or Zenodo.) A sketch of that lookup follows.
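A hedged sketch of that cross-algorithm lookup (treat as pseudocode; the cols argument follows the query_sources() call used later in this thread, and md5-based hash URIs are assumed to be queryable):

library(contentid)
library(dplyr)

# If a sha256 id resolves only to plain URLs, the registry rows may still
# report the same file's md5, which we can then query in its own right.
cross_sources <- function(id) {
  hits <- query_sources(id, cols = c("identifier", "source", "md5", "sha256"))
  md5s <- unique(na.omit(hits$md5))
  extra <- lapply(paste0("hash://md5/", md5s), query_sources,
                  cols = c("identifier", "source", "md5", "sha256"))
  bind_rows(hits, extra)
}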

Concerns:

Idea for cite() suggested originally by @mbjones.

mbjones commented 3 years ago

I worked on a proof of concept for a cite()-like method for DataONE-related identifiers, to understand a bit better what is involved. The challenges were:

1) Identifying which datasets came from DataONE based on the data access URI alone, as returned from query_sources()
2) Determining which metadata document is the most appropriate for a given contentid
3) Walking the version chain of each of those efficiently
4) Getting bibliographic metadata in the right form for constructing a citation
5) Flexible formatting of citations, following CSL and BibTeX, for example

The following function works partially for DataONE. It would be better implemented as an adapter for each known registry that knows how to create the citation given the repository-specific APIs available.

library(contentid)  # query_sources()
library(dataone)    # CNode(), listNodes()
library(solrium)    # SolrClient
library(dplyr)      # %>% and filter()

# Function to return the text of a citation for a given contentid
get_citation <- function(contentid) {

    # Set up for SOLR queries
    solr <- SolrClient$new(host = "cn.dataone.org", path = "/cn/v2/query/solr/", scheme = "https", port=443)

    # Use query_sources to determine if a copy is on DataONE or amenable repository (based on the URL pattern)
    d1_locations <- query_sources(contentid, cols=c("identifier", "source", "date", "status", "sha1", "sha256", "md5")) %>%
        filter(grepl('cn.dataone.org|v2/object|v2/resolve', source))

    if (nrow(d1_locations) > 0) {
        # Query that network API to determine which Datasets the content identifier is associated with
        # If more than one is found, reduce the list to the most recent version of each Dataset 
        # (eliminating duplicate older versions, in favor of citing the most recent)

        # Look up the metadata for this object, including which metadata documents describe it in DataONE
        pids <- d1_locations$source %>% basename() %>% unique()
        subquery_pids <- stringr::str_replace_all(stringr::str_flatten(pids, collapse=" OR "), ":", "\\\\:")
        subquery <- paste0("id:(", subquery_pids, ") AND -obsoletedBy:*")
        fields <- 'identifier,checksum,checksumAlgorithm,datasource,isDocumentedBy,resourceMap'
        metadata <- solr$search(params = list(q=subquery, rows=100, fl=fields))

        # Retrieve the bibliographic metadata for each of those datasets by searching on the PID for each
        documented_by <- stringr::str_split(metadata$isDocumentedBy, ",")
        subquery_ids <- stringr::str_replace_all(stringr::str_flatten(documented_by[[1]], collapse=" OR "), ":", "\\\\:")
        subquery <- paste0("id:(", subquery_ids, ")")
        fields <- paste(sep=",", 'origin,identifier,formatId,checksum,checksumAlgorithm,title,datePublished,pubDate',
                                 'datasource,obsoletes,obsoletedBy,isDocumentedBy,resourceMap')
        datasets <- solr$search(params = list(q=subquery, rows=100, fl=fields))

        # Determine the repository name and URI
        repos <- listNodes(CNode())
        repo_list <- repos[sapply(repos, function(repo) { repo@identifier==datasets$datasource[[1]]}, simplify = TRUE ) ]

        # Return a list of citations, one for each matching Dataset (possibly providing different formats 
        # for returning the citation information (text string, bibtex, CSL-formatted string))
        # Currently this only returns the first result as a POC, needs work
        citation_text <- paste0(datasets$origin[[1]], ". ", format(as.Date(datasets$pubDate[[1]]), "%Y"), ". ", 
                               datasets$title[[1]], ". ", repo_list[[1]]@name, ". ", datasets$identifier[[1]], " ",
                               paste0("https://search.dataone.org/view/", datasets$identifier[[1]]))
        return(citation_text)
    } else {
        return(list())
    }
}
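
For the formatting step (challenge 5 above), base R already gets part of the way there: utils::bibentry() can render a citation as plain text or BibTeX (CSL would need an additional package). A minimal sketch with illustrative field values, not output from a real query:

library(utils)

# Build a citation object from fields like those returned by the Solr queries
entry <- bibentry(
  bibtype = "Misc",
  title   = "High Accuracy 14C Measurements for Atmospheric CO2 Samples ...",
  author  = person("H.A.J.", "Meijer"),
  year    = "2006",
  note    = "ESS-DIVE",
  url     = "https://search.dataone.org/view/ess-dive-8c0779e4f3ed341-20180716T234812410"
)
format(entry, style = "text")   # human-readable citation string
toBibtex(entry)                 # BibTeX rendering of the same entry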
jhpoelen commented 3 years ago

@mbjones very neat! This issue reminds me of an idea documented in https://github.com/bio-guoda/preston/issues/42 . Having an easy citation interface is a bridge from the content-universe into the traditional (location-based) publication space (e.g., scientific papers and other scholarly communication).

I noticed that you are making queries to some Solr web APIs.

How would you construct a citation offline or when (in the far far future) the DataONE Solr web services are no longer around?

jhpoelen commented 3 years ago

@mbjones Just curious - would you happen to have a worked-out example for the citation code? Starting from a content id, listing all the information returned from Solr, and ending with the resulting citation string(s).

I've been keeping track of DataONE metadata for quite some time now, so I am curious whether I can dig out the information from these versioned archives.

mbjones commented 3 years ago

Hey @jhpoelen, great questions. I agree with your point about this being a bridge between the traditional authority-based identifier world of DOIs, and the content-based world of identifiers. Getting cite() to work "offline", independently of the DataONE API or other repository APIs, would more or less require having offline access to the relevant metadata. The SOLR APIs in DataONE basically provide convenient, indexed access to the Entity to Dataset mapping, the version relationships among those data entities and datasets, and derived data relationships, all of which is needed to make a decision on which Dataset(s) to cite for a given Entity.

So, to take all of this offline, we'd need a good offline representation of the complex web links between all of the data entities and versioned datasets in DataONE. That is a fairly large dataset, so having indexed, searchable access to it is key to efficiently exploring the links. Thus SOLR.

In terms of a worked example: if you step through the function I provided above, calling get_citation() with the vostok hash (hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37) that @cboettig frequently uses, it will produce the following output:

> get_citation("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37")
[1] "H.A.J. Meijer,M.H. Pertuisot,J. van der Plicht. 2006. High Accuracy 14C Measurements for Atmospheric CO2 Samples from the South Pole and Point Barrow, Alaska by Accelerator Mass Spectrometry. ESS-DIVE. ess-dive-8c0779e4f3ed341-20180716T234812410 https://search.dataone.org/view/ess-dive-8c0779e4f3ed341-20180716T234812410"

You can see the intermediate dataframes that are produced by just stepping through the function. Also, note that the actual logic of this function is incomplete and brittle, and will not produce the correct citation in many cases -- it does not fully walk the version relationships and is just a proof of concept at this point.

cboettig commented 3 years ago

@mbjones Thanks!

Either as a technical aside or as an illustration of why this feature is needed/difficult: regarding the citation for my favorite Vostok ice core there, are we sure that is the correct citation for this data? I tend to cite https://doi.org/10.3334/CDIAC/ATG.009 (Barnola et al. (2003)) instead.

As described in the abstract of Barnola et al. (2003), that data of course has a rich provenance history. I believe it corresponds to a relatively light update of the data described in Pépin et al. (2001) (https://doi.org/10.1029/2001JD900117), which the paper says was also deposited on CDIAC, though I can only find Petit et al. (2000) https://doi.org/10.3334/CDIAC/CLI.006; and of course most of the data were originally reported in Petit et al. 1997 & 1999, both in Nature (e.g. https://doi.org/10.1038/20859).

According to CrossRef, Petit et al (1999) has been cited 3901 times. According to CDIAC, the Barnola dataset has been cited exactly once.

jhpoelen commented 3 years ago

@cboettig interesting!

You are describing claims of provenance. Can you please be a little more specific and use content based identifiers to describe which datasets you are referring to?

jhpoelen commented 3 years ago

@mbjones would you say that DataONE made a claim of provenance relating to the specific dataset in their metadata records?

cboettig commented 3 years ago

@jhpoelen Precisely! The dataset in question is described by the content identifier @mbjones already uses, hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37.

The citation shown by get_citation() above returns a link to this URL: https://search.dataone.org/view/ess-dive-8c0779e4f3ed341-20180716T234812410 (which apparently has no associated DOI).

My first contention is that this is the wrong ESS-DIVE entry for the content in question; I would have thought get_citation() would point to https://data.ess-dive.lbl.gov/view/doi:10.3334/CDIAC/ATG.009 (which also happens to have a DOI).

It is easy to verify that the second ESS-DIVE dataset landing page contains a download link (https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542) which still corresponds to the data in question, hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37, while the first does not. However, I think it would be too hasty to use that as the criterion for deciding which of these two ESS-DIVE entries is the "correct" citation. In other cases, both may contain the object, and the "correct citation" may not contain a download link to the object at all. I think we all agree that, in principle, it is better to refer to explicit metadata declarations of the 'citation' than to guess by download link, which is what I believe Matt's code does -- however, in this case, it appears to me that the database metadata declaration is wrong, or at the very least confusing.
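A quick way to check that claim (content_id() hashes the bytes behind a file or URL; per the claim above, this should print the vostok identifier):

library(contentid)

url <- "https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542"
content_id(url)
# expected, if the claim holds:
# "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"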

Personally, I don't feel we have a working definition of "correct citation" in the first place. My point in quoting citation statistics was to suggest that many researchers who have almost surely re-used this famous dataset (including those who created it!) consistently cite Petit 1999 https://doi.org/10.1038/20859, and not the ESS-DIVE data entry https://doi.org/10.3334/CDIAC/ATG.009. I hesitate to suggest that this is 'wrong', only that "citation" is not as semantically precise a notion as we might want it to be.

@jhpoelen did that answer the question? I know we could use content-identifiers to refer to the many URLs I mentioned, which correspond to "html landing pages", "DOI redirects", and "download URLs"; we could also use them to describe PDF files of manuscripts being cited, or the EML metadata files rather than the HTML landing pages. However, I'm not sure if that's what you had in mind, and I am not sure that's a good idea either. (It might be an okay way of describing the web architecture involved, but I don't think it gives much clarity to the discussion of citation -- which might be better thought of as an 'abstract concept' rather than a particular set of 'content'.)

Maybe I should not yet have brought up the additional provenance issues until we resolve this technical issue of the 'canonical citation', but I think they are inter-related complications as well.

I completely see the value proposition in cite(), but I remain a bit skeptical of just how easy it is to solve computationally. Describing citations means going beyond "content" to the vast world of abstract concepts associated with content, and all the fuzziness that involves. Researchers use the idea of "citation" inconsistently and imprecisely to accomplish a broad range of objectives, including many tasks that may be only loosely correlated with actual use. I am reluctant to propose an algorithmic solution to an issue that not only traditionally relies on fuzzy, subjective human judgement calls, but may always require them.

mbjones commented 3 years ago

@cboettig That all resonates.

As I said, my implementation is neither complete nor correct -- it was a proof of concept that I could walk from some contentid to some authority-based data package on DataONE (and only DataONE). I am definitely not asserting that it is the correct citation. I would argue that it is a correct citation, in that the package identified does in fact contain a copy of the vostok file with that hash. Interestingly, the revised version of that package does not contain that file; it's only in the older version. In addition, the ESS-DIVE dataset at https://search.dataone.org/view/ess-dive-41e80536101cd69-20180726T183604969024 contains a copy of the vostok file as well, but my implementation only printed out the first of the associated citations -- as my code comment indicates, I think we should return all of them, not just the first. I also note that the DOI that ESS-DIVE assigned is not used as the main authority identifier in DataONE, so we'd need to know to use the "Alternative Identifier" field from EML to get the DOI. There's a lot of variability in how repos attach DOIs to packages.
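A small sketch of that "return all of them" variant (a drop-in replacement for the citation_text block inside get_citation() above, building one citation per matching dataset row):

# One citation string per matching dataset, rather than only the first row
citations <- vapply(seq_len(nrow(datasets)), function(i) {
  paste0(datasets$origin[[i]], ". ",
         format(as.Date(datasets$pubDate[[i]]), "%Y"), ". ",
         datasets$title[[i]], ". ",
         "https://search.dataone.org/view/", datasets$identifier[[i]])
}, character(1))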

Regarding "Petit 1999 https://doi.org/10.1038/20859": that is a paper, not a dataset per se. So while that might have been the preferred "citation" for the dataset -- and Nature even has a 61-page supplement file with a very poor rendering of the data, polluted by page numbers and Nature logos -- it is not present on DataONE, and so my function would not have picked it up. I would argue that the PDF wouldn't have the same hash anyway, so the contentid we searched for is not actually present on the Nature site; a different version of the data is there.

Currently, I think there is no universal rule for what is the canonical citation for a data file with a given hash. But a list of citations to datasets that contain that hash is still useful.

mbjones commented 3 years ago

@jhpoelen wrote:

@mbjones would you say that DataONE made a claim of provenance relating to the specific dataset in their metadata records?

I would argue that the ESS-DIVE repository published some metadata and data on their repository, and replicated the metadata to the DataONE site. DataONE redistributed the information provided by ESS-DIVE. I don't know what kind of provenance claim that represents.

mbjones commented 3 years ago

Thinking about this a bit further, I think the following provenance statements (among others) would be reasonable to make about the relationships between the two datasets containing vostok data that are registered in DataONE, and their relationship to the Petit 1999 paper:

@prefix schema: <http://schema.org/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix dataone: <https://dataone.org/datasets/> .
@prefix doi: <https://doi.org/> .

dataone:ess-dive-41e80536101cd69-20180726T183604969024 a schema:Dataset ;
    schema:identifier "ess-dive-41e80536101cd69-20180726T183604969024" ;
    schema:identifier doi:10.3334/CDIAC/ATG.009 ;
    prov:wasDerivedFrom doi:10.1038/20859 .

dataone:ess-dive-8c0779e4f3ed341-20180716T234812410 a schema:Dataset ;
    schema:identifier "ess-dive-8c0779e4f3ed341-20180716T234812410" ;
    prov:wasDerivedFrom doi:10.1038/20859 .

doi:10.1038/20859 a schema:CreativeWork ;
    schema:identifier doi:10.1038/20859 .
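
These statements could also be consumed directly in R; a minimal sketch using the rdflib package (full IRIs are used here to sidestep Turtle's restrictions on '/' inside prefixed local names; the triple shown is abbreviated from the block above):

library(rdflib)

ttl <- '
@prefix schema: <http://schema.org/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
<https://dataone.org/datasets/ess-dive-8c0779e4f3ed341-20180716T234812410>
    a schema:Dataset ;
    prov:wasDerivedFrom <https://doi.org/10.1038/20859> .
'
f <- tempfile(fileext = ".ttl")
writeLines(ttl, f)
rdf <- rdf_parse(f, format = "turtle")
# which datasets derive from the Petit 1999 paper?
rdf_query(rdf, "SELECT ?d WHERE { ?d <http://www.w3.org/ns/prov#wasDerivedFrom> ?paper }")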
cboettig commented 3 years ago

@mbjones cool, I like this! The prov statements you assert above make sense to me! Curious what @jhpoelen thinks of that approach?

Also, as you point out, there are other statements that could be made as well (other papers; the fact that the vostok.co2 file in question, hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37, is contained in both datasets; etc.). In this particular example, the relationship between the two schema:Datasets seems particularly murky, though that probably isn't entirely atypical of other real-world cases. (It's also interesting that the ess-dive-8c0 one contains vostok.co2.old as well -- another implicit prov relationship.)

I do worry about programmatic assertions of what is "a citation" for some content identifier somehow becoming normative; researchers may too easily ignore the "among others" part if we don't provide them.

On the other hand, one take-away from this might be that it is more satisfactory for cite() to return these richer provenance assertions in RDF than to give a nicely formatted citation. This gets us back to Jorrit's question about offline access, and to our long-standing issue about contentid's internal storage model, https://github.com/cboettig/contentid/issues/5.

To me, handling citations as just one more form of assertion we make about our content, using schema / DCAT2 RDF, feels appealing; it's more flexible/expressive than attempting to associate each content hash with a single bibtex entry, and it captures other metadata / provenance use cases that are equally compelling.

However, I'm reluctant to bolt a full prov-triplestore tool onto contentid just yet. I have a very crude prototype over at https://github.com/cboettig/prov, mostly aimed so far at constructing DCAT2 / schema:Dataset annotations, but it could maybe become a tool for consuming them too. All the same, trying to add an RDF-based storage model inside of contentid feels inverted to me (i.e. prov would become a dependency of contentid one day). contentid is a small, simple package (though still not as streamlined as it ought to be!) and could easily be used by some larger piece of software that provides an RDF-based backend for managing citations and any other metadata you want associated with your content identifiers. But I'm reluctant to lock a mechanism for managing citations into contentid.