Direct access to content associated with a DOI

mfenner commented 6 years ago

As a user, I want to be able to download the content associated with a DOI without first going to a landing page, so that I can quickly download a large number of datasets.

DEV NOTES

Content negotiation and Link headers would be the way to do this, but they have never been widely adopted.
Add a "content URL" field to DataCite URL registration, in addition to the "URL" field that we already have.
This only scales with a standard packaging format, and the best candidate in terms of functionality and adoption might be BagIt.
Recommend providing the BagIt item with the "application/zip" content type.
Register it as content type "application/zip", expressed as a URL such as https://data.datacite.org/application/zip/10.x/xyz.
Expose all DOIs in a sitemap file, optionally broken down into individual sitemap files for each data center.
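
To illustrate the URL pattern from the notes above, here is a minimal Python sketch that builds a direct-download URL from a DOI and a media type. The base URL and path layout are taken from the example in the notes; the function name `content_url` is hypothetical, and whether the service accepts exactly this layout for every DOI is an assumption.

```python
from urllib.parse import quote

# Base URL taken from the dev notes; the exact path layout the service
# accepts is an assumption.
CONTENT_BASE = "https://data.datacite.org"


def content_url(doi: str, media_type: str = "application/zip") -> str:
    """Build a URL of the form <base>/<media type>/<doi> (hypothetical helper)."""
    doi = doi.strip()
    # Accept DOIs written as "doi:10.x/xyz" or as a doi.org URL.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Keep "/" unescaped so the DOI prefix/suffix separator stays in the path.
    return f"{CONTENT_BASE}/{media_type}/{quote(doi, safe='/')}"


print(content_url("10.x/xyz"))
```

A bulk downloader could map a list of DOIs through this function and fetch each resulting URL without touching the landing pages.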

mfenner commented 6 years ago

Possibly relevant: https://github.com/UTS-eResearch/datacrate by @ptsefton.

max-mapper commented 6 years ago

Data.gov does this using the 'resources' field of Project Open Data, which includes a 'downloadURL' https://project-open-data.cio.gov/v1.1/schema/#distribution-downloadURL. They specifically distinguish downloadURL from accessURL which I think is very helpful.

mfenner commented 6 years ago

Thanks @maxogden. Schema.org does something similar with http://schema.org/url and http://schema.org/contentUrl - the latter would need an update of the documentation, as it is relevant not only to media objects.

cameronneylon commented 6 years ago

As noted in an email from @mfenner, it could be good to think about this from the (asymmetrical) views of the packager as well as the consumer. The question of who does the packaging, and what their motivation is, may be quite important for the quality and completeness of the metadata that can be provided.

As noted above, the work on datacrate is interesting in this space. I wrote up on the blog some experiences of finding it challenging to package things (http://cameronneylon.net/blog/walking-the-walk-how-easily-can-a-whole-project-be-shared-and-connected/ http://cameronneylon.net/blog/packaging-data-the-core-problem-in-general-data-sharing/).

There are some good opportunities to think about how this might fit into a general workflow, with provenance and metadata being created as the researcher goes.

sckott commented 6 years ago

@mfenner is there any scope here to include supplementary files associated with journal articles? Or only works that are datasets themselves?

mfenner commented 6 years ago

I think the focus is on DOIs for datasets, but the same process should work for other content types. Supplementary files would be good, and I can ask Figshare whether they are interested.

mfenner commented 6 years ago

PLOS is a special case, as they make heavy use of Crossref component DOIs for figures, tables and supplementary files. Will ask Crossref for advice.

dojobo commented 6 years ago

Perhaps sort of relevant, at European Southern Observatory we are using a Link header with rel="alternate" on our landing pages. (So far this is for a single resource URL for the DOI... I suppose multiple rel="alternate" Link headers are allowed?) This came from those FORCE11 recommendations I think. Besides being machine-readable, you can also fetch it with a HEAD request. (We use this internally to monitor, with a cron job, that landing pages and resource URLs resolve with a 200.)

An example: https://doi.org/10.18727/0722-6691/5053
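
A rough Python sketch of how a client might read such a Link header follows. The parser is deliberately simple (it assumes parameter values contain no commas or semicolons), and the function names and the example.org URLs are made up for illustration. `fetch_link_header` sends a HEAD request as described above; it is not called here, so the example runs offline.

```python
import re
import urllib.request

# One link-value: <target> followed by its ";"-separated parameters.
# Simplification: parameter values must not contain "," or ";".
_LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)')


def parse_link_header(value: str) -> list[dict]:
    """Split a Link header value into dicts with 'url' plus its parameters."""
    links = []
    for match in _LINK_RE.finditer(value):
        params = {}
        for part in match.group(2).split(";"):
            if "=" in part:
                key, _, val = part.partition("=")
                params[key.strip().lower()] = val.strip().strip('"')
        links.append({**params, "url": match.group(1)})
    return links


def urls_with_rel(value: str, rel: str) -> list[str]:
    """Return the targets whose rel parameter includes the given relation."""
    return [link["url"] for link in parse_link_header(value)
            if rel in link.get("rel", "").split()]


def fetch_link_header(url: str, timeout: float = 10.0) -> str:
    """HEAD the URL (following redirects) and join any Link headers found."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return ", ".join(response.headers.get_all("Link") or [])


# Made-up header value for illustration only.
sample = ('<https://example.org/file.pdf>; rel="alternate"; type="application/pdf", '
          '<https://example.org/other>; rel="item"')
print(urls_with_rel(sample, "alternate"))
```

Since the parser returns every matching target, several rel="alternate" links in one response are handled the same way as a single one.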

mfenner commented 6 years ago

Thanks @dojobo. I think this aligns well with the recommendations at http://signposting.org/.

eocarragain commented 6 years ago

Also note the citation_pdf_url convention used by crawlers like Google Scholar, OADOI/Unpaywall, and CORE to harvest actual "data"/PDFs from publishers and institutional repositories. So, for example, a crawler may follow a Handle or Crossref DOI to an HTML landing page and then look for a meta element in the HTML header to identify the primary bitstream associated with the resource:

<meta content="https://my.repo.org/bitstreams/9999/mypaper.pdf" name="citation_pdf_url" />
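
As a sketch of what such a crawler does once it has the landing page HTML, the following uses Python's standard `html.parser` to collect every citation_pdf_url value. The class and function names are hypothetical, and fetching the page is left out.

```python
from html.parser import HTMLParser


class CitationPdfParser(HTMLParser):
    """Collect the content of every <meta name="citation_pdf_url"> element."""

    def __init__(self):
        super().__init__()
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name == "citation_pdf_url" and attributes.get("content"):
            self.pdf_urls.append(attributes["content"])


def find_citation_pdf_urls(html: str) -> list[str]:
    parser = CitationPdfParser()
    parser.feed(html)
    return parser.pdf_urls


page = ('<html><head><meta content="https://my.repo.org/bitstreams/9999/mypaper.pdf" '
        'name="citation_pdf_url" /></head><body></body></html>')
print(find_citation_pdf_urls(page))
```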

signposting.org is a much better reference point, but this is worth noting for completeness' sake.

Google Scholar: https://scholar.google.com/intl/en/scholar/inclusion.html#indexing

mfenner commented 6 years ago

In schema.org the relevant attribute would be contentUrl.

eocarragain commented 6 years ago

See also draft recommendation of the RDA PID Kernel Information WG (https://www.rd-alliance.org/groups/pid-kernel-information-wg): https://docs.google.com/document/d/1EdS5OCoEWd4VY0HNLHkhzdQojsRgc3P8aWXYQKTqs8M/edit

Note especially:

digitalObjectLocation - "Pointer to the content object location (pointer to the DO). This may be in addition to a pointer to a human-readable landing page for the object"
etag - "Checksum of object contents. Checksum format determined via attribute type referenced in a Kernel Information record."

mfenner commented 6 years ago

@eocarragain thanks, this aligns with our thinking, and with the work we are doing on this in the NIH Data Commons.

eocarragain commented 6 years ago

It is also worth keeping content-addressed protocols like ipfs.io (and even things like magnet links) in mind. The main thing is to avoid a recommendation which rules these protocols out, as DOI/DataCite could potentially be a nice bridge to these p2p, content-addressed networks, in that it provides a trusted mutable record which can point to an immutable content address. Having said that, I don't see anything above which would rule them out, as long as the content-addressed protocol can be expressed as a URL scheme.

For IPFS, the DataCite contentURL could point to a single file or a huge directory of content using something like ipfs://babybeiccrv3uc3hjipdnwf4nnntbxuwvt4pn5dsgelvvyueucracbevtha . The content can then be retrieved from any ipfs node which has some or all of the content, and since the "content identifier" string contains the hash of the entire content, verification is built in as part of the protocol (links to some relevant specs: CIDs, Multiformats).

eocarragain commented 6 years ago

More generally, to what extent is providing a way to verify downloaded content a requirement (in scope) here, or is the goal only to provide a direct link to a resolvable download? It is somewhat covered by Bagit and by the etag attribute in the RDA PID Kernel document.

mfenner commented 6 years ago

We should clarify the scope of this issue, which is what we can provide via DOI metadata and DOI services. We are not planning to go beyond one or more URLs and checksums. The protocols used for file downloads, the verification of downloaded content, and also permissions are out of scope. Once contentURLs as part of DOI metadata have become the norm, or at least seen significant uptake, we can start that discussion.

eocarragain commented 6 years ago

Sounds good & makes sense. Might be worth breaking this issue into user stories as is happening in recent issues. Cheers

pdurbin commented 5 years ago

Here's an example of how to get the URL for a file from Zenodo (using jq to show just the first file):

curl -s -H "Accept: application/ld+json" https://zenodo.org/api/records/1419226 | jq '.distribution[0]'

{
  "@type": "DataDownload",
  "contentUrl": "https://zenodo.org/api/files/149d8cde-076a-478a-a4df-26b061161c36/13.3.17A5_E9_36C_dataset.HDF5",
  "fileFormat": "hdf5"
}
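
The same extraction can be sketched in Python for clients that want every file URL rather than just the first. The function name is hypothetical; the field names (`distribution`, `contentUrl`) are the ones visible in the response above, and the assumption that `distribution` may be either a single object or a list is mine.

```python
import json


def content_urls(record: dict) -> list[str]:
    """Return every contentUrl found under the record's 'distribution' key."""
    distribution = record.get("distribution", [])
    # Assumption: a record may carry one object instead of a list.
    if isinstance(distribution, dict):
        distribution = [distribution]
    return [item["contentUrl"] for item in distribution
            if isinstance(item, dict) and item.get("contentUrl")]


# The sample reuses the distribution entry shown above.
sample = json.loads("""
{
  "distribution": [
    {
      "@type": "DataDownload",
      "contentUrl": "https://zenodo.org/api/files/149d8cde-076a-478a-a4df-26b061161c36/13.3.17A5_E9_36C_dataset.HDF5",
      "fileFormat": "hdf5"
    }
  ]
}
""")
print(content_urls(sample))
```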

Thank you @cboettig for linking to this conversation from https://github.com/whole-tale/whole-tale/issues/35#issuecomment-427397629

Hat tip to @jggautier for mentioning the Zenodo record above at https://github.com/IQSS/dataverse/issues/4371

mfenner commented 5 years ago

Thanks @pdurbin. This is functionality provided by Zenodo; unfortunately, the contentURL that they provide is not yet part of the DOI metadata they send. We will work with DataCite repositories to provide that information to DataCite so that we can include the content URL in the DOI metadata.

eocarragain commented 5 years ago

This also raises the question of whether the contentURL in the PID metadata should reference the whole thing (e.g. via the packaging formats referenced above or a simple tar/zip file) or whether an array of contentURLs, one for each file, is allowed. My vote would be for the former (with bonus points if the package contains all relevant PID metadata too), but this could make adoption/consensus harder to reach.

mfenner commented 5 years ago

My goal is a contentURL referencing a single file, ideally a bagit archive that also includes metadata. I have run into a use case where I need to support multiple contentURLs - the same content in multiple cloud locations (AWS, Google Cloud), but that is an edge case.

eocarragain commented 5 years ago

For reference, discussion of the "identifier for digital objects" PID schema being adopted by the Software Heritage archive: https://hal.archives-ouvertes.fr/hal-01865790v4 . Includes using hashes for ensuring the integrity of resolved content.

eocarragain commented 5 years ago

Also for reference, see MINIDs: http://minid.bd2k.org/. In addition to an array of locations/URLs, the JSON response has fields for "checksum", "checksum_function", and "content_key".
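
A minimal Python sketch of how a client could use a checksum plus a checksum function name to verify downloaded bytes is shown below. The function name and the set of accepted algorithm spellings are assumptions for illustration, not taken from the MINID documentation.

```python
import hashlib
import hmac

# Assumed spellings of the checksum function name, mapped to hashlib names.
_ALGORITHMS = {
    "md5": "md5",
    "sha1": "sha1",
    "sha256": "sha256",
    "sha512": "sha512",
}


def verify_checksum(data: bytes, checksum: str, checksum_function: str = "sha256") -> bool:
    """Return True if the hex digest of data matches the expected checksum."""
    # Normalize names such as "SHA-256" or "sha_256" to "sha256".
    key = checksum_function.lower().replace("-", "").replace("_", "")
    if key not in _ALGORITHMS:
        raise ValueError(f"unsupported checksum function: {checksum_function}")
    digest = hashlib.new(_ALGORITHMS[key], data).hexdigest()
    return hmac.compare_digest(digest, checksum.strip().lower())


payload = b"example dataset bytes"
expected = hashlib.sha256(payload).hexdigest()
print(verify_checksum(payload, expected, "SHA-256"))
```

For large files, the digest would normally be updated chunk by chunk while streaming the download instead of holding all bytes in memory.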

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

datacite / freya

Direct access to content associated with a DOI #2
