IQSS / dataverse

Open source research data repository software
http://dataverse.org

Dataverse should return machine-readable metadata to requesting clients/servers (content negotiation) #3699

Open jggautier opened 7 years ago

jggautier commented 7 years ago

Upon request from other machine clients and servers (e.g. other archives) accessing datasets through their persistent identifiers, Dataverse should be able to provide dataset metadata in available formats (JSON, DDI, etc.).

This is number 10 of the 11 recommendations made in A Data Citation Roadmap for Scholarly Data Repositories (https://doi.org/10.1101/097196).

pdurbin commented 7 years ago

@landreev doesn't our Harvesting (OAI-PMH) implementation already do some content negotiation?

The SWORD spec talks about content negotiation, but in practice our implementation of SWORD is very simple: for files, the only content we accept is the one required by SWORD, which is a zip file. We require SWORD clients uploading files to Dataverse to send this header: "Packaging: http://purl.org/net/sword/package/SimpleZip" as mentioned at http://guides.dataverse.org/en/4.6.1/api/sword.html

pdurbin commented 7 years ago

@jggautier does Harvesting count?

jggautier commented 7 years ago

@landreev very helpfully provided context when I was trying to understand the difference between this and the harvesting Dataverse does now, so he's letting me post his comments on it :)

Content negotiation is a mechanism that allows clients and servers (as in, non-human, machine clients) to agree on a communication format/protocol that they both understand. In this context, the server would assume by default that the client is a web browser, with a human user, and send them to the default landing page. A machine client from another archive would send an additional flag in the request saying "I'm a dataverse harvester, I understand the following metadata formats, ordered by preference: JSON-LD, DDI, Dublin Core"; the server would then either output the metadata in one of those formats, if available, or answer "sorry, this content is not available in any of the formats you requested". This may be possible to implement using the already existing, standard "Accept:" HTTP header; or maybe a special flag would need to be designed just for this purpose... that's probably more technical than you need at this point/than I can talk about comfortably without reading up on it some more.

Dataverses already do something of this nature when they harvest from each other. And we've been thinking about extending this content negotiation mechanism further. (At this point a dataverse client says to the dataverse server "I understand the Dataverse Astronomical Sci metadata block"; it should really say "I understand the ... block, version NNN", because we've realized that the blocks are going to keep getting modified as people use them.)

This type of content negotiation seems a lot more flexible.
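To make the "Accept:" header idea above a bit more concrete, here is a minimal sketch of how a server might pick a format from a client's ordered preferences. This is not how Dataverse does it; the AVAILABLE mapping, the media types chosen for each format (for example application/xml for DDI), and the function names are all assumptions made for illustration.

```python
def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, most preferred first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable weight as "not wanted"
        prefs.append((media_type, q, position))
    # Highest q first; ties keep the order the client wrote them in.
    prefs.sort(key=lambda p: (-p[1], p[2]))
    return [(m, q) for m, q, _ in prefs if q > 0]


# Hypothetical formats a repository could serve for one dataset.
AVAILABLE = {
    "application/ld+json": "Schema.org JSON-LD",
    "application/xml": "DDI",
    "text/html": "landing page",
}


def negotiate(accept_header, available=AVAILABLE, default="text/html"):
    """Pick the media type to serve, or None if nothing acceptable exists."""
    if not accept_header:
        return default  # no stated preference: assume a browser
    for media_type, _ in parse_accept(accept_header):
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return default
    return None  # the caller would answer 406 Not Acceptable


print(negotiate("application/ld+json, application/xml;q=0.8"))
print(negotiate("text/turtle"))
```

A client that sends no Accept header, or "*/*", gets the landing page, which matches the "assume a browser by default" behaviour described above. The sketch ignores partial wildcards such as "application/*".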

pameyer commented 7 years ago

One potential difference between this and harvesting is that there may be an assumption that the content-type negotiation is happening at the dataset landing page, instead of a separate harvesting endpoint.

pdurbin commented 7 years ago

@jggautier so what would you consider "definition of done" to be for this issue? I think we could easily argue that Dataverse already meets the recommendation. We could write it up in the User Guide if you want. In addition to Harvesting, we have Export in various formats that are machine readable. The standards-based ones are DDI and Dublin Core: http://guides.dataverse.org/en/4.7/admin/metadataexport.html

jggautier commented 3 years ago

Sorry for this very late reply. Guess I didn't understand enough back then, and still have some questions.

A Data Citation Roadmap for Scholarly Data Repositories recommends that "data repositories and identifier service providers such as identifiers.org or DataCite in addition may implement content negotiation for the persistent identifier expressed as HTTP URI, returning machine readable metadata in various formats." The article uses DataCite's implementation as an example:

curl -LH "Accept: application/ld+json" http://doi.org/10.5061/DRYAD.8290N returns DataCite's Schema.org JSON-LD metadata for that dataset, which is published in the Dryad repository. (See DataCite's page on content negotiation for more info.)
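For anyone who'd rather script this than use curl, a rough Python equivalent using only the standard library might look like the following. The function names are made up for this sketch, and fetch_metadata assumes the response body is UTF-8 text.

```python
import urllib.request


def build_metadata_request(pid_url, media_type):
    """Build (but do not send) a GET request asking for one metadata format."""
    return urllib.request.Request(pid_url, headers={"Accept": media_type})


def fetch_metadata(pid_url, media_type, timeout=30):
    """Send the request; urllib follows the resolver's redirects by default,
    much like curl's -L flag."""
    request = build_metadata_request(pid_url, media_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: the body is UTF-8 encoded text.
        return response.headers.get("Content-Type"), response.read().decode("utf-8")


# Building the request needs no network access.
request = build_metadata_request(
    "http://doi.org/10.5061/DRYAD.8290N", "application/ld+json"
)
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Calling fetch_metadata with the same two arguments would actually send the request, so it needs network access, and what comes back depends on the DOI resolver and DataCite at the time.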

This already works for Dataverse-based repositories that publish datasets with DataCite DOIs. So systems can use this content negotiation to get metadata about datasets with DataCite DOIs published in Dataverse-based repositories. But what's returned is the metadata that DataCite publishes. This doesn't work for getting the metadata that the Dataverse repository publishes. For example:

Systems could use Dataverse's API or OAI-PMH, but in general the value of the kind of content negotiation that the article recommends is that it's standardized and more stable, right? Systems' APIs might be organized differently from each other and could change over time. And OAI-PMH supports metadata only in XML, while this type of content negotiation allows for metadata in any format, like the JSON in the Schema.org example above.

These are the questions I'd ask to help define the "definition of done" for this issue:

jggautier commented 3 years ago

I've been emailing the article's corresponding author Tim Clark, who's looking into the questions in the last comment.

This has also been discussed in the context of tools for assessing the "FAIR"ness of datasets, as part of the FAIRsFAIR project.

hvdsomp commented 3 years ago

One can't implement content negotiation for URIs that are not under one's control. So, Dataverse cannot implement content negotiation for a DOI HTTP-URI because it doesn't control those DOI URIs. DataCite and CrossRef can (and do), and in doing so allow access to the metadata they have about a DOI-identified object.

Signposting offers (among others) a way to get to metadata about the object that is available at the end of a (Dataverse) repository:
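As a rough illustration of the Signposting idea: a landing page response can carry HTTP Link headers, and links with rel="describedby" point at metadata about the object. The sketch below only parses such a header value. The header string and URLs in it are hypothetical, not taken from any Dataverse installation, and the parser is deliberately simple (it does not handle commas or semicolons inside quoted parameter values).

```python
import re


def parse_link_header(value):
    """Parse an HTTP Link header value into a list of (target, params) pairs."""
    links = []
    # Each link looks like: <target>; name="value"; name=value
    for match in re.finditer(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)', value):
        target, raw_params = match.group(1), match.group(2)
        params = {}
        for part in raw_params.split(";"):
            if "=" in part:
                name, _, val = part.partition("=")
                params[name.strip().lower()] = val.strip().strip('"')
        links.append((target, params))
    return links


def describedby_links(value):
    """Return (url, media_type) for every link whose rel includes describedby."""
    return [
        (target, params.get("type"))
        for target, params in parse_link_header(value)
        if "describedby" in params.get("rel", "").split()
    ]


# Hypothetical Link header from a dataset landing page.
example = (
    '<https://doi.org/10.1234/EXAMPLE>; rel="cite-as", '
    '<https://repo.example.org/export?format=jsonld>; rel="describedby"; '
    'type="application/ld+json"'
)
for url, media_type in describedby_links(example):
    print(media_type, url)
```

The point of the design is that the repository advertises its own metadata URLs on a page it controls, so a client does not need content negotiation on the DOI URI to find them.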

jggautier commented 3 years ago

Thanks @hvdsomp. You wrote that "One can't implement content negotiation for URIs that are not under one's control." That's an incredibly helpful way to put it. It doesn't seem like this is a recommendation that data repositories can actually implement then, right?

We can encourage the people who do control those URIs but haven't implemented content negotiation to implement content negotiation. I'm not sure how Handles work differently than DOIs, but there are at least 7 Dataverse repositories using them, and I'm not sure if content negotiation works for their Handle URIs. curl -LH "Accept: application/ld+json" https://hdl.handle.net/11529/10548581 doesn't seem to work.

Does anyone keeping an eye on this GitHub issue know more about Handles or know someone who knows more? I'll wait a week before asking in other channels (Dataverse Google Group, Code4Lib mailing list, emailing admins of repositories using Handles).

jggautier commented 2 years ago

I asked in the Dataverse Google Group but haven't had any replies yet.

At @pdurbin's suggestion I also posted questions in the PID Forum, where I also referenced an older post in that forum that makes me question my understanding of this tenth recommendation and of content negotiation in general.

hvdsomp commented 2 years ago

The technology underlying handles and DOIs is the same, or, to put it differently, DOIs are handles. But organizations like CrossRef and DataCite have implemented a lot of functionality on top of DOIs, including content negotiation with the DOI-HTTP-URI as a means to obtain metadata in various formats, see e.g. https://www.crossref.org/documentation/retrieve-metadata/content-negotiation/.