ESIPFed / science-on-schema.org

science-on-schema.org - providing guidance for publishing schema.org as JSON-LD for the sciences
Apache License 2.0

Avoid using DOI URLs for node identifiers #197

Open datadavev opened 2 years ago

datadavev commented 2 years ago

The JSON-LD 1.1 Processing Algorithms and API specification ^1 provides guidance on retrieval of JSON-LD over HTTP in the section Remote Document and Context Retrieval:

When requesting remote documents the request MUST prefer Content-Type application/ld+json followed by application/json.

If a node identifier is a DOI URL that must be resolved, for example "@id":"https://doi.org/10.5066/F7VX0DMQ", then the resolved resource may not come from the authoritative source. Requesting that resource the way a web browser would, with HTML given the highest content priority, results in the following resolution sequence (> indicates a request, < the response):

(request made with Accept: application/ld+json;q=0.7,application/json;q=0.6,text/html;q=0.9)

> GET: https://doi.org/10.5066/F7VX0DMQ
< 302 https://www.sciencebase.gov/catalog/item/56b3e649e4b0cc79997fb5ec

> GET: https://www.sciencebase.gov/catalog/item/56b3e649e4b0cc79997fb5ec
< 200 text/html;charset=utf-8

SUMMARY: Start URL: https://doi.org/10.5066/F7VX0DMQ
SUMMARY: Final URL: https://www.sciencebase.gov/catalog/item/56b3e649e4b0cc79997fb5ec

If instead the media type application/ld+json is given a higher priority than HTML, a different resource is resolved (with a failure in this case):

(request made with Accept: application/ld+json;q=0.7,application/json;q=0.6,text/html;q=0.2)

> GET: https://doi.org/10.5066/F7VX0DMQ
< 302 https://data.crosscite.org/10.5066%2FF7VX0DMQ

> GET: https://data.crosscite.org/10.5066%2FF7VX0DMQ
< 503 text/html; charset=UTF-8

SUMMARY: Start URL: https://doi.org/10.5066/F7VX0DMQ
SUMMARY: Final URL: https://data.crosscite.org/10.5066%2FF7VX0DMQ

This behavior may lead to unintended consequences. Until DOI resolvers correct it, it seems prudent to avoid using DOIs as node identifiers in linked data systems that depend on reliable resolution of JSON-LD resources.
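The two requests above differ only in the q-value given to text/html. The sketch below is a rough illustration of how a server might rank the media types in each Accept header. It assumes plain q-value ordering as described in the HTTP semantics specification; the function name `parse_accept` is hypothetical, and the resolver's actual selection logic is not known from this thread.

```python
def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, most preferred first.

    Simplified: ignores media type parameters other than q and does not
    handle wildcards specially. Ties keep their original order.
    """
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when absent
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q, position))
    prefs.sort(key=lambda t: (-t[1], t[2]))
    return [(m, q) for m, q, _ in prefs]


# Accept header values copied from the two traces in this issue.
browser_like = "application/ld+json;q=0.7,application/json;q=0.6,text/html;q=0.9"
jsonld_first = "application/ld+json;q=0.7,application/json;q=0.6,text/html;q=0.2"

print(parse_accept(browser_like)[0][0])
print(parse_accept(jsonld_first)[0][0])
```

With the first header text/html ranks highest, and with the second application/ld+json does, which matches the point at which the two resolution sequences diverge.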

[Edit: added Accept request header values]

mbjones commented 2 years ago

Thanks @datadavev, seems challenging. So, this seems like a side-effect of overloading content negotiation to perform multiple roles in different contexts (e.g., for crossref as a way to trigger a metadata request API about an item rather than redirecting to the item itself). But the proposal to not use DOIs as the node identifiers seems problematic if that DOI is meant to be the long-term resolvable URI for the dataset. If the DOI is the only stable URI for a dataset, and represents the preferred identifier for the dataset, what should the node identifier be set to?

datadavev commented 2 years ago

The DOI resolver service (actually any identifier resolver) should respond with its own metadata representation of a resource only when that representation is specifically requested.

RFC 8288 provides a mechanism for advertising the availability of related resources, including alternate representations of a resource. The resolver can advertise an alternate representation of a resource by including such information in the HTTP response Link header.

At a minimum such information should be included when a resolver returns an alternate representation of a resource not specifically presented by the resource authority. Ideally, the DOI resolver should only respond with information about the location of the requested resource.

So the resolver returns the original representation by default and advertises the availability of an alternate representation through a Link header in the response. For example, the DOI resolver could respond along the following lines, with the redirect response including a link to the location of the resolver metadata (fake example):

> https://doi.org/10.5066/F7VX0DMQ Accept:application/ld+json
< Link: <https://data.crosscite.org/10.5066%2FF7VX0DMQ>; rel="alternate"; type="application/ld+json"; profile="http://datacite.org/schema"
< 302 https://www.sciencebase.gov/catalog/item/56b3e649e4b0cc79997fb5ec

> https://www.sciencebase.gov/catalog/item/56b3e649e4b0cc79997fb5ec Accept:application/ld+json
< 200 application/ld+json

Note however that link header handling in redirect responses would only be available to programmatic clients. Intermediate communication details are not exposed to web browser based clients.
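A programmatic client could pick up such an advertisement roughly as follows. This is a minimal sketch, assuming the resolver emitted a Link header like the fake example above. The helper names `parse_link_header` and `find_alternate` are hypothetical, and the parser covers only the simple quoted-parameter form shown here, not every case allowed by RFC 8288.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
_LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,=\s]+\s*=\s*(?:"[^"]*"|[^;,\s]*))*)')
_PARAM = re.compile(r';\s*([^;,=\s]+)\s*=\s*(?:"([^"]*)"|([^;,\s]*))')


def parse_link_header(value):
    """Parse a Link header value into a list of dicts (target plus parameters)."""
    links = []
    for match in _LINK.finditer(value):
        link = {"target": match.group(1)}
        for p in _PARAM.finditer(match.group(2)):
            name = p.group(1).lower()
            val = p.group(2) if p.group(2) is not None else p.group(3)
            link.setdefault(name, val)  # first occurrence of a parameter wins
        links.append(link)
    return links


def find_alternate(links, media_type):
    """Return the target of the first rel="alternate" link with the given type."""
    for link in links:
        rels = link.get("rel", "").lower().split()
        if "alternate" in rels and link.get("type", "").lower() == media_type:
            return link["target"]
    return None


# Header value taken from the fake example in this comment.
header = ('<https://data.crosscite.org/10.5066%2FF7VX0DMQ>; rel="alternate"; '
          'type="application/ld+json"; profile="http://datacite.org/schema"')
links = parse_link_header(header)
print(find_alternate(links, "application/ld+json"))
```

The client would read this header from the 302 response itself, which is why redirects need to be handled manually rather than followed automatically.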

DataCite is aware of the issue, but a change in their resolving behavior may have other side effects that need to be considered.

smrgeoinfo commented 2 years ago

Perhaps the problem is interpretation of what the DOI identifies.

If the DOI identifies a dataset, then resolving the DOI should get a representation of the dataset, e.g. a CSV, NetCDF, or ESRI shapefile, i.e. some serialization of the dataset. We have accepted the notion that a landing page is a representation of a dataset.

The node identifier in JSON-LD identifies the node, i.e. a JSON object. That JSON object might be about a dataset, in which case it is functionally analogous to a landing page, but in this case I'd argue that the node identifier identifies a particular representation (the JSON object) that is about the thing the DOI identifies. From that point of view the simple solution is to use a different identifier for the node.

datadavev commented 2 years ago

The @id property is a node identifier ^1, and dereferencing that URI should result in a representation of that node. The json-ld spec provides guidance on the process for dereferencing a URI for a json-ld resource ^2.

The problem is that when a client encounters a DOI for a node identifier and the guidelines are followed for dereferencing the node identifier, the resulting document is an unexpected alternative representation from CrossRef (at best) or an error condition. It is not the resource offered by the resource owner. This breaks the linked data expectations. Furthermore, it seems there's no way around this for json-ld resources other than to make the request in a manner inconsistent with the json-ld spec.

Note that requesting a different RDF serialization (e.g. application/n-quads) results in redirection to the expected location. Hence, the issue seems to be apparent only when requesting a json-ld representation of the identified resource.

To me at least, this behavior is problematic since the resolution service is subverting the resolution request and returning a resource that was not requested. The resolver should present a different rendering of the resource only when specifically requested (such as through a different API or through specific request parameters).

The solution is fairly straightforward, but it seems it does need to be implemented by DataCite. The alternative is for json-ld clients to implement custom behavior when dereferencing a DOI.
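One possible shape for such custom client behavior is sketched below. It is an illustration under stated assumptions, not a recommended implementation: the client sends an HTML-preferring Accept header to the DOI resolver hop only (assumed to yield the landing page redirect, as in the first trace in this issue) and then asks for JSON-LD on every other hop. The names `dereference`, `send`, `fake_send`, and the header constants are hypothetical. `fake_send` simulates the responses reported in this thread instead of touching the network, and its 200 application/ld+json response from the landing page comes from the fake example above, not from an observed request.

```python
from urllib.parse import urljoin, urlsplit

DOI_HOSTS = {"doi.org", "dx.doi.org"}
JSONLD_ACCEPT = "application/ld+json, application/json;q=0.9"
RESOLVER_ACCEPT = "text/html"  # assumption: avoids the metadata redirect
REDIRECTS = {301, 302, 303, 307, 308}


def is_doi_url(url):
    """True when the URL points at a DOI resolver host."""
    return (urlsplit(url).hostname or "").lower() in DOI_HOSTS


def dereference(url, send, max_hops=5):
    """Follow redirects by hand, choosing the Accept header per hop.

    send(url, accept) must return (status, location_or_None, content_type)
    and must not follow redirects itself.
    """
    trace = []
    for _ in range(max_hops):
        accept = RESOLVER_ACCEPT if is_doi_url(url) else JSONLD_ACCEPT
        status, location, content_type = send(url, accept)
        trace.append((url, accept, status))
        if status in REDIRECTS and location:
            url = urljoin(url, location)
            continue
        return url, status, content_type, trace
    raise RuntimeError("too many redirects")


LANDING = "https://www.sciencebase.gov/catalog/item/56b3e649e4b0cc79997fb5ec"
CROSSCITE = "https://data.crosscite.org/10.5066%2FF7VX0DMQ"


def fake_send(url, accept):
    """Simulated responses based on the traces in this thread (no network)."""
    if is_doi_url(url):
        target = CROSSCITE if accept.startswith("application/ld+json") else LANDING
        return 302, target, None
    if url == CROSSCITE:
        return 503, None, "text/html"
    return 200, None, "application/ld+json"  # hypothetical landing page response


final_url, status, content_type, trace = dereference(
    "https://doi.org/10.5066/F7VX0DMQ", fake_send)
print(final_url, status, content_type)
```

A real `send` would wrap an HTTP client with automatic redirects disabled. Note that this approach deliberately departs from the json-ld retrieval guidance on the resolver hop, which is the inconsistency described above.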