SciCatProject / scicat-backend-next

SciCat Data Catalogue Backend
https://scicatproject.github.io/documentation/
BSD 3-Clause "New" or "Revised" License

Add ability to describe published data according to standard schema #1192

Open paulmillar opened 7 months ago

paulmillar commented 7 months ago


Summary

SciCat has the concept of published data; that is, a set of one or more datasets that, collectively, are described by certain metadata fields. This metadata description is stored as a MongoDB document with the class PublishedData.

The backend has the ability to map this information to DataCite's XML schema, but only does this when making DataCite API requests for DOI activity. This ability to map PublishedData to a corresponding DataCite XML description isn't exposed by a SciCat API.

Perhaps because the DataCite description is not exposed, the oai-provider-service reimplements the same mapping functionality (albeit not completely consistently). OAI-PMH also provides a Dublin Core description (as required by the OAI-PMH specification), which might also be useful under different circumstances.

Steps to Reproduce

When minting a DOI, the SciCat backend generates XML that conforms to the DataCite XML schema. The OAI-PMH service does the same when a client queries the OpenAIRE (/openaire/oai) OAI-PMH endpoint.

Current Behaviour

Any client that wishes to generate a standards-compliant description of a PublishedData document needs to implement the mapping itself. This implies duplication of effort.

Should the PublishedData class be extended so that additional metadata is recorded (e.g., ORCIDs), and should that additional metadata be included in some standard metadata description (e.g., DataCite), then every service that generates that metadata description (e.g., DOI minting, OAI-PMH) would need to be updated.

Expected Behaviour

The PublishedData API endpoint is extended to support querying for a description of a specific PublishedData document. This API extension would likely take two arguments: the metadata standard (e.g., Dublin Core, DataCite, Schema.org, ...) and the serialisation. In some cases, only one serialisation makes sense (e.g., DataCite and XML), but in other cases there may be multiple possible serialisations (e.g., Schema.org as JSON-LD, Turtle, RDF/XML, N3, ...).
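To make the two-argument idea concrete, here is a minimal sketch of how a request for a (standard, serialisation) pair might be validated. All names here (`MetadataStandard`, `Serialisation`, `SUPPORTED`, `resolveSerialisation`) are hypothetical and not part of the SciCat backend, and the table of supported combinations is an assumption for illustration only.

```typescript
// Hypothetical types: none of these names exist in the SciCat backend.
type MetadataStandard = "datacite" | "dublincore" | "schema.org";
type Serialisation = "xml" | "json-ld" | "turtle" | "rdf-xml" | "n3";

// Assumed table of which serialisations make sense for each standard.
// The first entry of each list is treated as the default.
const SUPPORTED: Record<MetadataStandard, readonly Serialisation[]> = {
  datacite: ["xml"],
  dublincore: ["xml"], // assumption for illustration
  "schema.org": ["json-ld", "turtle", "rdf-xml", "n3"],
};

// Pick the serialisation to use for a request, or reject the combination.
function resolveSerialisation(
  standard: MetadataStandard,
  requested?: Serialisation,
): Serialisation {
  const allowed = SUPPORTED[standard];
  const fallback = allowed[0];
  if (fallback === undefined) {
    throw new Error(`No serialisation configured for ${standard}`);
  }
  if (requested === undefined) {
    return fallback; // caller did not ask for a specific serialisation
  }
  if (allowed.indexOf(requested) === -1) {
    throw new Error(`${standard} cannot be serialised as ${requested}`);
  }
  return requested;
}
```

With a table like this, a request for DataCite needs no serialisation argument at all, while a request for Schema.org can choose between several.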

The backend DOI minting activity would take advantage of this ability (although it might not issue HTTP requests) when generating the XML metadata for DataCite. The OAI-PMH interface could talk with the backend, rather than querying the MongoDB directly. The landing page could take advantage of this when including a Schema.org/JSON-LD description of the published data.

These would be natural places where this new API could be used (there may be others). I suggest this issue is closed when the extended PublishedData API is available; ancillary issues should be opened against other SciCat components to track progress in adopting the new API (as appropriate).
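As an illustration of the landing-page use case, the following sketch maps a simplified record to a Schema.org `Dataset` description that could be serialised as JSON-LD. The `PublishedDataLike` interface is a hypothetical, cut-down stand-in for the real PublishedData class, and the choice of Schema.org properties is an assumption rather than an agreed mapping.

```typescript
// Hypothetical, simplified stand-in for a PublishedData document.
interface PublishedDataLike {
  doi: string;
  title: string;
  abstract: string;
  creator: string[];
}

// Build a Schema.org Dataset description as a plain object.
// The property choices below are illustrative, not an agreed mapping.
function toSchemaOrgJsonLd(pd: PublishedDataLike): Record<string, unknown> {
  return {
    "@context": "https://schema.org",
    "@type": "Dataset",
    identifier: `https://doi.org/${pd.doi}`,
    name: pd.title,
    description: pd.abstract,
    creator: pd.creator.map((name) => ({ "@type": "Person", name })),
  };
}

// Example with made-up values; JSON.stringify yields the JSON-LD text.
const exampleJsonLd = JSON.stringify(
  toSchemaOrgJsonLd({
    doi: "10.1234/example",
    title: "Example published data",
    abstract: "A made-up record used only for illustration.",
    creator: ["Paul Millar"],
  }),
  null,
  2,
);
```

A landing page could embed the resulting string in a `<script type="application/ld+json">` element.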

Extra Details

This issue is the result of discussion on issue #1175. Some of the comments there are useful for this issue.

paulmillar commented 2 days ago

To provide more precise pointers:

As an example of inconsistencies:

| concept | published-data.controller.ts | openaire-mapper.ts |
| --- | --- | --- |
| description[@type=Abstract] | abstract | dataDescription |
| creator/{givenName, familyName} | split on first space in name; first term is givenName, rest is familyName | These fields are not provided. |
| creator/creatorName | familyName, givenName (e.g., Millar, Paul) | creator (e.g., Paul Millar) |
| resourceType[@Dataset] | resourceType | This field is not provided. |
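The name handling described for published-data.controller.ts (split on the first space, then emit "familyName, givenName") can be sketched as below. This is not the backend's actual code: `splitCreator` and `CreatorName` are hypothetical names, and the handling of a name without any space is an assumption.

```typescript
// Hypothetical result type; not taken from the SciCat backend.
interface CreatorName {
  givenName: string;
  familyName: string;
  creatorName: string;
}

// Split a display name on its first space: the first term becomes
// givenName and everything after it becomes familyName.
function splitCreator(name: string): CreatorName {
  const trimmed = name.trim();
  const firstSpace = trimmed.indexOf(" ");
  if (firstSpace === -1) {
    // Assumption: a single-term name is kept whole as the givenName.
    return { givenName: trimmed, familyName: "", creatorName: trimmed };
  }
  const givenName = trimmed.slice(0, firstSpace);
  const familyName = trimmed.slice(firstSpace + 1).trim();
  return { givenName, familyName, creatorName: `${familyName}, ${givenName}` };
}
```

Here `splitCreator("Paul Millar").creatorName` is `"Millar, Paul"`, whereas the openaire-mapper.ts column above would pass `"Paul Millar"` through unchanged, which is the inconsistency being described.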