SciCatProject / scicat-backend-next

SciCat Data Catalogue Backend
https://scicatproject.github.io/documentation/
BSD 3-Clause "New" or "Revised" License

No standard mapping from SciCat metadata to DataCite metadata #1175

Closed: paulmillar closed this issue 5 months ago

paulmillar commented 5 months ago

The SciCat data model has various fields that can describe a dataset. Together, these form a kind of metadata standard for describing datasets.

DataCite maintain a metadata standard that may be used when describing a dataset. Perhaps the most prominent use is when creating DOIs, but various OAI-PMH endpoints support clients querying for a DataCite description of their records.

Currently (to the best of my knowledge) SciCat does not provide a canonical mapping of SciCat metadata to DataCite metadata. The closest is partial support for such a mapping in the oai-provider-service module, which provides a (limited) DataCite-based description of datasets.

It would be helpful if SciCat provided a standard mapping from the SciCat data model to the DataCite metadata standard.

Simply documenting the mapping would be a good start, but perhaps the REST API could be extended so that it provides a way for a client to obtain a DataCite description of a given dataset.

Such support would make it easier to mint DOIs, as support for obtaining a DataCite description of the dataset would be provided by SciCat.

sbliven commented 5 months ago

Does it make sense for the core backend to provide DataCite, or would it be sufficient to expand the oai-provider-service to provide more complete support?

paulmillar commented 5 months ago

Hi @sbliven,

I don't have a strong opinion, here, but would share some thoughts and observations.

The primary goal for this issue is to make it easier to mint a DOI. Ideally, SciCat would provide a standard place (within the SciCat data model) for all DataCite metadata fields, so facilities don't require any customisation.

For minting DOIs, the metadata description (of a dataset) supplied to DataCite could come from OAI-PMH. Moreover, I believe there are generic DOI-minting services that acquire the DataCite metadata description by querying an OAI-PMH endpoint.

The current oai-provider-service seems to query the MongoDB directly, without using the SciCat data model classes. If so, then there's a risk that changes to the underlying MongoDB storage layout could break OAI-PMH support unless a corresponding change is made to oai-provider-service (which makes the combination of these services somewhat fragile).

There are other potential scenarios that would benefit from having an easy/supported way of generating a description (of some dataset) that conforms to some external metadata standard. For example, a DOI landing page might include a JSON-LD description of a dataset based on Schema.org, to support harvesting of datasets.
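To make the landing-page scenario concrete, here is a minimal sketch of building a Schema.org JSON-LD description. The `PublishedRecord` interface and the `toSchemaOrgJsonLd` function are hypothetical; the field names are assumptions and are not taken from the SciCat code base.

```typescript
// Hypothetical record shape; field names are assumptions for illustration.
interface PublishedRecord {
  doi: string;
  title: string;
  dataDescription: string;
  creator: string[];
  publisher: string;
  publicationYear: number;
}

// Build a Schema.org "Dataset" description serialised as JSON-LD.
function toSchemaOrgJsonLd(record: PublishedRecord): string {
  const doiUrl = "https://doi.org/" + record.doi;
  const doc = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "@id": doiUrl,
    identifier: doiUrl,
    name: record.title,
    description: record.dataDescription,
    creator: record.creator.map((name) => ({ "@type": "Person", name: name })),
    publisher: { "@type": "Organization", name: record.publisher },
    datePublished: String(record.publicationYear),
  };
  return JSON.stringify(doc, null, 2);
}
```

A landing page would typically embed the returned string in a `<script type="application/ld+json">` element, taking care to escape any `<` characters in the JSON first.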

So, I'm wondering whether the backend could be updated to provide a REST API where the client can request a description of a dataset according to some external metadata standard (Dublin Core, DataCite, Schema.org, etc). The oai-provider-service could be refactored to take advantage of this API. Under this approach, the same API could also be used when minting DOIs.

In some sense, such an interface would involve migrating some of the oai-provider-service functionality into backend, to allow more direct reuse in other scenarios.
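The idea of one interface serving several metadata standards could be sketched as a small registry of per-format functions. Everything here is hypothetical: the `DatasetLike` shape, the format names, and the `describeDataset` function are assumptions for illustration, not an existing SciCat API.

```typescript
// Hypothetical minimal view of a dataset (field names are assumptions).
interface DatasetLike {
  pid: string;
  datasetName: string;
}

type MetadataFormat = "dublin-core" | "datacite";
type Describer = (dataset: DatasetLike) => Record<string, string>;

// One function per supported external standard.
const describers: Record<MetadataFormat, Describer> = {
  "dublin-core": (d) => ({ "dc:title": d.datasetName, "dc:identifier": d.pid }),
  datacite: (d) => ({ title: d.datasetName, identifier: d.pid }),
};

// Narrow an arbitrary string (e.g. a query parameter) to a known format.
function isMetadataFormat(value: string): value is MetadataFormat {
  return Object.prototype.hasOwnProperty.call(describers, value);
}

// Return the description in the requested format, or fail loudly.
function describeDataset(dataset: DatasetLike, format: string): Record<string, string> {
  if (!isMetadataFormat(format)) {
    throw new Error("Unsupported metadata format: " + format);
  }
  return describers[format](dataset);
}
```

With this shape, both an OAI-PMH service and a DOI-minting step could call the same function and differ only in the format they request.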

Just to be clear: updating the oai-provider-service should be fine, but it might not be the best long-term approach.

nitrosx commented 5 months ago

@paulmillar if I understand correctly, you are suggesting extending the BE with an endpoint that returns the requested item in the requested format, where the field mapping between that format and the SciCat entity is configurable, although a default configuration should be provided. Is that correct?

Marking this issue as "possible feature" and "needs discussion"

paulmillar commented 5 months ago

Hi @nitrosx,

Sorry, the discussion perhaps went a little off on a tangent.

Just to be clear, the goal would be to have a standard mapping from the SciCat data model to the DataCite metadata schema.

This goal could be achieved by having a document (a text file, markdown, CSV, ODT, ...) that describes, for each DataCite metadata schema element (at least, those of interest), where that information is available within the SciCat data model.

For example:

    DataCite:title <-- SciCat:DatasetClass.datasetName (or SciCat:ProposalClass.title ?)
    DataCite:size  <-- SciCat:DatasetClass.{size, numberOfFiles} or SciCat:DatasetClass.{packedSize, numberOfFilesArchived}

The idea is that, if this mapping were to be standardised then DOI minting could be supported without requiring any facility-specific configuration or behaviour; i.e., out-of-the-box support for DOI minting --- just configure DataCite credentials.

Instead of writing down the mapping as a human-readable document, the mapping could be described in some machine actionable way (e.g., written in TypeScript). The API idea (allowing a client to ask backend for a DataCite description of a dataset) is just one way to use such a machine actionable description.
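As an illustration of a machine-actionable mapping, the two example rules above could be written as a TypeScript table. This is only a sketch: `DatasetFields`, `DataCiteDraft`, and `dataCiteMapping` are hypothetical names, and the choice of `datasetName` and `size`/`numberOfFiles` simply picks the first alternative from each example line.

```typescript
// Assumed subset of the SciCat Dataset model (only the fields used here).
interface DatasetFields {
  datasetName: string;
  size: number; // assumed to be in bytes
  numberOfFiles: number;
}

// Hypothetical draft of the DataCite elements covered by this sketch.
interface DataCiteDraft {
  titles: { title: string }[];
  sizes: string[];
}

// Each entry records which SciCat fields feed a DataCite element and how.
const dataCiteMapping = {
  titles: {
    sources: ["datasetName"],
    derive: (d: DatasetFields) => [{ title: d.datasetName }],
  },
  sizes: {
    sources: ["size", "numberOfFiles"],
    derive: (d: DatasetFields) => [d.size + " bytes", d.numberOfFiles + " files"],
  },
};

// Apply the mapping to one dataset.
function toDataCiteDraft(d: DatasetFields): DataCiteDraft {
  return {
    titles: dataCiteMapping.titles.derive(d),
    sizes: dataCiteMapping.sizes.derive(d),
  };
}

// Render the same table as human-readable documentation lines.
function describeMapping(): string[] {
  const elements = Object.keys(dataCiteMapping) as (keyof typeof dataCiteMapping)[];
  return elements.map(
    (element) =>
      "DataCite:" + element + " <-- SciCat Dataset." + dataCiteMapping[element].sources.join(", ")
  );
}
```

Keeping the source fields next to the derivation means the documentation and the behaviour come from a single table and cannot drift apart.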

I hope that explains the issue a little better.

nitrosx commented 5 months ago

@paulmillar would you like to contribute the first document in a PR? Should we place this mapping in the documentation?

paulmillar commented 5 months ago

A quick update.

Previously, I hadn't appreciated that the metadata describing a dataset (used for DOI minting and OAI-PMH) comes from a separate class, PublishedData, and not from Dataset. Seemingly, some data is copied across from Dataset or Proposal in order to pre-fill some fields, but the user (requesting a DOI) can modify those values.

I've gone through the oai-provider-service and here are the mappings for Dublin Core (encoded as an OAI-PMH record):

<oai_dc:dc xmlns:oai_dc='http://www.openarchives.org/OAI/2.0/oai_dc/' xmlns:dc='http://purl.org/dc/elements/1.1/'>
    <dc:title>{{record.title}}</dc:title>
    <dc:description>{{record.dataDescription}}</dc:description>
    <dc:identifier>{{record[this.collection_id]}}</dc:identifier>
    <dc:identifier>{{process.env.BASE_URL + "/detail/" + encodeURIComponent(record[this.collection_id])}}</dc:identifier>
    <dc:date>{{record.publicationYear}}</dc:date>
    <dc:creator>{{record.creator}}</dc:creator>
    <dc:type>dataset</dc:type>
    <dc:publisher>{{record.publisher}}</dc:publisher>
    <dc:rights>Available to the public.</dc:rights>
</oai_dc:dc>

and DataCite (again, encoded as an OAI-PMH record):

<datacite:resource xmlns:datacite="http://datacite.org/schema/kernel-4">
    <datacite:titles>
        <title>{{record.title}}</title>
    </datacite:titles>
    <datacite:identifier identifierType="URL">https://doi.org/{{record[this.collection_id].toString()}}</datacite:identifier>
    <datacite:descriptions>
        <description descriptionType="Abstract">
            {{record.dataDescription}}
        </description>
    </datacite:descriptions>
    <datacite:dates>
        <datacite:date dateType="Issued">2020-01-01</datacite:date>
        <datacite:date dateType="Available">2020-01-01</datacite:date>
    </datacite:dates>
    <datacite:publicationYear>{{record.publicationYear}}</datacite:publicationYear>
    <datacite:creators>
        <creator>
            <creatorName>{{record.creator}}</creatorName>
            <affiliation>{{record.affiliation}}</affiliation>
        </creator>
    </datacite:creators>
    <datacite:publisher>{{record.publisher}}</datacite:publisher>
    <datacite:rightsList>
        <datacite:rights rightsURI="info:eu-repo/semantics/openAccess">OpenAccess</datacite:rights>
    </datacite:rightsList>
</datacite:resource>

In both cases, record is an instance of PublishedData, filtered with record.status == "registered".
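The Dublin Core template and the status filter can be sketched as plain TypeScript functions. This is not the oai-provider-service code: the `PublishedRecord` shape, the use of a `doi` field in place of `record[this.collection_id]`, the `baseUrl` parameter standing in for `process.env.BASE_URL`, and the XML escaping are all assumptions made for this example.

```typescript
// Hypothetical record shape; `doi` stands in for record[this.collection_id].
interface PublishedRecord {
  doi: string;
  title: string;
  dataDescription: string;
  creator: string;
  publisher: string;
  publicationYear: number;
  status: string;
}

// Escape XML special characters so field values cannot break the record.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Keep only records whose DOI has been registered.
function registeredOnly(records: PublishedRecord[]): PublishedRecord[] {
  return records.filter((r) => r.status === "registered");
}

// Render one record using the same element mapping as the template above.
function toDublinCore(record: PublishedRecord, baseUrl: string): string {
  const landingPage = baseUrl + "/detail/" + encodeURIComponent(record.doi);
  return [
    "<oai_dc:dc xmlns:oai_dc='http://www.openarchives.org/OAI/2.0/oai_dc/' xmlns:dc='http://purl.org/dc/elements/1.1/'>",
    "    <dc:title>" + escapeXml(record.title) + "</dc:title>",
    "    <dc:description>" + escapeXml(record.dataDescription) + "</dc:description>",
    "    <dc:identifier>" + escapeXml(record.doi) + "</dc:identifier>",
    "    <dc:identifier>" + escapeXml(landingPage) + "</dc:identifier>",
    "    <dc:date>" + String(record.publicationYear) + "</dc:date>",
    "    <dc:creator>" + escapeXml(record.creator) + "</dc:creator>",
    "    <dc:type>dataset</dc:type>",
    "    <dc:publisher>" + escapeXml(record.publisher) + "</dc:publisher>",
    "    <dc:rights>Available to the public.</dc:rights>",
    "</oai_dc:dc>",
  ].join("\n");
}
```

Whether the existing templates escape field values is not stated above; the escaping here is a precaution added for the sketch.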

The DOI minting seems to use (simplifying slightly):

<resource xmlns="http://datacite.org/schema/kernel-4">
    <identifier identifierType="doi">${doi}</identifier>
    <creators>
        <creator>
            <creatorName>${lastName}, ${firstName}</creatorName>
            <givenName>${firstName}</givenName>
            <familyName>${lastName}</familyName>
            <affiliation>${affiliation}</affiliation>
        </creator>
        <!-- Repeated, as needed -->
    </creators>
    <titles>
        <title>${title}</title>
    </titles>
    <publisher>${publisher}</publisher>
    <publicationYear>${publicationYear}</publicationYear>
    <descriptions>
        <description xml:lang="en-us" descriptionType="Abstract">${abstract}</description>
    </descriptions>
    <resourceType resourceTypeGeneral="Dataset">${resourceType}</resourceType>
</resource>

The part that's missing is which parts of PublishedData are pre-populated from existing elements (e.g., is PublishedData.abstract pre-populated with Proposal.abstract?).

Given this, instead of adding a new document, perhaps it would make more sense to update published-data.schema.ts so it records the semantic meaning of the fields in terms of their Dublin Core and/or DataCite Metadata equivalent (using Swagger description annotation?)

nitrosx commented 5 months ago

@paulmillar thank you for the great investigative work. I would need to check the publish-data subsystem and see how to insert the mapping. Do you think that it will need to change frequently, or that an institution would want to customize the field mapping?

paulmillar commented 5 months ago

@nitrosx One of my hopes here is that we can avoid facility-specific customisation. New features would likely be optional (not all facilities would support it), but if SciCat provides a "standard way of doing something" then it becomes easier for facilities to adopt that standard approach.

To give a concrete example, with schema v4.5, DataCite introduced more complete support for DOIs as instrument PIDs. If instruments at a facility were to have DOIs (something in which SciCat might or might not have a role) then we would want a Dataset DOI to have DataCite metadata that links it to the corresponding instrument DOI.
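A sketch of what such a link might look like, expressed as a DataCite-style related identifier in JSON form. The `IsCollectedBy` relation type and `Instrument` resource type are given here as understood from DataCite schema 4.5 and should be checked against the schema documentation; `instrumentLink` and `bareDoi` are hypothetical helper names.

```typescript
// Shape of one related-identifier entry (restricted to the instrument case).
interface RelatedIdentifier {
  relatedIdentifier: string;
  relatedIdentifierType: "DOI";
  relationType: "IsCollectedBy";
  resourceTypeGeneral: "Instrument";
}

// Strip a leading doi.org resolver prefix, leaving the bare DOI name.
function bareDoi(doiOrUrl: string): string {
  return doiOrUrl.replace(/^https?:\/\/(dx\.)?doi\.org\//i, "");
}

// Describe "this dataset was collected by the instrument with this DOI".
function instrumentLink(instrumentDoi: string): RelatedIdentifier {
  return {
    relatedIdentifier: bareDoi(instrumentDoi),
    relatedIdentifierType: "DOI",
    relationType: "IsCollectedBy",
    resourceTypeGeneral: "Instrument",
  };
}
```

The returned object would be added to the dataset's list of related identifiers when the DataCite description is generated.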

Similarly, published data is currently missing the possibility to include ORCID and ROR identifiers in the DataCite metadata description.
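For illustration, a creator element carrying ORCID and ROR identifiers could be generated as in the sketch below. The `Creator` interface and `creatorXml` function are hypothetical, and the optional `orcid` and `rorId` fields are assumptions: PublishedData has no such properties today, which is the gap being described.

```typescript
// Hypothetical creator shape; orcid and rorId do not exist in PublishedData.
interface Creator {
  firstName: string;
  lastName: string;
  affiliation: string;
  orcid?: string; // e.g. "0000-0002-1825-0097"
  rorId?: string; // full ROR URL
}

// Escape XML special characters in text and attribute values.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build a DataCite <creator> element, adding identifiers only when present.
function creatorXml(c: Creator): string {
  const lines: string[] = [];
  lines.push("<creator>");
  lines.push("    <creatorName>" + escapeXml(c.lastName + ", " + c.firstName) + "</creatorName>");
  lines.push("    <givenName>" + escapeXml(c.firstName) + "</givenName>");
  lines.push("    <familyName>" + escapeXml(c.lastName) + "</familyName>");
  if (c.orcid !== undefined) {
    lines.push(
      '    <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org">' +
        escapeXml(c.orcid) +
        "</nameIdentifier>"
    );
  }
  const rorAttributes =
    c.rorId !== undefined
      ? ' affiliationIdentifier="' + escapeXml(c.rorId) + '" affiliationIdentifierScheme="ROR"'
      : "";
  lines.push("    <affiliation" + rorAttributes + ">" + escapeXml(c.affiliation) + "</affiliation>");
  lines.push("</creator>");
  return lines.join("\n");
}
```

Creators without identifiers produce the same element structure as the current DOI-minting template, so the extra fields could be introduced without affecting existing records.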

So, I would imagine, over time, there would be a slow and steady improvement in the metadata that SciCat provides.

In the short term, I think simply providing better documentation on the semantics of the different properties (e.g., OpenAPI/Swagger docs) would be a win.

paulmillar commented 5 months ago

My proposal is that this issue is closed with pull request #1189. This pull request documents the mapping between the SciCat data model (and PublishedData, specifically) and the corresponding Dublin Core and DataCite schemata ... so, job done.

We can then open another issue specifically about extending the PublishData API to support clients requesting a description of the dataset under different metadata models (e.g., DataCite, Dublin Core, Schema.org). That new issue could refer to this issue (citing it as inspiration), so the above comments are not lost.