SciCatProject / scicat-backend-next

SciCat Data Catalogue Backend
https://scicatproject.github.io/documentation/
BSD 3-Clause "New" or "Revised" License

Add techniques in DOI metadata #1518

Open paulmillar opened 5 days ago

paulmillar commented 5 days ago

Summary

The DataCite metadata standard can record the experimental technique used to produce a dataset. However, SciCat does not populate this field, so the DataCite metadata it generates lacks this information.

Note that, although SciCat can store the experimental technique as dataset metadata, this information is not propagated to PublishedData.

Steps to Reproduce

  1. create a dataset, including the experimental technique
  2. trigger publishing the dataset.
  3. observe DOI metadata; e.g., via DataCite API.
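As a rough illustration of step 3, one could query the DataCite REST API (`https://api.datacite.org/dois/<doi>`) and check whether any `subjects` entries are present. The helper names below are hypothetical, not from the SciCat codebase:

```typescript
// Shape of a DataCite subject entry (subset of the DataCite 4.x schema).
interface DataCiteSubject {
  subject: string;
  subjectScheme?: string;
}

// Pure check, usable on the `attributes` object of a DataCite API response.
function hasSubjects(attributes: { subjects?: DataCiteSubject[] }): boolean {
  return (attributes.subjects ?? []).length > 0;
}

// Fetch a DOI record from the DataCite REST API and report whether any
// subject entries are present (requires Node 18+ for the built-in fetch).
async function checkDoi(doi: string): Promise<boolean> {
  const res = await fetch(
    `https://api.datacite.org/dois/${encodeURIComponent(doi)}`,
  );
  const body = await res.json();
  return hasSubjects(body.data.attributes);
}
```

With the current behaviour described below, `checkDoi` would report no subjects for a freshly published SciCat dataset.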

Current Behaviour

The DataCite metadata contains no subject elements.

Expected Behaviour

The DataCite metadata should contain subject element(s) that describe the techniques.

Details

The document ETN-1: Embedding PaNET in DataCite metadata describes how to include PaNET terms within the metadata associated with a DOI.

The document ETN-2: Working with PaNET terms in SciCat describes how to format PaNET terms within SciCat.

Note that (as described in #1192) the DataCite metadata is calculated in two places: scicat-backend-next's published-data.controller.ts and oai-provider-service's openaire-mapper.ts.

Arguably, there should be a single place within the SciCat code that generates DataCite metadata (as described in #1192). While removing this duplicate code (i.e., closing #1192) would benefit this issue, I don't consider #1192 to block it.
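As a sketch of what the mapping might look like, SciCat technique records could be turned into DataCite subject entries. The `subject`, `subjectScheme`, `schemeURI`, and `valueURI` properties are from the DataCite metadata schema; the interface and function names here are hypothetical, and the PaNET scheme URI is an assumption:

```typescript
// SciCat-style technique record: a PID (e.g. a PaNET term URI) plus a label.
interface Technique {
  pid: string;
  name: string;
}

// DataCite subject entry (subset of the DataCite 4.x schema properties).
interface DataCiteSubject {
  subject: string;
  subjectScheme?: string;
  schemeURI?: string;
  valueURI?: string;
}

// Map each technique to a subject entry carrying both the human-readable
// label and the machine-readable PaNET term URI.
function techniquesToSubjects(techniques: Technique[]): DataCiteSubject[] {
  return techniques.map((t) => ({
    subject: t.name,
    subjectScheme: "PaNET",
    schemeURI: "http://purl.org/pan-science/PaNET/PaNET", // assumed ontology URI
    valueURI: t.pid,
  }));
}
```

The details of the expected encoding are given in ETN-1, which should take precedence over this sketch.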

nitrosx commented 4 days ago

@paulmillar thanks for opening the issue. Given that PublishedData can contain one or more datasets, what would you do if multiple datasets with different techniques are present? Would you add a list of techniques to publishedData and then propagate all of them to DataCite?

paulmillar commented 4 days ago

Hi @nitrosx,

Yes, this is certainly a valid question. I've spent a little time thinking about this, but haven't come to a strong opinion.

One could argue that each technique describing the publishedData indicates that at least some of the data within the publishedData was taken with that technique. Under that interpretation, the publishedData techniques would be the union of all techniques in its member datasets.

Alternatively, one could argue that the publishedData techniques should describe all the datasets being published, since the publishedData describes all of those datasets. With this interpretation, the publishedData techniques are the intersection of all techniques in the member datasets.

Yet a third option is that the selection is context-driven. Why is a DOI being generated? The answer might suggest that some techniques (from the union) be included and others ignored. This would be a more nuanced approach, one that would likely require human input.

In practical terms, I would suggest taking the first option (use the union of techniques from member datasets) as an initial version.
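The two set operations above can be sketched as follows, treating each member dataset as a list of technique PIDs (function names are hypothetical):

```typescript
// Union: a technique appears if any member dataset used it.
function unionOfTechniques(datasets: string[][]): Set<string> {
  return new Set(datasets.flat());
}

// Intersection: a technique appears only if every member dataset used it.
function intersectionOfTechniques(datasets: string[][]): Set<string> {
  if (datasets.length === 0) return new Set();
  return datasets
    .map((d) => new Set(d))
    .reduce((acc, s) => new Set([...acc].filter((t) => s.has(t))));
}
```

For example, for member datasets with techniques `["A", "B"]` and `["B", "C"]`, the union is `{A, B, C}` and the intersection is `{B}`.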

A subsequent update could be to present the list of techniques in the web UI, to allow the user to choose/veto techniques, as appropriate.