CSCfi / metadata-submitter

Metadata Submission Interface for SDA
https://metadata-submitter.rtfd.io
MIT License

Map schemas between Metax Dataset schema and SD Submit schemas Datacite, Study and Dataset #327

Open genie9 opened 2 years ago

genie9 commented 2 years ago

Description

Metadata-submitter's Datacite, Study and Dataset schemas need to be mapped to Metax Dataset schema: https://raw.githubusercontent.com/CSCfi/metax-api/master/src/metax_api/api/rest/v2/schemas/att_dataset_schema.json.

Tasks

DoD

Testing

genie9 commented 2 years ago

Had a conversation on possible field mappings with @heikkil, @fmorelloCSC, and @blankdots.

Already mapped fields:

| SD Submit datacite or object | Metax research_dataset |
| --- | --- |
| DOI | preferred_identifier |
| title (object) | title |
| description (dataset) | description |
| abstract (study) | description |
| "CSC Sensitive Data Services for Research" | publisher |
| creators | creator |
| "restricted" | access_rights |

Will be mapped with this ticket:

| SD Submit datacite or object | Metax research_dataset |
| --- | --- |
| dates Updated | modified |
| dates Issued | issued |
| dates Collected | temporal |
| keywords | keyword |
| alternateIdentifiers | other_identifier |
| contributors (other than Rights Holder, Data Curator, Distributor) | contributor |
| contributors Rights Holder | rights_holder |
| contributors Data Curator | curator |
| type (study/dataset) | theme |
| language | language |
| geoLocations | spatial |
| sizes (MUST standardize to bytes) | total_remote_resources_byte_size |
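
To illustrate the three contributor rows, here is a minimal sketch of how contributors could be routed to the Metax fields by type. The function name, the `contributorType` and `name` keys, and the exact type spellings are assumptions for illustration, not taken from the SD Submit schema. Distributor is simply skipped here because the table only says it is excluded from `contributor`.

```python
from collections import defaultdict

# Contributor types that the table routes to a dedicated Metax field.
# The spellings follow this thread; the real schema enum may differ.
DEDICATED_FIELDS = {
    "Rights Holder": "rights_holder",
    "Data Curator": "curator",
}
# Excluded from `contributor`; the thread does not say where it should go.
UNMAPPED_TYPES = {"Distributor"}


def split_contributors(contributors):
    """Group contributors by the Metax field they would map to."""
    result = defaultdict(list)
    for person in contributors:
        kind = person.get("contributorType")  # assumed key name
        if kind in UNMAPPED_TYPES:
            continue
        # Every other type falls back to the generic `contributor` field.
        result[DEDICATED_FIELDS.get(kind, "contributor")].append(person)
    return dict(result)


example = [
    {"name": "A", "contributorType": "Rights Holder"},
    {"name": "B", "contributorType": "Data Curator"},
    {"name": "C", "contributorType": "Researcher"},
    {"name": "D", "contributorType": "Distributor"},
]
mapped = split_contributors(example)
```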

Need some clarification:

Fields that could be implemented with further SD Submit versions:

SD Submit

metadata_version_identifier version_info version_notes bibliographic_citation
provenance

genie9 commented 2 years ago

@fmorelloCSC, @heikkil, @blankdots Some pondering on Datacite dates from the Metax perspective: dates are submitted as an array, with each date having a date type, e.g. Issued, Updated, Collected. Issued and Updated are mapped as a date string in Metax, but SD Submit treats them as arrays, thus the same date_type can be present several times. Is it something that can happen deliberately or would it be a mistake?

genie9 commented 2 years ago

@blankdots, @fmorelloCSC, @heikkil The schema reference for Datacite subjects, https://support.datacite.org/docs/datacite-metadata-schema-v44-recommended-and-optional-properties#6-subject, does not force the use of any specific schema for the field of science. Could it be possible to just take into use the schema used by Metax https://metax.fairdata.fi/es/reference_data/field_of_science/_search?pretty=true&size=100?

blankdots commented 2 years ago

@genie9

> Could it be possible to just take into use the schema used by Metax

Why not both?

> Issued, Updated, Collected. Issued and Updated are mapped as a date string in Metax but SD Submit treats them as arrays thus the same date_type can be present several times. Is it something that can happen deliberately or would it be a mistake?

That is derived from the Datacite schema, and it seems it is allowed, so it is deliberate.

blankdots commented 2 years ago

about

> Could this be link to REMS?

We will need to integrate with REMS/SD-Apply in https://github.com/CSCfi/metadata-submitter/issues/291 so that we can generate the workflow needed for that link. Relevant info on that is available at: https://github.com/CSCfi/rems/blob/master/docs/linking.md#linking-into-a-new-application

The end link will look like `https://rems-demo.rahtiapp.fi/apply-for?resource=<datacite_doi>`, where `<datacite_doi>` is the URL of the Datacite DOI for the dataset.
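
As a rough sketch of how such a link could be assembled, the snippet below appends the DOI URL as the `resource` query parameter. The function name and the example DOI are hypothetical, the host is the demo instance from the comment above, and percent-encoding the value is an assumption about what REMS accepts rather than something confirmed in this thread.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Demo host taken from the comment above; a production host would differ.
REMS_APPLY_URL = "https://rems-demo.rahtiapp.fi/apply-for"


def rems_apply_link(datacite_doi_url: str) -> str:
    """Build an apply-for link with the DOI URL as the `resource` parameter."""
    # urlencode percent-encodes the DOI URL so it is safe inside a query string.
    return f"{REMS_APPLY_URL}?{urlencode({'resource': datacite_doi_url})}"


# Hypothetical DOI, for illustration only.
link = rems_apply_link("https://doi.org/10.1234/example")

# Decoding the query string gives back the original DOI URL.
decoded = parse_qs(urlsplit(link).query)["resource"][0]
```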

genie9 commented 2 years ago

> > Could it be possible to just take into use the schema used by Metax
>
> Why not both?

Mainly to avoid the overhead of too many form fields on the same subject, especially when the Metax-related one is the one WE mostly need. But is there some history why the current FOS classification was chosen in the beginning for Datacite subjects?

blankdots commented 2 years ago

> But is there some history why the current FOS classification was chosen in the beginning for Datacite subjects?

It is the Datacite default. OK, then feel free to do a PR and propose the necessary changes.

genie9 commented 2 years ago

> > But is there some history why the current FOS classification was chosen in the beginning for Datacite subjects?
>
> It is the Datacite default. OK, then feel free to do a PR and propose the necessary changes.

OK... I will do that.

genie9 commented 2 years ago

> > Issued, Updated, Collected. Issued and Updated are mapped as a date string in Metax but SD Submit treats them as arrays thus the same date_type can be present several times. Is it something that can happen deliberately or would it be a mistake?
>
> That is derived from the Datacite schema, and it seems it is allowed, so it is deliberate.

@blankdots, @fmorelloCSC, @heikkil

Then another question arises: Metax takes in only one date each for issued and updated. Should we then use:

* `issued`: chronological first appearance
* `modified`: chronological last appearance

blankdots commented 2 years ago

> Then another question arises: Metax takes in only one date each for issued and updated. Should we then use:
>
> * `issued`: chronological first appearance
> * `modified`: chronological last appearance

imo that seems reasonable.
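
The rule agreed on above could be sketched roughly as follows. The function name and the `dateType` and `date` keys are assumptions about how the Datacite dates array is stored, and the sketch assumes every value is a single `YYYY-MM-DD` day rather than a date range.

```python
from datetime import date


def pick_metax_dates(dates):
    """Collapse repeated Datacite date types into single Metax values."""
    by_type = {}
    for entry in dates:
        # Assumed keys: "dateType" and an ISO "date" string (YYYY-MM-DD).
        parsed = date.fromisoformat(entry["date"])
        by_type.setdefault(entry["dateType"], []).append(parsed)

    result = {}
    if "Issued" in by_type:
        # issued: chronologically first appearance
        result["issued"] = min(by_type["Issued"]).isoformat()
    if "Updated" in by_type:
        # modified: chronologically last appearance
        result["modified"] = max(by_type["Updated"]).isoformat()
    return result


picked = pick_metax_dates([
    {"dateType": "Issued", "date": "2021-06-01"},
    {"dateType": "Issued", "date": "2020-03-15"},
    {"dateType": "Updated", "date": "2021-07-01"},
    {"dateType": "Updated", "date": "2022-01-10"},
    {"dateType": "Collected", "date": "2019-11-30"},
])
```

Collected is left out of the result because it maps to `temporal`, which is not limited to a single date in this discussion.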

genie9 commented 2 years ago

@heikkil, @fmorelloCSC, and @blankdots New updates on fields and mapping possibilities

| SD Submit datacite or object | Possible Metax research_dataset field | Consideration |
| --- | --- | --- |
| type (study/dataset) | theme | Cannot be mapped as is, because Metax has a predefined collection from http://www.yso.fi/onto/koko/. I think we should drop this mapping. |
| language | language | Cannot be mapped as is, because Metax has a predefined collection from http://lexvo.org/id/. Lexvo describes over 7k languages, so it's a huge collection. We have just over 200 enums now in the Datacite schema. Should we just drop the language mapping? |
| dates Updated / Issued / Collected | modified / issued / temporal | Have to be validated to format YYYY-MM-DD |
| contributors Rights Holder | rights_holder | This could be an organization in the future, but for now it is mapped as Person |
| sizes | total_remote_resources_byte_size | The sizes schema is an array on the Submitter side and an integer on the Metax side. The submitter will be able to provide file sizes after integration with SD Connect or another file upload service |

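
For the date validation mentioned in the table, a minimal check for the `YYYY-MM-DD` format could look like the sketch below. The function name is hypothetical, and this is only one way to do it; the actual validation could equally live in the JSON schema.

```python
from datetime import datetime


def is_valid_ymd(value: str) -> bool:
    """Return True only for a real calendar date written as YYYY-MM-DD."""
    # strptime also accepts unpadded input such as "2022-1-5",
    # so require the full ten-character, zero-padded form first.
    if len(value) != 10:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


checks = {
    "2022-01-05": is_valid_ymd("2022-01-05"),
    "2022-1-5": is_valid_ymd("2022-1-5"),
    "2022-02-30": is_valid_ymd("2022-02-30"),
    "05.01.2022": is_valid_ymd("05.01.2022"),
}
```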
genie9 commented 2 years ago

The easiest mappings have been merged to the metax-integration branch. There is still work with fields:

which look doable but need some more attention.

These will be added with separate PRs in the near future.

genie9 commented 2 years ago

The field in Metax remote resource cannot be used as a link to SD Apply, because the dataset access rights would need to be open in that case. See https://wiki.eduuni.fi/display/cscfairdata/REMS+in+Sensitive+Data+Service

@teemukataja @juhtornr Disclaimer: the datasets which were added by hand to Etsin (e.g. https://etsin.fairdata.fi/dataset/335a6e92-5366-473a-b239-f9e52f204f9d) have the link to SD Apply, but that is a bug: https://jira.eduuni.fi/browse/CSCFAIRMETA-1453