IQSS / dataverse

Open source research data repository software
http://dataverse.org

Some Dataverse metadata fields seem not to be indexed correctly by DataCite #7072

Open philippconzett opened 4 years ago

philippconzett commented 4 years ago

Recently, I minted a DOI for a sub-dataverse / collection within DataverseNO using the DataCite Fabrica service (https://doi.datacite.org/). By accident, I discovered that some Dataverse metadata fields seem not to be harvested/indexed correctly by DataCite. Here is how I discovered this issue: In the DOI section of DataCite Fabrica, I selected the DataCite metadata record of an existing dataset that was published in DataverseNO. I clicked the Update DOI (Form) button to see the details of the metadata record. Scrolling through the DataCite metadata record and comparing it with the metadata record of the corresponding dataset in DataverseNO, I noted the issues below. I guess they are due to (a) issues in Dataverse, (b) issues in DataCite, or (c) a combination of (a) and (b). In the case of (a), I suggest opening a separate GitHub issue for each one.

REQUIRED PROPERTIES

Affiliation: According to the help text, Affiliation names and identifiers are provided by the Research Organization Registry (ROR). I suggest that affiliation fields and other Dataverse metadata fields (potentially) containing the name of a research organization also fetch their values from ROR.

Resource Type General: The default Resource Type General for resources published in a Dataverse repository is Dataset. I suggest introducing two more types. (1) The first one is Collection, which may be applied to (sub-)dataverses. Currently, it is possible to mint a DOI for a sub-dataverse, but only manually in DataCite Fabrica. I suggest that this feature should also be a built-in option when publishing a dataverse. (2) The second Resource Type we need is File (or Part of Dataset); see existing GitHub issue #5086.

RECOMMENDED PROPERTIES

Subjects: No values are registered in this field. In a recent blog post, DataCite writes that they are using the OECD Fields of Science classification, which according to them is the most widely used generic classification scheme. The Dataverse community has previously discussed other vocabularies, including FAST (see this Dataverse Google Group post). Given the DataCite recommendations, I suggest that Dataverse goes for the OECD classification. I also suggest that, once the OECD classification is adopted, a script be created that replaces the Subject values in existing datasets with the corresponding OECD values.

Contributors: Here, I'd expect to find the values from the Dataverse Contributor field, but I only see two values: Contact person and Producer, whereas the DataverseNO metadata record of the corresponding dataset has two Contributor entries: Data Collector and Data Curator. Also, DataCite supports a Name Identifier, which "uniquely identifies an individual or legal entity, according to various schemas, e.g. ORCID, ROR or ISNI". I suggest that Dataverse also introduce this support. See my comment above about ROR.

Geolocation: No values are registered in this field, whereas in the dataset in DataverseNO, both Geographic Coverage (Country = Norway) and Geographic Bounding Box (coordinates for Norway) are provided.

OPTIONAL PROPERTIES

Language: No values are registered in this field, whereas in the dataset in DataverseNO, the field Language contains the value English.

Rights: No values are registered in this field, whereas in the dataset in DataverseNO, default CC0 is selected / left unchanged.

Version: No values are registered in this field, whereas the current version of the dataset in DataverseNO is V2.

Funding References: No values are registered in this field, whereas the corresponding dataset in DataverseNO has two entries in the field Grant Information.
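As a rough illustration of the comparison described above, the sketch below lists which of these properties are empty in a DataCite record. It assumes the record's attributes are already available as a Python dictionary using the attribute names of the DataCite REST API JSON (`subjects`, `contributors`, `geoLocations`, `language`, `rightsList`, `version`, `fundingReferences`). The function name `empty_properties` and the sample record are invented for this example and are not taken from a real dataset.

```python
# Properties from the list above, using DataCite REST API JSON attribute names.
CHECKED_PROPERTIES = [
    "subjects",
    "contributors",
    "geoLocations",
    "language",
    "rightsList",
    "version",
    "fundingReferences",
]


def empty_properties(attributes):
    """Return the checked property names that are missing or empty."""
    return [name for name in CHECKED_PROPERTIES if not attributes.get(name)]


# Invented sample record, for illustration only.
sample = {
    "titles": [{"title": "Example dataset"}],
    "contributors": [{"name": "Doe, Jane", "contributorType": "ContactPerson"}],
    "language": None,
    "rightsList": [],
}

for name in empty_properties(sample):
    print(f"No value registered for: {name}")
```

Running the same check against the DataCite record and against the repository's own export of a dataset would show which fields are lost on the way.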

jggautier commented 3 years ago

Some of this is related to https://github.com/IQSS/dataverse/issues/5889

mheppler commented 3 years ago

Related? Silent publishing failure when not all fields required by Datacite are present #7551

valentinapasquale commented 3 years ago

Hello @philippconzett, hello everybody, do you know if there is any plan (or open issue) about adopting the OECD Fields of Science classification as controlled vocabulary in the subject field of the citation metadata block? Thanks for the help!

qqmyers commented 3 years ago

FWIW: The general topic of supporting controlled vocabularies is being discussed in the Controlled Vocabulary Value (CVV) metadata working group. There will be a meeting next Thursday at 9 AM EDT to discuss the latest progress on this topic. To foreshadow a bit, the work to support connections to SKOSMOS servers that’s been done is a big part of what’s needed to support ‘any’ vocabulary and we’re trying to work out a concrete proposal for how one would associate a vocabulary with a given field, how Dataverse would store the value, and how display will be handled.

Using OECD Fields of Science in particular would then require hosting those terms on a SKOSMOS server somewhere (or supporting retrieval of them by some other means).

In any case, a plug for people interested in this to join us next Thursday! (https://harvard.zoom.us/j/95371438150; the password is pinned in the general channel of dataversecommunity.slack.com, which you can join upon request.)

-- Jim



cmbz commented 9 months ago

2024/01/30 @philippconzett are you still encountering this problem?

pdurbin commented 9 months ago

@cmbz I'll let @philippconzett speak for himself but I'd say "send more data to DataCite" has broad support across the community.

For example, Philipp mentions rights above. The DataCite Commons entry for Harvard Dataverse shows how we don't send rights/license data at all:

[Screenshot, 2024-01-30: DataCite Commons entry for Harvard Dataverse]

He also mentions funding. I'm pretty sure the NIH would like to know which datasets they have funded. If I'm reading the API output right, Dryad has told DataCite about 264 datasets funded by the NIH. Because Dataverse doesn't send any funding information to DataCite, an equivalent search for Harvard Dataverse datasets funded by the NIH returns zero results. I got these API calls from slide 5 of a presentation by Matt Buys. See also some Slack discussion.
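The API calls referred to above are not reproduced in this thread. As a hedged sketch only, a comparable query against the public DataCite REST API could be built as follows. The `/dois` endpoint with `client-id`, `query` and `page[size]` parameters is what this sketch relies on; the search field `fundingReferences.funderName`, the helper name `funder_count_url` and the client ID `example.repo` are assumptions made for illustration and may differ from the calls on the slide.

```python
from urllib.parse import urlencode

DATACITE_DOIS_ENDPOINT = "https://api.datacite.org/dois"


def funder_count_url(client_id, funder_name):
    """Build a DataCite /dois query URL for one repository and one funder name.

    Assumption: the funder name is searchable via the query-string field
    `fundingReferences.funderName`.
    """
    params = {
        "client-id": client_id,
        "query": f'fundingReferences.funderName:"{funder_name}"',
        # A single record is enough; the total is expected in the response's meta section.
        "page[size]": 1,
    }
    return f"{DATACITE_DOIS_ENDPOINT}?{urlencode(params)}"


# "example.repo" is a placeholder client ID, not a real repository account.
print(funder_count_url("example.repo", "National Institutes of Health"))
```

Fetching that URL for two repositories and comparing the reported totals would reproduce the kind of comparison described above, provided the assumed field name is correct.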

Anyway, those are just two examples. We already have some issues going for funding. I'm not sure about rights/licensing. As Philipp suggested above, perhaps separate issues are the way to go. I'm pretty sure it's all GREI-related, given that DataCite is a full member.

cmbz commented 9 months ago

@pdurbin Right! I was thinking less generically "send more metadata to DataCite", which we plan to address substantively in GREI years 3 and 4 (as you mentioned), and more specific metadata fields that we could prioritize in the scope of that planned work. Since the issue is several years old, I wasn't certain if some of these elements had already been addressed.

philippconzett commented 9 months ago

Thanks, @cmbz and @pdurbin. I agree with Phil that solving these issues would be of high value for many if not all of our community members, since delivering complete and correct metadata to DataCite is at the core of making data findable.

pdurbin commented 9 months ago

To me it would make sense to create a few issues about the planned work before closing this one. That way people who are interested in these features can follow the new issues.

cmbz commented 9 months ago

@pdurbin and @philippconzett Right. Sorry for the confusion. I wasn't planning to close any issues without discussion. Just working to gather all outstanding to-do items on this topic into the GREI epics that are being defined so the work can be defined, planned, and worked on during Years 3 & 4.

jeromeroucou commented 1 month ago

We have also noticed in the Recherche Data Gouv repository that contributors are not present in the metadata viewable in DataCite, as indicated by @philippconzett for DataverseNO. However, we have more values than Data Collector and Data Curator: we have taken all the values from DataCite's controlled list of contributorType. We also have one value that is not available in DataCite: Metadata author.

In the case of a value not supported by DataCite, would it be better to send nothing, or to replace the value with Other?

qqmyers commented 1 month ago

FYI: #10632 does include contributor information in what is sent to DataCite. However, the code doesn't currently check for contributor types that are not in DataCite's controlled list and I expect the code as it is now will fail when submitting records to DataCite when a contributor type that is not in the DataCite list is used. Such a check could probably be added. I'd suggest that any types not in DataCite's list get mapped to Other rather than being dropped (contributor type is required in their 4.5 schema, so dropping would mean dropping the contributor overall and not just leaving the type blank).
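A minimal sketch of the suggested check, written in Python for brevity: this is not the code in #10632, and the function name `to_datacite_contributor_type` is hypothetical. The set below is the contributorType controlled list as of DataCite schema 4.5. Any label that does not match an entry after whitespace is removed and case is ignored falls back to Other, so the contributor is kept rather than dropped.

```python
# contributorType controlled list as of DataCite schema 4.5.
DATACITE_CONTRIBUTOR_TYPES = {
    "ContactPerson", "DataCollector", "DataCurator", "DataManager",
    "Distributor", "Editor", "HostingInstitution", "Producer",
    "ProjectLeader", "ProjectManager", "ProjectMember",
    "RegistrationAgency", "RegistrationAuthority", "RelatedPerson",
    "Researcher", "ResearchGroup", "RightsHolder", "Sponsor",
    "Supervisor", "WorkPackageLeader", "Other",
}

# Case-insensitive lookup from a normalized label to the DataCite spelling.
_LOOKUP = {name.lower(): name for name in DATACITE_CONTRIBUTOR_TYPES}


def to_datacite_contributor_type(repository_label):
    """Map a repository contributor label to a DataCite contributorType.

    Labels such as "Data Collector" are matched after removing whitespace;
    anything not in the controlled list becomes "Other".
    """
    normalized = "".join((repository_label or "").split()).lower()
    return _LOOKUP.get(normalized, "Other")


for label in ["Data Collector", "Data Curator", "Metadata author"]:
    print(label, "->", to_datacite_contributor_type(label))
```

With this fallback, a repository-specific value such as Metadata author would be sent as Other instead of causing the submission to fail.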