IQSS / dataverse

Open source research data repository software
http://dataverse.org

Improve datacite metadata #8108

Open tcoupin opened 3 years ago

tcoupin commented 3 years ago

Currently, the following fields are sent when creating a DataCite DOI:

Other fields could be added or improved using the current object model of datasets and datafiles: language, geolocations, related items and identifier for dataset...

kjgarza commented 3 years ago

Over the last couple of weeks, we have been analyzing the metadata of many DOIs created using Dataverse instances (22 different instances). We were looking specifically at DOI metadata fields such as subject, size, and format, and we noticed that even when the information to populate these fields is present on the landing page of the resource (see picture below for 10.7910/DVN/YHMQPS), the public metadata deposit rarely included it (see API response or Fabrica).

image

We found that more than 99% of DOI metadata deposits (n=789,701) created with Dataverse didn't include these metadata fields (see charts below).

image

Improving this would be incredibly important. Hopefully, this work can be prioritized.
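As a rough illustration of the kind of check described above, here is a minimal Python sketch that reports which of the three fields are absent or empty in a DOI record. It assumes the record has the JSON shape returned by the DataCite REST API (`data.attributes.subjects`, `sizes`, `formats`); the helper names and the sample record are hypothetical and are not taken from the analysis itself.

```python
import json
from urllib.request import urlopen

# Fields the comment above found missing from most Dataverse deposits.
FIELDS_OF_INTEREST = ("subjects", "sizes", "formats")


def missing_fields(doi_record: dict) -> list[str]:
    """Return the fields of interest that are absent or empty in a DOI record."""
    attributes = doi_record.get("data", {}).get("attributes", {})
    return [name for name in FIELDS_OF_INTEREST if not attributes.get(name)]


def fetch_doi_record(doi: str) -> dict:
    """Fetch a DOI record from the DataCite REST API (needs network access)."""
    with urlopen(f"https://api.datacite.org/dois/{doi}") as response:
        return json.load(response)


# Hypothetical, hand-written record shaped like an API response (not real data).
sample_record = {
    "data": {
        "id": "10.1234/example",
        "attributes": {
            "subjects": [{"subject": "Social Sciences"}],
            "sizes": [],
            "formats": [],
        },
    }
}
print(missing_fields(sample_record))  # ['sizes', 'formats']
```

To check a real deposit, something like `missing_fields(fetch_doi_record("10.7910/DVN/YHMQPS"))` should work, assuming the API is reachable.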

jggautier commented 3 years ago

I'm just commenting to mention that this issue is related to discussions in some other GitHub issues, including https://github.com/IQSS/dataverse/issues/5889.

marcomarsella commented 2 years ago

Currently, the following fields are sent when creating datacite DOI : ...

  • relatedIdentifiers (datafile) ... Other fields can be added or improved with the current object model of datasets and datafiles: language, geolocations, related items and identifier for dataset...

Can you please elaborate on "relatedIdentifiers"? We have been begging for the possibility of adding a number (sometimes 100s or 1,000s) of related DOIs when depositing a dataset in Dataverse. Is this now supported or considered? When you say "datafile" does it mean you can upload a datafile? If yes, how should it be formatted?

Our use case is adding a list of "referenced" DOIs to a dataset so that, when a DOI is assigned to it, the list of referenced DOIs is fed to DataCite as relatedIdentifiers. It should then go to EventData or PIDGraph and become discoverable by us.

jggautier commented 2 years ago

Hi @marcomarsella. By "relatedIdentifiers (datafile)", I think @tcoupin meant the Dataverse software is already automatically sending DataCite information about each file that has a PID, using the "hasPart" relationType to say that the dataset contains the files:

For example, in this screenshot of a Datacite.xml export, DataCite would be told that the file with the DOI doi:10.70122/FK2/GFBLSO/S7JA5J is part of the dataset: [screenshot, 2022-05-18]

This happens as long as the file is given a PID, which I think isn't the case in most repositories, including Harvard's. For example, file info is missing from the DataCite export at https://dataverse.harvard.edu/api/datasets/export?exporter=Datacite&persistentId=doi%3A10.7910/DVN/TMUQHM because that dataset's files aren't being given PIDs.
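To see which related identifiers an export actually carries, a small Python sketch like the following could be used. It parses a DataCite XML document and lists each `relatedIdentifier` with its `relationType`. The XML fragment is a hand-written, hypothetical stand-in for a real export (only the file DOI comes from the example above), and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like a Datacite export; not copied from a real one.
SAMPLE_EXPORT = """\
<resource xmlns="http://datacite.org/schema/kernel-4">
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="HasPart">doi:10.70122/FK2/GFBLSO/S7JA5J</relatedIdentifier>
  </relatedIdentifiers>
</resource>
"""


def related_identifiers(datacite_xml: str) -> list[tuple[str, str]]:
    """Return (relationType, identifier) pairs found in a DataCite XML document."""
    root = ET.fromstring(datacite_xml)
    pairs = []
    for element in root.iter():
        # Compare local names so the code does not depend on the schema namespace.
        if element.tag.rsplit("}", 1)[-1] == "relatedIdentifier":
            identifier = (element.text or "").strip()
            pairs.append((element.get("relationType", ""), identifier))
    return pairs


print(related_identifiers(SAMPLE_EXPORT))
# [('HasPart', 'doi:10.70122/FK2/GFBLSO/S7JA5J')]
```

An export with no file-level PIDs would simply yield an empty list.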

I feel your pain about needing to send to DataCite identifiers of objects related to the datasets your repository is publishing. The most recent public discussion about this is in the PR at https://github.com/IQSS/dataverse/pull/8357, although I think the discussion in the PR kind of died since and it's only about the Related Publication field, so it doesn't address your other concerns about being able to send DataCite information about other types of research objects, like genetic plant material.

The only update I can give about work toward sending DataCite more metadata, including info about related research objects that can be used in the EventData database you mentioned, is that some folks on the Dataverse team at IQSS are in a recently formed group with leaders of the MakeDataCount standard and other repositories, like Dryad and Figshare, and we're discussing how to improve adoption of MDC. I'm hoping that being able to talk with the MDC folks and with other repositories will help us prioritize working on the design issues that we've talked about in other GitHub issues, including how best to send info to DataCite about related research objects.

mjbuys commented 2 years ago

@jggautier @marcomarsella For clarity, DataCite is co-leading the MDC3 project and will be taking the MDC initiative forwards following the project. It would be great to set up a call with our teams as adding relatedIdentifiers for all research outputs and resources is possible. There are many different downstream use cases (e.g. https://pidnotebooks.org/ that we developed) and additional work such as DMP IDs (e.g. https://support.datacite.org/docs/link-dmp-ids-to-other-resources).

mjbuys commented 2 years ago

@jggautier please could you confirm the DOI used in the screenshot above? I cannot seem to find 10.70122/FK2/GFBLSO/S7JA5J, which is also the relatedIdentifier shown in the screenshot. In theory, if the relatedIdentifiers are added on either DOI, this should be retrievable, so we can have our team look into where the issue is.

jggautier commented 2 years ago

Hi @mjbuys. I'll try to help with your second comment first.

The DOI in the screenshot in my comment yesterday is a fake DOI, used in the Demo Dataverse to demonstrate and test the Dataverse software. That relatedIdentifier is about a file within a dataset, which I included to try to answer @marcomarsella's question about what was meant by the line "relatedIdentifiers (datafile)" in this GitHub issue's original post. That original post was meant to list all of the information that the Dataverse software currently sends to DataCite when registering DOIs and mentions some other information that could be sent.

I think that line, "relatedIdentifiers (datafile)", is not related to @marcomarsella's concerns about sending to DataCite information (e.g. relatedIdentifiers and relationTypes) about research outputs and resources that are related to the datasets being published in Dataverse repositories. But in case it's helpful, another example of how the Dataverse software sends to DataCite information about the files within a dataset, using DOIs that point to the files instead of fake DOIs, can be seen from the DataCite xml export of a dataset in the Dartmouth Dataverse.

I would be happy to help set up and join a call about sending to DataCite information about research outputs and resources that are related to the datasets being deposited into Dataverse repositories. Do you work with @marcomarsella or with DataCite (or both)? (I do UX research for the Dataverse software and help support Harvard's Dataverse installation.) I would consider the goal of such a meeting to be about clarifying the discussion in this GitHub issue and other GitHub issues that are more focused on sending DataCite information about related research outputs and resources.

marcomarsella commented 2 years ago

Hi @jggautier, thank you for your reply. By looking at your example, I see a bunch of

<relatedIdentifier relatedIdentifierType="DOI" relationType="HasPart">doi:10.21989/D9/LEQDTS/K4GX4Q</relatedIdentifier>

that are so close to what we need:

<relatedIdentifier relatedIdentifierType="DOI" relationType="References">doi:10.18730/112ZH5</relatedIdentifier>

The only difference, besides the DOI obviously, is the "References" relationType. It would be great to use the existing mechanism to add more children to the RelatedIdentifiers to do what we need...
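A minimal Python sketch of what that could look like: one `relatedIdentifiers` container holding both the existing "HasPart" entries and additional "References" entries. This only illustrates the XML shape discussed here; it is not how the Dataverse software builds its export, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_related_identifiers(relations: list[tuple[str, str]]) -> str:
    """Serialize (relationType, DOI) pairs as a relatedIdentifiers element."""
    container = ET.Element("relatedIdentifiers")
    for relation_type, doi in relations:
        child = ET.SubElement(
            container,
            "relatedIdentifier",
            {"relatedIdentifierType": "DOI", "relationType": relation_type},
        )
        # Strip stray whitespace so the identifier is emitted cleanly.
        child.text = doi.strip()
    return ET.tostring(container, encoding="unicode")


xml_fragment = build_related_identifiers(
    [
        ("HasPart", "doi:10.21989/D9/LEQDTS/K4GX4Q"),
        ("References", "doi:10.18730/112ZH5"),
    ]
)
print(xml_fragment)
```

Because each pair carries its own relationType, a list of hundreds or thousands of referenced DOIs could be passed in the same way.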

mjbuys commented 2 years ago

@jggautier yes, I work at DataCite, and @marcomarsella is both a DataCite member (FAO) and a board member. From the example provided, it seems the discussion is about which relationType is best (HasPart versus References). It would probably be good to set up a call to discuss the use case, and our team can advise on the best way forward. Let me know who should be included on the call.

marcomarsella commented 2 years ago

@mjbuys I believe that a single relationType cannot work. The current use of HasPart seems correct, while our use case requires a different relationType. I vote for References, which looks to me more appropriate than Cites in this context.

jggautier commented 2 years ago

Ah, thanks. I agree, part of the discussion is about which relationTypes to use and how to add them to the identifiers of related research outputs (e.g. whether to use only one relationType and which one or let depositors choose and how they would choose). These discussions are mostly in the GitHub issues at https://github.com/IQSS/dataverse/issues/2778 and https://github.com/IQSS/dataverse/issues/5277. But other related discussions include how to get and send to DataCite identifiers of other types of research objects, such as other datasets and physical objects, which is also discussed in https://github.com/IQSS/dataverse/issues/5277.

Guidance on which relationTypes to use is one of the things we hope will come out of the current meetings I mentioned with folks working on the MakeDataCount standard (e.g. Daniella Lowenberg) and other generalist repositories (e.g. Dryad and Figshare), which are happening as part of the NIH's Generalist Repository Ecosystem Initiative.

Some of the folks I'd recommend to include on a call would be @mreekie, @scolapasta, @lenwiz, @sbarbosadataverse, @TaniaSchlatter, and @qqmyers. More members of the Dataverse community would like to see this happen and I think should be involved in planning/researching/testing, but I think opening a call to anyone interested in the Dataverse community would depend on the scope of the call and how soon the call should happen. A few key folks have their hands full right now with planning for next month's Dataverse Community Meeting.

mjbuys commented 2 years ago

@KellyStathis please could you coordinate a call with the above folks and @marcomarsella? This follows on from the work that you have been doing with the metadata WG and relationType guidance. Let me know if you would like me to join.

KellyStathis commented 2 years ago

If I am understanding correctly, it looks like the only relatedIdentifiers being sent to DataCite are the ones for file-level DOIs with the "HasPart" relationType. I'm not seeing anything in the Dataverse 4+ Metadata Crosswalk (if I have the right version; this was linked in the Appendix).

Given this, I think the idea of adding options to select the relationType (as described in https://github.com/IQSS/dataverse/issues/2778) for Related Publications (as well as Related Datasets and Related Material) is a good one. I also see the PR https://github.com/IQSS/dataverse/pull/8357 as a good first step because it would mirror the OpenAIRE crosswalk. If IsCitedBy is not accurate, I would suggest IsReferencedBy as an alternative. Both of these relationTypes, along with IsSupplementTo, will generate a citation for the dataset in DataCite event data as described here: https://support.datacite.org/docs/relationtype_for_citation. We will be working on improving this documentation shortly.
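As a small sketch of the rule stated above, the following Python snippet checks whether a relationType is one of the three named in this comment as generating a citation. The set contains only the types mentioned here; the DataCite documentation linked above is the authority on the full list, and the names in the code are hypothetical.

```python
# relationTypes named in the comment above as generating a citation
# for the dataset in DataCite event data (not necessarily exhaustive).
CITATION_RELATION_TYPES = frozenset({"IsCitedBy", "IsReferencedBy", "IsSupplementTo"})


def counts_as_citation(relation_type: str) -> bool:
    """Return True if the relationType is one of those named above."""
    return relation_type in CITATION_RELATION_TYPES


for relation_type in ("HasPart", "IsReferencedBy", "IsSupplementTo"):
    print(relation_type, counts_as_citation(relation_type))
```

Under this rule, the file-level "HasPart" links that are sent today would not be counted as citations.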

@jggautier, do you know who should be the new DataCite Service Provider contact at Dataverse now that Danny has left? (Is that you?) I was about to reach out to Len later this week, but this conversation seems relevant. I would be happy to have a call to discuss next steps for sending related identifiers to DataCite; I just want to make sure we have the right folks included! You can also reach me through support@datacite.org and we can coordinate via email.

jggautier commented 2 years ago

Hi @KellyStathis. You're right, the only relatedIdentifiers that the Dataverse software sends to DataCite are the ones for file-level DOIs with the "HasPart" relationType. The crosswalk you linked to is the right version and I've updated most of it in the past couple of weeks. (I agree it's not very clear what version of the software the crosswalk applies to. We could probably start versioning it or something.)

I don't know who should be the new DataCite Service Provider contact at Dataverse now that Danny has left. I think it would be best to reach out to Len to ask.

pdurbin commented 1 year ago

I'd just like to point out that @jggautier wrote a nice doc called "Ideas and questions about sending more metadata to DataCite" following the meeting we had last week (which was coordinated in #2778).

There's also been some recent discussion in #5086 and probably other issues.

The docs Julian has been creating are helping us get on the same page! Thanks, @jggautier!

pdurbin commented 2 weeks ago

More fields are being sent to DataCite in this PR:

pdurbin commented 1 week ago

Merged! Can we close this issue?

jggautier commented 1 week ago

@pdurbin, @cmbz asked me and @scolapasta to review what was done in that PR to see how it covers what was discussed in this and the other GitHub issues that are listed in the PR's first comment.

There are things that I think were not covered. I'm inclined to close this GitHub issue after I understand what was done and why.