IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Send to DataCite the relationType metadata of files that have PIDs #146

Open jggautier opened 2 years ago

jggautier commented 2 years ago

When the Harvard Dataverse Repository started giving DOIs to published files, the metadata it sent to DataCite about each file didn't indicate which dataset the file was a part of.

As part of the Dataverse software version 4.9.2 (https://github.com/IQSS/dataverse/issues/4782), that information was added to the metadata that's sent to DataCite when file DOIs are registered, using the DataCite schema's "IsPartOf" relationType and pointing to each dataset's DOI.
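For reference, here's roughly the shape of that entry as it appears in the JSON that DataCite's REST API returns for a file DOI (the dataset DOI below is made up; in the schema's XML serialization it's a relatedIdentifier element with the same attributes):

```python
# Roughly the shape of the relatedIdentifiers entry on a file DOI record
# returned by DataCite's REST API; the dataset DOI here is a made-up example.
related_identifier = {
    "relationType": "IsPartOf",
    "relatedIdentifierType": "DOI",
    "relatedIdentifier": "10.7910/DVN/EXAMPLE",  # the parent dataset's DOI
}
```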

Version 4.9.2 was released on Aug. 8, 2018. Sometime after that the Harvard repository was upgraded to v4.9.2, and from then on, when new files and new versions of existing files were published, the file metadata that Harvard Dataverse sent to DataCite included that relationType to indicate which dataset each file belonged to.

But it didn't send DataCite updated metadata for files that were published before the v4.9.2 upgrade.

File PIDs were later turned off in the Harvard repository due to performance and cost concerns, and repository managers now turn them on for certain collections when those collections' owners request that their datasets' files be given DOIs.

So for some number of the repository's 34,773 datasets that have one or more files with DOIs (580,132 files in all), DataCite doesn't have the updated metadata that says which datasets those files belong to.

The Elsevier platform DataMonitor gets the metadata of research repositories, including the Harvard repository, from DataCite and uses the relationType information in that metadata to improve its users' search experience by returning results that differentiate between datasets and files. Users of that platform can say "Just show me the datasets associated with a certain author. (Don't show me the files that belong to those datasets.)" But because the Harvard repository hasn't sent DataCite updated metadata that includes that relationType information, the metadata that DataMonitor gets from DataCite doesn't always include it. That means DataMonitor can't always determine which metadata describes a dataset, which describes a file, and which files are part of which datasets, so that information can't be used to improve searches in DataMonitor.

Earlier this month a researcher who uses DataMonitor reported that searches for datasets that return datasets from the Harvard repository also include many files. After figuring out why this was happening, with @landreev's help I was able to send DataCite updated metadata for one dataset and its 105 files. The API endpoint for sending DataCite a dataset's updated metadata works well, but it sends updated metadata for the dataset's files only when the Dataverse installation has file PID registration turned on. File PID registration was turned on briefly so that I could try the endpoint (then turned off a few days later). I confirmed this worked by checking the metadata of those files in DataCite Search. And a product manager and a developer at Elsevier confirmed that once DataMonitor updated its metadata from DataCite, which they said it does daily, the search results in DataMonitor that included that dataset worked the way they expected.
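For anyone else who needs to do this for a single dataset, the call looked something like the sketch below (written with Python's requests instead of curl; the endpoint name and the superuser-token requirement are from memory, so check the native API guide before relying on them):

```python
import requests

SERVER = "https://dataverse.harvard.edu"  # installation base URL
API_TOKEN = "XXXXXXXX"                    # superuser API token (placeholder)
DATASET_ID = 12345                        # the dataset's database id (placeholder)

# Ask Dataverse to re-send the dataset's registration metadata to DataCite.
# While file PID registration is enabled, this also updates the files' records.
response = requests.post(
    f"{SERVER}/api/datasets/{DATASET_ID}/modifyRegistrationMetadata",
    headers={"X-Dataverse-key": API_TOKEN},
)
response.raise_for_status()
print(response.json())
```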

This issue is to track discussion of figuring out how to send DataCite the updated metadata of the files that were published before the Harvard Dataverse Repository started sending the "IsPartOf" relationType metadata. There are also related GitHub issues in the main Dataverse GitHub repository about sending updated metadata to DataCite whenever any change is made to the metadata that the software sends to DataCite upon DOI registration: https://github.com/IQSS/dataverse/issues/5144 and https://github.com/IQSS/dataverse/issues/5505.

Of the 34,773 datasets with file DOIs, I'm not sure how many DataCite needs updated metadata for. I thought it might be all of the files in dataset versions published before v4.9.2 was applied to the Harvard repository, but I found a handful of files in dataset versions published years before the v4.9.2 update that DataCite already has the updated metadata for.

As of July 24, 2024, here's the metadata that DataCite has for a dataset whose files have DOIs, where the metadata doesn't indicate which files belong to the dataset: https://api.datacite.org/dois?query=doi:10.7910/DVN/BJO3AV. (Alternatively, see the metadata for a dataset whose files have DOIs where the metadata does indicate which files belong to the dataset.)
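A quick way to check a single record, and the kind of check I'd want to run across the file DOIs (a sketch, assuming the DataCite REST API JSON shape described above):

```python
import requests

def has_is_part_of(doi: str) -> bool:
    """Return True if DataCite's record for `doi` carries an IsPartOf relation."""
    r = requests.get(f"https://api.datacite.org/dois/{doi}")
    r.raise_for_status()
    related = r.json()["data"]["attributes"].get("relatedIdentifiers") or []
    return any(ri.get("relationType") == "IsPartOf" for ri in related)

# A file DOI from the dataset above would go here; this one is hypothetical.
print(has_is_part_of("10.7910/DVN/BJO3AV/XXXXX"))
```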

The researcher and the folks from Elsevier's DataMonitor platform asked to be notified once the updated metadata of these files is sent to DataCite. It might be easiest to follow up with the researcher using the RT ticket at https://help.hmdc.harvard.edu/Ticket/Display.html?id=315272 and with the folks from DataMonitor using the RT ticket at https://help.hmdc.harvard.edu/Ticket/Display.html?id=315487.

jggautier commented 2 years ago

As far as solutions go, if what's proposed in https://github.com/IQSS/dataverse/issues/5505 were implemented, the software could figure out which metadata needs updating and update it. Then we'd just have to tell the Harvard repository to do that whenever the metadata it sends to DataCite changes.

Or we could turn on file PIDs long enough to run the API endpoint on all 34,773 datasets.

Or I could scrape from DataCite the file metadata of those 580,132 files and query it to figure out which file metadata documents don't have the updated metadata (the "IsPartOf" relationType). Then, knowing which datasets those files belong to, we could turn on file PIDs in the Harvard repository just long enough to run the API endpoint on those datasets, which will hopefully number far fewer than 34,000+.
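Here's a sketch of what that scrape could look like, paging through DataCite's REST API with cursor pagination. The prefix filter is an assumption (that all of the Harvard repository's DOIs live under 10.7910), and at 580,132 records this would need to run for a while and respect DataCite's rate limits:

```python
import requests

API = "https://api.datacite.org/dois"
# Assumption: Harvard Dataverse DOIs all live under the 10.7910 prefix.
params = {"prefix": "10.7910", "page[cursor]": "1", "page[size]": "1000"}

missing = []  # DOIs whose DataCite records have no IsPartOf relation
url = API
while url:
    r = requests.get(url, params=params)
    r.raise_for_status()
    body = r.json()
    for record in body["data"]:
        related = record["attributes"].get("relatedIdentifiers") or []
        if not any(ri.get("relationType") == "IsPartOf" for ri in related):
            missing.append(record["id"])
    # Cursor pagination: follow links.next until it's absent.
    url, params = body.get("links", {}).get("next"), None

print(len(missing), "records lack IsPartOf")
```

One wrinkle: this also sweeps up the dataset DOIs themselves, which legitimately lack IsPartOf, so the results would still need to be cross-checked against the list of file DOIs (for example, from the dvobject table).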

landreev commented 2 years ago

Should it be possible to tell which DataCite records have, or don't have, this field simply based on the time of registration? (If any file registered since 4.9.2 should be expected to have it?)

jggautier commented 2 years ago

Maybe! Do the timestamps in the identifierregistered column in the dvobject table show when the object's PID was registered? I think Jim mentioned something similar and I just forgot.

Even so, I found some files published years before the release of 4.9.2 that DataCite had the updated metadata for, and they weren't among the 105 files whose metadata I ran that API endpoint on. That makes me think that using a timeframe to figure out which file metadata needs to be updated wouldn't be reliable: we'd wind up including some files whose metadata doesn't need to be updated.

landreev commented 2 years ago

identifierregistered is a boolean; it's either registered or not. But globalidcreatetime is the timestamp of the registration.

landreev commented 2 years ago

As for the files that were published before 4.9.2 that had this field... any chance those are from datasets that had more major versions published since then? From what I saw looking at that code a few days ago, when you were updating that one dataset's worth of files, it looked like we update these metadata records every time a major version is published. The timestamp in dvobject is called "globalidcreatetime", but I believe it's actually the time of the last update, not of the initial registration. (I would need to double-check all these statements.)

But maybe the approach should be to 1) select the files with the timestamp > 4.9.2, then 2) scrape the DataCite metadata records for these, and further select only the ones that definitely don't have the field - ?

landreev commented 2 years ago

(meant to say "the timestamp < 4.9.2", sorry)

landreev commented 2 years ago

I also had the wrong column name copy-and-pasted in that comment... damn. (corrected)

jggautier commented 2 years ago

> As for the files that were published before 4.9.2 that had this field... any chance those are from datasets that had more major versions published since then?

When I queried the database, I tried to take into account the publication dates of the dataset versions that the files are in, but I was assuming that we send DataCite updated metadata when either minor or major versions are published. (In my comment from the other issue I didn't specify which type of version publication triggers an update.)

But DataCite Search has the metadata for version 1.6 of the dataset at https://doi.org/10.7910/DVN/DDG8SW. Between versions 1.0 and 1.6 the dataset's title and description changed. Doesn't that mean that updated metadata is also sent to DataCite when minor versions are published? That seems ideal, too.

In any case, using globalidcreatetime sounds a lot more straightforward since it's the time of the last update to DataCite. If the approach you mentioned gets the metadata updated sooner than waiting for the more automated approaches, I'm all for it :)
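For step 1, I'd start from a query like this sketch. It assumes globalidcreatetime behaves the way you described, and it uses the 4.9.2 release date as the cutoff; the date the Harvard repository was actually upgraded to 4.9.2 would be a more accurate boundary:

```python
import psycopg2

# Placeholder connection settings for (a copy of) the production database.
conn = psycopg2.connect("dbname=dvndb user=dvnapp host=localhost")

# Files whose DataCite records were last updated before the 4.9.2 cutoff;
# a DataFile's owner_id in dvobject is the dataset it belongs to.
SQL = """
SELECT f.id AS file_id, f.identifier, f.owner_id AS dataset_id
FROM dvobject f
WHERE f.dtype = 'DataFile'
  AND f.identifierregistered = true
  AND f.globalidcreatetime < %s;
"""

with conn, conn.cursor() as cur:
    cur.execute(SQL, ("2018-08-08",))  # 4.9.2 release date, per above
    rows = cur.fetchall()

dataset_ids = {dataset_id for _, _, dataset_id in rows}
print(len(rows), "candidate files in", len(dataset_ids), "datasets")
```

The resulting file DOIs could then be run through the DataCite check sketched earlier to keep only the ones that definitely lack the field.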

qqmyers commented 2 years ago

FWIW: From a code perspective, it would probably not be much work to change from updating files only if file PIDs are on to updating any file that has a PID. That may take longer than you want for an initial update, but it would make sure any future changes get propagated to existing file DataCite records going forward.

landreev commented 2 years ago

> Between versions 1.0 and 1.6 the dataset's title and description changed. Doesn't that mean that updated metadata is also sent to DataCite when minor versions are published? That seems ideal, too.

My understanding is that the DataCite metadata for files is only updated when major versions are published. The metadata for the dataset itself is updated whenever any version is published, major or minor. Again, I can double-check.

landreev commented 2 years ago

For the practical purpose of updating these entries for our production files, I can and am willing to help with batching/scripting however we decide to do this. If we're worried about re-enabling file-level registration in production for the period of time it would take to update these records, we could run this update job on a test instance running with a copy of the production database.