IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Review 12 unpublished datasets with unreserved DOIs, check for duplicates, contact depositors #203

Closed jggautier closed 1 year ago

jggautier commented 1 year ago

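After the recent DataCite outage, I used an API endpoint (https://guides.dataverse.org/en/5.12/api/native-api.html?highlight=reserve#list-unreserved-pids) to check whether any other unpublished datasets in the Harvard repo have unreserved DOIs. There were 17, including 5 datasets created on the day of the DataCite outage. I used another endpoint (https://guides.dataverse.org/en/5.12/api/native-api.html?highlight=reserve#reserve-a-pid) to reserve the DOIs of those 5, published the datasets, and followed up with the depositors who had emailed the support address to let them know their datasets were published.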
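For anyone repeating these two steps, here is a minimal sketch of how the requests could be built in Python. It assumes the endpoint paths described in the "List Unreserved PIDs" and "Reserve a PID" sections of the Dataverse 5.12 Native API guide (`/api/pids/unreserved` and `/api/pids/:persistentId/reserve`) and the `X-Dataverse-key` token header; the function names, the server URL, the token, and the example DOI are placeholders of mine, so check the guide for your Dataverse version before relying on this. The endpoints likely need a superuser API token.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def build_unreserved_request(server_url, api_token):
    """Build a GET request that asks for the PIDs not yet reserved.

    Assumes the /api/pids/unreserved path from the Native API guide.
    """
    url = server_url.rstrip("/") + "/api/pids/unreserved"
    return urllib.request.Request(
        url, headers={"X-Dataverse-key": api_token}, method="GET"
    )


def build_reserve_request(server_url, api_token, pid):
    """Build a POST request that asks Dataverse to reserve one PID.

    `pid` is a full persistent identifier, e.g. "doi:10.5072/FK2/EXAMPLE"
    (a placeholder, not a real dataset).
    """
    query = urllib.parse.urlencode({"persistentId": pid})
    url = server_url.rstrip("/") + "/api/pids/:persistentId/reserve?" + query
    return urllib.request.Request(
        url, headers={"X-Dataverse-key": api_token}, method="POST"
    )


def send(request):
    """Send a request and decode the body as JSON.

    Error responses (4xx/5xx) are decoded too, on the assumption that
    Dataverse returns a JSON body for them; a non-JSON body will raise
    json.JSONDecodeError.
    """
    try:
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        return json.loads(err.read().decode("utf-8"))
```

Building the requests separately from sending them makes it possible to print and review each URL before anything is changed in production, for example `print(build_reserve_request(server, token, pid).full_url)`.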

The other 12 unpublished datasets whose DOIs are unreserved were created between 2019 and 2021. Info about them is in Google Sheets (https://docs.google.com/spreadsheets/d/10hWVBb-9GiyBrdx4yZ9RvuuN2S9cXuKa9IuE-M1VJ5o).

Since these datasets have been unpublished for a year or longer, we should:

If the depositors don't reply, these unpublished datasets will eventually be included in the Harvard repo curation team's "production cleanup," where the team will try to contact depositors of datasets that have been unpublished for a certain length of time to encourage the depositors to publish, and the team will remove the datasets if we can't get in touch with the depositors.

sbarbosadataverse commented 1 year ago

Thanks for capturing this, Julian! Let me know if we need further discussion.



jggautier commented 1 year ago

Thanks. I was able to reserve PIDs for 10 of the 12 datasets, after making sure the data hadn't already been published in other datasets.

The spreadsheet includes the URLs of the two datasets whose PIDs I haven't reserved.

  1. For one of those datasets, I see that its data is also in a second unpublished dataset that's been submitted for review in a journal's Dataverse collection. I've contacted the depositor (https://help.hmdc.harvard.edu/Ticket/Display.html?id=331263) to ask if one of the deposits can be deleted.

  2. The second dataset has something in its Producer Affiliation field, but its Producer Name field is empty. This isn't allowed anymore (https://github.com/IQSS/dataverse/issues/7606) because of DataCite metadata requirements (https://github.com/IQSS/dataverse/issues/7518), so trying to reserve a PID for that dataset returns an error like: {"status":"ERROR","message":"Problem reserving PID for dataset id #######: Response from postMetadata: 422, DOI 10.7910/dvn/#######: [facet 'minLength'] The value has a length of '0'; this underruns the allowed minimum length of '1'. at line 26, column 0."}

    It looks like the depositor emailed Harvard Dataverse support to report that they couldn't publish the dataset (https://help.hmdc.harvard.edu/Ticket/Display.html?id=293853). The dataset was created before Dataverse's "conditionally required fields" update, and in that email thread @jyuenger rightly guessed that the problem was due to the DataCite metadata issue.

    I don't know what to put in the Producer Name field. Maybe the depositor considers themselves to be the "Producer" and didn't fill in the Producer Name field because they've already added their name to other fields (like the Author Name and Contact fields). I've followed up in an email to the depositor to ask.

    Hopefully they reply and we can do something to reserve the DOI and publish the dataset (such as adding a Producer Name or deleting what's in the Producer Affiliation field).
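When reserving PIDs for several datasets in a row, it can help to recognize the error quoted in item 2 automatically. The sketch below parses a response body shaped like that error and flags the DataCite "minLength" complaint. The function name and the returned keys are my own, and the message format is assumed from that single example, so treat it as illustrative and not as a description of every error Dataverse can return.

```python
import json
import re


def parse_reserve_error(body):
    """Summarize a failed reserve-PID response, or return None on success.

    `body` is the raw JSON text returned by the reserve endpoint.
    """
    payload = json.loads(body)
    if payload.get("status") != "ERROR":
        return None

    message = payload.get("message", "")
    # The example message embeds the status DataCite returned,
    # e.g. "Response from postMetadata: 422".
    match = re.search(r"Response from postMetadata: (\d{3})", message)
    return {
        "http_status": int(match.group(1)) if match else None,
        # "minLength" in the message suggests that an empty value was sent
        # for a field DataCite requires (here, the empty Producer Name).
        "empty_required_value": "minLength" in message,
        "message": message,
    }
```

A result with `empty_required_value` set to `True` is a hint to look for a half-filled compound field in the dataset's metadata before retrying.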

jggautier commented 1 year ago

The depositor of one of the two remaining datasets replied over the winter break and I was able to remove that unpublished dataset.

Just one dataset to go. I just sent a follow-up email (https://help.hmdc.harvard.edu/Ticket/Display.html?id=293853).

jggautier commented 1 year ago

I haven't heard back from the depositor of the last dataset whose DOI was unreserved. Because it's an unpublished dataset, I just removed what was typed in the Producer Affiliation field (the Producer Name field was already empty), re-saved the unpublished dataset, and used the API endpoint to reserve the DOI.

The curation team will probably remove this unpublished dataset eventually since it's pretty old.

I found another dataset whose DOI was unreserved and I was able to use the API endpoint to reserve it. It looks like these unreserved DOI errors don't happen as often as datasets being locked for a long time (https://github.com/IQSS/dataverse-HDV-Curation/issues/345), but I'll check every so often for datasets whose DOIs aren't reserved and reserve any I find.