CDLUC3 / ezid

Update old DataCite schema records to 4.5 #540

Open mariagould opened 6 months ago

mariagould commented 6 months ago

As of January 2025, DataCite will require that DOIs be registered and updated with schema version 4.0 or newer. See the initial announcement: https://datacite.org/blog/deprecating-schema-3/

To keep up with DataCite policies and practices, EZID's DataCite configuration needs to be updated so that DataCite DOIs can no longer be registered or updated with schema versions older than 4.0. This affects DOIs created via the API, UI, and XML deposits. Users will need to be informed about the change in advance and provided with guidance about upgrading.

Steps

jsjiang commented 2 months ago

Rushiraj created ticket #559 on a similar topic, with information on how to retrieve DataCite records by schema version.

To get stats on IDs with Schema 3 versions for a specific repository (e.g., cdl.cdl):

curl --location 'https://api.datacite.org/dois?client-id=cdl.cdl&schema-version=3'
jsjiang commented 2 months ago

Jing created a related (duplicate) ticket, #556, with additional info. Copying the additional info over here and closing the duplicate ticket.

We received an email from DataCite about the Schema 3 deprecation schedule, with a request to update metadata to Schema 4.

From: Kelly Stathis support@datacite.org
Date: Tuesday, January 30, 2024 at 9:00 AM
To: EZID EZID@UCOP.EDU, Rushiraj Nenuji Rushiraj.Nenuji@ucop.edu, John Chodacki John.Chodacki@ucop.edu, Jing Jiang Jing.Jiang@ucop.edu
Subject: Action Required: Schema 3 usage within your consortium

Dear California Digital Library team,

I'm writing to share that DataCite plans to deprecate Schema 3 on January 1, 2025, and to request your assistance with communicating this change to the Consortium Organizations within your consortium.

You can read more about what will change here: https://support.datacite.org/docs/updating-from-schema-3-to-schema-4. Once we deprecate Schema 3, repositories will be required to use Schema 4 for DOI registration and metadata updates.

There are 8 Repositories in your consortium with at least one Schema 3 DOI. Of these, 2 actively used Schema 3 in the past year to register or update DOIs. The Repositories actively using Schema 3 will be impacted by this change.

To assist you in understanding this usage, I have attached a spreadsheet of Repositories in your consortium to this email. This is broken down as follows:

  • Count of DOIs (Total)
  • Count of DOIs registered/updated in 2023
  • Count of Schema 3 DOIs
  • Count of Schema 3 DOIs registered/updated in 2023
  • Count of Schema 3 DOIs missing resourceTypeGeneral
  • Count of Schema 3 DOIs missing resourceTypeGeneral registered/updated in 2023
  • Count of Schema 3 DOIs with contributorType "Funder"
  • Count of Schema 3 DOIs with contributorType "Funder" registered/updated in 2023

The counts of DOIs missing resourceTypeGeneral and using contributorType "Funder" are included because these DOIs are not compatible with Schema 4. For more information, please see the FAQ covering differences between Schema 3 and Schema 4.

Please work with your Consortium Organizations as soon as possible to ensure that each has sufficient time to update their systems and workflows to use DataCite Metadata Schema 4. We're available to answer any questions you have about the process.

Best regards, Kelly

Kelly Stathis | Technical Community Manager | DataCite
E: kelly.stathis@datacite.org | W: datacite.org

jsjiang commented 2 months ago

DataCite report (Jan 2024) on Schema 3 usage within your consortium:

cdlco.csv

Repo ID | Repo Name | Total DOIs | Total V3 DOIs | V3 DOIs missing resourceTypeGeneral | V3 DOIs with contributorType "Funder"
cdl.ucb | UC Berkeley | 39,496 | 24,524 | 7,574 | 0
cdl.ucsb | UC Santa Barbara | 131,803 | 3,856 | 26 | 0
cdl.cdl | CDL | 20,645 | 3,851 | 18 | 0
cdl.ucla | UC Los Angeles | 10,496 | 0 | 0 | 0
cdl.ucsd | UC San Diego | 129,765 | 632 | 530 | 0
cdl.ucr | UC Riverside | 136 | 0 | 0 | 0
cdl.uci | UC Irvine | 1,414 | 3 | 0 | 1
cdl.ucsc | UC Santa Cruz | 146 | 0 | 0 | 0
cdl.ucd | UC Davis | 221 | 1 | 0 | 0
cdl.ucsf | UC San Francisco | 32 | 9 | 0 | 0
cdl.ucm | UC Merced | 5 | 1 | 0 | 0

Query to find Schema 3 records:

Query to find Schema 3 records that are missing resourceTypeGeneral:

Query to find Schema 3 records that use the contributorType "Funder":

jsjiang commented 2 months ago

Records by schema versions (https://doi.datacite.org/providers/cdlco/dois):

v2.1 records: https://doi.datacite.org/providers/cdlco/dois?schema-version=2.1:

Version 3 and version 2.2 records are retrieved and saved in the Google Drive folder:

jsjiang commented 2 months ago
  1. EZID saves DataCite records in two formats:
    • key/value pair in "datacite: xml doc" format
    • key/value pairs in "datacite.fieldname: value" format
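A minimal sketch of what the two storage formats look like as key/value pairs. The field names and sample values are placeholders for illustration, and `storage_format` is a hypothetical helper, not part of EZID:

```python
def storage_format(metadata: dict) -> str:
    """Classify how an identifier's DataCite record is stored.

    Hypothetical helper for illustration only; not EZID code.
    """
    if "datacite" in metadata:
        return "xml"     # whole XML document under a single "datacite" key
    if any(key.startswith("datacite.") for key in metadata):
        return "fields"  # one "datacite.fieldname" key per field
    return "none"


# Format 1: a single key holding the complete XML document (abbreviated here).
xml_record = {
    "datacite": '<resource xmlns="http://datacite.org/schema/kernel-4">...</resource>',
}

# Format 2: one key per metadata field (placeholder names and values).
field_record = {
    "datacite.creator": "Doe, Jane",
    "datacite.title": "Sample dataset",
    "datacite.publisher": "Example Repository",
    "datacite.publicationyear": "2024",
}

print(storage_format(xml_record), storage_format(field_record))
```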

Code for validating and formatting:

ezidapp.models.identifier.IdentifierBase.clean():

    def clean(self):
        self.baseClean()
        if self.isAgentPid:
            self.cleanAgentPid()
        self.cleanCitationMetadataFields()
        self.checkMetadataRequirements()
        self.computeComputedValues()

Notes:

  1. EZID converts a "datacite.fieldname: value" format record to XML based on metadata schema when registering the record with DataCite. So all records in DataCite are in XML format.

proc-datacite.py => _create_or_update() => impl.datacite.uploadMetadata() => impl.datacite.formRecord(): Form an XML record for upload to DataCite, employing metadata mapping if necessary

METADATA_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<resource xmlns="http://datacite.org/schema/kernel-4"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://datacite.org/schema/kernel-4
    http://schema.datacite.org/meta/kernel-4/metadata.xsd">
  <identifier identifierType="{}">{}</identifier>
  <creators>
    <creator>
      <creatorName>{}</creatorName>
    </creator>
  </creators>
  <titles>
    <title>{}</title>
  </titles>
  <publisher>{}</publisher>
  <publicationYear>{}</publicationYear>
"""
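A sketch of how a template like this could be filled from field values. The template excerpt above is truncated, so the shortened copy below (including the closing `</resource>` tag) is an assumption, and `form_minimal_record` is a hypothetical name, not the actual `formRecord` implementation. The point it illustrates: the `{}` placeholders are positional, and values need XML escaping before substitution.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Shortened, self-closed copy of the template (closing tag is an assumption).
TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="{}">{}</identifier>
  <creators>
    <creator>
      <creatorName>{}</creatorName>
    </creator>
  </creators>
  <titles>
    <title>{}</title>
  </titles>
  <publisher>{}</publisher>
  <publicationYear>{}</publicationYear>
</resource>
"""


def form_minimal_record(id_type, identifier, creator, title, publisher, year):
    """Fill the positional placeholders with XML-escaped values."""
    values = (id_type, identifier, creator, title, publisher, year)
    # Also escape double quotes because the first value lands in an attribute.
    return TEMPLATE.format(*(escape(str(v), {'"': "&quot;"}) for v in values))


# Placeholder values; 10.5072 is used here only as a sample prefix.
record = form_minimal_record(
    "DOI", "10.5072/FK2SAMPLE", "Doe & Roe", "A <sample> title", "CDL", 2024
)
ET.fromstring(record.encode("utf-8"))  # raises ParseError if not well-formed
```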
jsjiang commented 1 month ago

The upgradeDcmsRecord function in datacite.py was developed to convert a DataCite Metadata Schema record to the latest version (currently, version 4). What it currently does:

  1. Convert resourceType and resourceTypeGeneral to a version 4 compatible format

    • If the record does not contain a resourceType element:
      • Create one: (:unav)
    • If the record contains a resourceType element:
      • If resourceTypeGeneral="Film", change it to "Audiovisual";
      • If the resourceTypeGeneral attribute is not defined: report an error.
  2. Handle the contributor type "Funder" that went away in version 4
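The rules above can be sketched roughly as follows. This is not the actual upgradeDcmsRecord code: `upgrade_sketch` is a hypothetical name, it uses the standard library rather than whatever XML library EZID uses, it does not touch `xsi:schemaLocation`, and it only reports "Funder" contributors instead of converting them, since the conversion is not described here. Setting `resourceTypeGeneral="Other"` on a newly created element is also an assumption.

```python
import xml.etree.ElementTree as ET

NS4 = "http://datacite.org/schema/kernel-4"
ET.register_namespace("", NS4)  # serialize kernel-4 as the default namespace


def q(name):
    """Qualify a local element name with the kernel-4 namespace."""
    return "{%s}%s" % (NS4, name)


def upgrade_sketch(xml_text):
    """Apply the listed rules; return (upgraded XML, names of Funder contributors)."""
    root = ET.fromstring(xml_text)
    # Move every element from its old namespace into kernel-4.
    for el in root.iter():
        el.tag = q(el.tag.rsplit("}", 1)[-1])

    res_type = root.find(q("resourceType"))
    if res_type is None:
        # Rule: no resourceType element, so create one with "(:unav)".
        res_type = ET.SubElement(root, q("resourceType"))
        res_type.text = "(:unav)"
        # ASSUMPTION: the attribute value is not stated in this thread.
        res_type.set("resourceTypeGeneral", "Other")
    else:
        general = res_type.get("resourceTypeGeneral")
        if general is None:
            raise ValueError("resourceTypeGeneral attribute is not defined")
        if general == "Film":
            res_type.set("resourceTypeGeneral", "Audiovisual")

    # Report (do not convert) contributors typed "Funder".
    funders = [
        (c.findtext(q("contributorName")) or "").strip()
        for c in root.iter(q("contributor"))
        if c.get("contributorType") == "Funder"
    ]
    return ET.tostring(root, encoding="unicode"), funders
```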

jsjiang commented 3 weeks ago

Retrieved DataCite 3 records by campus:

Sample command:

curl --location 'https://api.datacite.org/dois?client-id=cdl.cdl&schema-version=3' > datacite_cdl.cdl_v3.json

Record files are saved in the Google Drive folder EZID/Identifiers/DataCite/DataCite_3_records

Note: Each file only contains 25 records (1st page with default size). Find a way to retrieve all records for each campus.

DataCite API offers two pagination options:

Example to retrieve the first 1,000 records:

curl --location "https://api.datacite.org/dois?client-id=cdl.cdl&schema-version=3&page[cursor]=1&page[size]=1000" > datacite_cdl.cdl_v3_1.json

Results file contains total records and page counts, plus the URL for retrieving the next page:

"meta": {
    "total": 3567,
    "totalPages": 4,

  "links": {
    "self": "https://api.datacite.org/dois?client-id=cdl.cdl&schema-version=3&page[cursor]=1&page[size]=1000",
    "next": "https://api.datacite.org/dois?client-id=cdl.cdl&page%5Bcursor%5D=MTQzODQzNzE4OTAwMCwxMC4xNTE0NC9wbC1jNDkuMzY1&page%5Bsize%5D=1000"
  }

Note: the search criterion "schema-version=3" is missing from the next-page URL and needs to be added manually.

Change from: https://api.datacite.org/dois?client-id=cdl.cdl&page%5Bcursor%5D=MTQzODQzNzE4OTAwMCwxMC4xNTE0NC9wbC1jNDkuMzY1&page%5Bsize%5D=1000

To: https://api.datacite.org/dois?client-id=cdl.cdl&schema-version=3&page%5Bcursor%5D=MTQzODQzNzE4OTAwMCwxMC4xNTE0NC9wbC1jNDkuMzY1&page%5Bsize%5D=1000
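The manual URL fix can be automated. A rough sketch, assuming the response layout shown above (`data`, `links.next`); `with_param` and `fetch_all` are hypothetical helper names, and this is not necessarily how scripts/retrieve_datacite_records.py does it:

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import urlopen


def with_param(url, key, value):
    """Return url with query parameter key set to value (added if missing)."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != key]
    query.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_json(url):
    """Download and decode one page of results."""
    with urlopen(url) as resp:
        return json.load(resp)


def fetch_all(first_url, schema_version="3", fetch=fetch_json):
    """Yield every record, following links.next and re-adding schema-version."""
    url = first_url
    while url:
        page = fetch(url)
        records = page.get("data", [])
        if not records:
            break  # guard against looping on an empty page
        yield from records
        next_url = (page.get("links") or {}).get("next")
        url = with_param(next_url, "schema-version", schema_version) if next_url else None
```

Usage would be along the lines of `records = list(fetch_all(first_page_url))`, starting from the first-page URL in the curl example. The `fetch` argument is injectable so the loop can be exercised without network access.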

jsjiang commented 2 weeks ago

Counts of DOIs by campus and by categories (Retrieved on June 17, 2024):

(ezid-py38) CDL-jjiang-9m:datacite_records jjiang$ wc -l *.txt
    3567 cdl.cdl_v3.txt
      10 cdl.cdl_v3_wo_res_type_gen.txt
       0 cdl.cdl_v3_wt_contrib_funder.txt
   24983 cdl.ucb_v3.txt
    8043 cdl.ucb_v3_wo_res_type_gen.txt
       0 cdl.ucb_v3_wt_contrib_funder.txt
       0 cdl.ucd_v3.txt
       0 cdl.ucd_v3_wo_res_type_gen.txt
       0 cdl.ucd_v3_wt_contrib_funder.txt
       0 cdl.uci_v3.txt
       0 cdl.uci_v3_wo_res_type_gen.txt
       0 cdl.uci_v3_wt_contrib_funder.txt
       0 cdl.ucla_v3.txt
       0 cdl.ucla_v3_wo_res_type_gen.txt
       0 cdl.ucla_v3_wt_contrib_funder.txt
       0 cdl.ucm_v3.txt
       0 cdl.ucm_v3_wo_res_type_gen.txt
       0 cdl.ucm_v3_wt_contrib_funder.txt
       0 cdl.ucr_v3.txt
       0 cdl.ucr_v3_wo_res_type_gen.txt
       0 cdl.ucr_v3_wt_contrib_funder.txt
    3843 cdl.ucsb_v3.txt
      26 cdl.ucsb_v3_wo_res_type_gen.txt
       0 cdl.ucsb_v3_wt_contrib_funder.txt
       0 cdl.ucsc_v3.txt
       0 cdl.ucsc_v3_wo_res_type_gen.txt
       0 cdl.ucsc_v3_wt_contrib_funder.txt
     632 cdl.ucsd_v3.txt
     530 cdl.ucsd_v3_wo_res_type_gen.txt
       0 cdl.ucsd_v3_wt_contrib_funder.txt
       0 cdl.ucsf_v3.txt
       0 cdl.ucsf_v3_wo_res_type_gen.txt
       0 cdl.ucsf_v3_wt_contrib_funder.txt
    3567 cdl.cdl_v3.txt
   24983 cdl.ucb_v3.txt
       0 cdl.ucd_v3.txt
       0 cdl.uci_v3.txt
       0 cdl.ucla_v3.txt
       0 cdl.ucm_v3.txt
       0 cdl.ucr_v3.txt
    3843 cdl.ucsb_v3.txt
       0 cdl.ucsc_v3.txt
     632 cdl.ucsd_v3.txt
       0 cdl.ucsf_v3.txt
   33025 total
      10 cdl.cdl_v3_wo_res_type_gen.txt
    8043 cdl.ucb_v3_wo_res_type_gen.txt
       0 cdl.ucd_v3_wo_res_type_gen.txt
       0 cdl.uci_v3_wo_res_type_gen.txt
       0 cdl.ucla_v3_wo_res_type_gen.txt
       0 cdl.ucm_v3_wo_res_type_gen.txt
       0 cdl.ucr_v3_wo_res_type_gen.txt
      26 cdl.ucsb_v3_wo_res_type_gen.txt
       0 cdl.ucsc_v3_wo_res_type_gen.txt
     530 cdl.ucsd_v3_wo_res_type_gen.txt
       0 cdl.ucsf_v3_wo_res_type_gen.txt
    8609 total
       0 cdl.cdl_v3_wt_contrib_funder.txt
       0 cdl.ucb_v3_wt_contrib_funder.txt
       0 cdl.ucd_v3_wt_contrib_funder.txt
       0 cdl.uci_v3_wt_contrib_funder.txt
       0 cdl.ucla_v3_wt_contrib_funder.txt
       0 cdl.ucm_v3_wt_contrib_funder.txt
       0 cdl.ucr_v3_wt_contrib_funder.txt
       0 cdl.ucsb_v3_wt_contrib_funder.txt
       0 cdl.ucsc_v3_wt_contrib_funder.txt
       0 cdl.ucsd_v3_wt_contrib_funder.txt
       0 cdl.ucsf_v3_wt_contrib_funder.txt
       0 total
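For reference, a sketch of how records could be sorted into the three categories counted above. It assumes each record is a DataCite API JSON object with `attributes.types.resourceTypeGeneral` and `attributes.contributors[].contributorType`; the function names are hypothetical, and the flag names simply reuse the file-name suffixes from the listing:

```python
from collections import Counter


def classify(record):
    """Return the Schema 4 incompatibility flags for one DOI record."""
    attrs = record.get("attributes") or {}
    flags = set()
    types = attrs.get("types") or {}
    if not types.get("resourceTypeGeneral"):
        flags.add("wo_res_type_gen")  # missing resourceTypeGeneral
    contributors = attrs.get("contributors") or []
    if any(c.get("contributorType") == "Funder" for c in contributors):
        flags.add("wt_contrib_funder")  # uses contributorType "Funder"
    return flags


def count_categories(records):
    """Count all records plus those carrying each flag."""
    counts = Counter(total=0, wo_res_type_gen=0, wt_contrib_funder=0)
    for record in records:
        counts["total"] += 1
        counts.update(classify(record))
    return dict(counts)
```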
adambuttrick commented 2 weeks ago

Noting the change in v3 record counts since January 2024:

repo_id 2024-01 2024-06-17 change
cdl.cdl 3851 3567 -284
cdl.ucb 24524 24983 459
cdl.ucsb 3856 3843 -13
cdl.ucsd 632 632 0
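The change column can be recomputed from the two sets of counts. A small sketch using the numbers from the table above (`count_changes` is just an illustrative helper):

```python
# v3 record counts per repository, copied from the table above.
jan_2024 = {"cdl.cdl": 3851, "cdl.ucb": 24524, "cdl.ucsb": 3856, "cdl.ucsd": 632}
jun_2024 = {"cdl.cdl": 3567, "cdl.ucb": 24983, "cdl.ucsb": 3843, "cdl.ucsd": 632}


def count_changes(before, after):
    """Return after - before per repository; a missing repository counts as 0."""
    repos = sorted(set(before) | set(after))
    return {repo: after.get(repo, 0) - before.get(repo, 0) for repo in repos}


for repo, change in count_changes(jan_2024, jun_2024).items():
    print(f"{repo} {jan_2024[repo]} {jun_2024[repo]} {change:+d}")
```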
jsjiang commented 2 weeks ago

Retrieved v2.2 records using scripts/retrieve_datacite_records.py (with some modifications).

   11171 cdl.cdl_v22.txt
      49 cdl.ucb_v22.txt
       0 cdl.ucd_v22.txt
       1 cdl.uci_v22.txt
       0 cdl.ucla_v22.txt
       0 cdl.ucm_v22.txt
       0 cdl.ucr_v22.txt
     753 cdl.ucsb_v22.txt
       0 cdl.ucsc_v22.txt
      23 cdl.ucsd_v22.txt
       0 cdl.ucsf_v22.txt
   11997 total
   11169 cdl.cdl_v22_wo_res_type_gen.txt
       0 cdl.ucb_v22_wo_res_type_gen.txt
       0 cdl.ucd_v22_wo_res_type_gen.txt
       0 cdl.uci_v22_wo_res_type_gen.txt
       0 cdl.ucla_v22_wo_res_type_gen.txt
       0 cdl.ucm_v22_wo_res_type_gen.txt
       0 cdl.ucr_v22_wo_res_type_gen.txt
      98 cdl.ucsb_v22_wo_res_type_gen.txt
       0 cdl.ucsc_v22_wo_res_type_gen.txt
       0 cdl.ucsd_v22_wo_res_type_gen.txt
       0 cdl.ucsf_v22_wo_res_type_gen.txt
   11267 total
       0 cdl.cdl_v22_wt_contrib_funder.txt
       0 cdl.ucb_v22_wt_contrib_funder.txt
       0 cdl.ucd_v22_wt_contrib_funder.txt
       0 cdl.uci_v22_wt_contrib_funder.txt
       0 cdl.ucla_v22_wt_contrib_funder.txt
       0 cdl.ucm_v22_wt_contrib_funder.txt
       0 cdl.ucr_v22_wt_contrib_funder.txt
       0 cdl.ucsb_v22_wt_contrib_funder.txt
       0 cdl.ucsc_v22_wt_contrib_funder.txt
       0 cdl.ucsd_v22_wt_contrib_funder.txt
       0 cdl.ucsf_v22_wt_contrib_funder.txt
       0 total