Open mariagould opened 6 months ago
Rushiraj created a ticket for similar topic #559 with information on how to retrieve DataCite records by schema version
To get stats on IDs with Schema 3 records for a specific repository (e.g., cdl.cdl), use the following query:
curl --location 'https://api.datacite.org/dois?client-id=cdl.cdl&schema-version=3'
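The same query can also be built from Python. This is a minimal sketch using only the standard library; the helper names `schema_query_url` and `total_from_response` are hypothetical, and the record count is assumed to be available under `meta.total`, as in the API results quoted later in this ticket.

```python
from urllib.parse import urlencode

API_BASE = "https://api.datacite.org/dois"


def schema_query_url(client_id, schema_version):
    """Build the DOI query URL for one repository and one schema version."""
    params = {"client-id": client_id, "schema-version": schema_version}
    return API_BASE + "?" + urlencode(params)


def total_from_response(payload):
    """Return the record count from a parsed API response (assumes meta.total)."""
    return payload["meta"]["total"]


# Same URL as the curl command above.
print(schema_query_url("cdl.cdl", "3"))
```

Passing the URL to any HTTP client and feeding the parsed JSON to `total_from_response` would give the per-repository count without downloading every record.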
Jing created related/duplicated ticket #556 with additional info. Copy additional info over and close dup. ticket.
We received an email from DataCite regarding the Schema 3 deprecation schedule, with a request to update metadata to Schema 4.
From: Kelly Stathis support@datacite.org
Date: Tuesday, January 30, 2024 at 9:00 AM
To: EZID EZID@UCOP.EDU, Rushiraj Nenuji Rushiraj.Nenuji@ucop.edu, John Chodacki John.Chodacki@ucop.edu, Jing Jiang Jing.Jiang@ucop.edu
Subject: Action Required: Schema 3 usage within your consortium
CAUTION: EXTERNAL EMAIL
Dear California Digital Library team,
I'm writing to share that DataCite plans to deprecate Schema 3 on January 1, 2025, and to request your assistance with communicating this change to the Consortium Organizations within your consortium.
You can read more about what will change here: https://support.datacite.org/docs/updating-from-schema-3-to-schema-4. Once we deprecate Schema 3, repositories will be required to use Schema 4 for DOI registration and metadata updates.
There are 8 Repositories in your consortium with at least one Schema 3 DOI. Of these, 2 actively used Schema 3 in the past year to register or update DOIs. The Repositories actively using Schema 3 will be impacted by this change.
To assist you in understanding this usage, I have attached a spreadsheet of Repositories in your consortium to this email. This is broken down as follows:
• Count of DOIs (Total)
• Count of DOIs registered/updated in 2023
• Count of Schema 3 DOIs
• Count of Schema 3 DOIs registered/updated in 2023
• Count of Schema 3 DOIs missing resourceTypeGeneral
• Count of Schema 3 DOIs missing resourceTypeGeneral registered/updated in 2023
• Count of Schema 3 DOIs with contributorType "Funder"
• Count of Schema 3 DOIs with contributorType "Funder" registered/updated in 2023
The counts of DOIs missing resourceTypeGeneral and using contributorType "Funder" are included because these DOIs are not compatible with Schema 4. For more information, please see the FAQ covering differences between Schema 3 and Schema 4.
Please work with your Consortium Organizations as soon as possible to ensure that each has sufficient time to update their systems and workflows to use DataCite Metadata Schema 4. We're available to answer any questions you have about the process.
Best regards, Kelly
— Kelly Stathis | Technical Community Manager | DataCite E: kelly.stathis@datacite.org | ORCID W: datacite.org | Blog | Twitter | LinkedIn Support Desk | Support Site | PID Forum
DataCite report (Jan 2024) on Schema 3 usage within your consortium:
Repo ID | Repo Name | Total DOIs | Total V3 DOIs | V3 DOIs missing resourceTypeGeneral | V3 DOIs with contributorType "Funder" |
---|---|---|---|---|---|
cdl.ucb | UC Berkeley | 39,496 | 24,524 | 7,574 | 0 |
cdl.ucsb | UC Santa Barbara | 131,803 | 3,856 | 26 | 0 |
cdl.cdl | CDL | 20,645 | 3,851 | 18 | 0 |
cdl.ucla | UC Los Angeles | 10,496 | 0 | 0 | 0 |
cdl.ucsd | UC San Diego | 129,765 | 632 | 530 | 0 |
cdl.ucr | UC Riverside | 136 | 0 | 0 | 0 |
cdl.uci | UC Irvine | 1,414 | 3 | 0 | 1 |
cdl.ucsc | UC Santa Cruz | 146 | 0 | 0 | 0 |
cdl.ucd | UC Davis | 221 | 1 | 0 | 0 |
cdl.ucsf | UC San Francisco | 32 | 9 | 0 | 0 |
cdl.ucm | UC Merced | 5 | 1 | 0 | 0 |
Query to find Schema 3 records:
Query to find Schema 3 records that are missing resourceTypeGeneral:
Query to find Schema 3 records that use the contributorType "Funder":
Records by schema versions (https://doi.datacite.org/providers/cdlco/dois):
v2.1 records: https://doi.datacite.org/providers/cdlco/dois?schema-version=2.1:
Version 3 and version 2.2 records are retrieved and saved in the Google Drive folder:
Code for validating and formatting:
ezidapp.models.identifier.IdentifierBase.clean():
def clean(self):
    self.baseClean()
    if self.isAgentPid:
        self.cleanAgentPid()
    self.cleanCitationMetadataFields()
    self.checkMetadataRequirements()
    self.computeComputedValues()
Notes:
checkMetadataRequirements() calls formRecord to generate an XML record. However, when the record is in the "datacite.fieldname: value" format, the XML record is not used by the process (the metadata field is not updated to the XML version of the record), so the "datacite.fieldname: value" format record is saved as is in EZID.
proc-datacite.py => _create_or_update() => impl.datacite.uploadMetadata() => impl.datacite.formRecord(): Form an XML record for upload to DataCite, employing metadata mapping if necessary.
METADATA_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<resource xmlns="http://datacite.org/schema/kernel-4"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://datacite.org/schema/kernel-4
http://schema.datacite.org/meta/kernel-4/metadata.xsd">
<identifier identifierType="{}">{}</identifier>
<creators>
<creator>
<creatorName>{}</creatorName>
</creator>
</creators>
<titles>
<title>{}</title>
</titles>
<publisher>{}</publisher>
<publicationYear>{}</publicationYear>
"""
The upgradeDcmsRecord function in datacite.py was developed to convert a DataCite Metadata Schema record to the latest version (currently, version 4). What it does currently:
• Convert resourceType and resourceTypeGeneral to a version 4 compatible format, covering the resourceType element and its resourceTypeGeneral attribute. If the resourceTypeGeneral attribute is not defined: report an error.
• Handle the contributor type "Funder", which went away in version 4.
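The two Schema 4 incompatibilities discussed here can be detected in a record with a few lines of standard-library Python. This sketch only reports the problems; it does not reproduce how upgradeDcmsRecord converts records. The function names and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix so any schema version can be checked."""
    return tag.rsplit("}", 1)[-1]


def v4_blockers(xml_text):
    """Return the reasons a record would not be Schema 4 compatible."""
    root = ET.fromstring(xml_text)
    problems = []
    resource_types = [e for e in root.iter()
                      if local_name(e.tag) == "resourceType"]
    # Covers both a missing element and a missing or empty attribute.
    if not any(e.get("resourceTypeGeneral") for e in resource_types):
        problems.append("missing resourceTypeGeneral")
    for e in root.iter():
        if (local_name(e.tag) == "contributor"
                and e.get("contributorType") == "Funder"):
            problems.append('contributorType "Funder"')
            break
    return problems


# Made-up Schema 3 record that exhibits both problems.
SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-3">
  <identifier identifierType="DOI">10.5072/FK2EXAMPLE</identifier>
  <contributors>
    <contributor contributorType="Funder">
      <contributorName>Example Foundation</contributorName>
    </contributor>
  </contributors>
  <resourceType>Dataset</resourceType>
</resource>"""

print(v4_blockers(SAMPLE))
```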
Retrieved DataCite 3 records by campus:
Sample command:
curl --location 'https://api.datacite.org/dois?client-id=cdl.cdl&schema-version=3' > datacite_cdl.cdl_v3.json
Record files are saved in the Google Drive folder EZID/Identifiers/DataCite/DataCite_3_records
Note: Each file only contains 25 records (1st page with default size). Find a way to retrieve all records for each campus.
DataCite API offers two pagination options: page number (page[number]) and cursor (page[cursor]).
Example to retrieve the first 1,000 records:
curl --location "https://api.datacite.org/dois?client-id=cdl.cdl&schema-version=3&page[cursor]=1&page[size]=1000" > datacite_cdl.cdl_v3_1.json
Results file contains total records and page counts, plus the URL for retrieving the next page:
"meta": {
"total": 3567,
"totalPages": 4,
"links": {
"self": "https://api.datacite.org/dois?client-id=cdl.cdl&schema-version=3&page[cursor]=1&page[size]=1000",
"next": "https://api.datacite.org/dois?client-id=cdl.cdl&page%5Bcursor%5D=MTQzODQzNzE4OTAwMCwxMC4xNTE0NC9wbC1jNDkuMzY1&page%5Bsize%5D=1000"
}
Note: the "next" URL drops the schema-version filter, so the search criterion "schema-version=3" needs to be added back manually:
Change from:
https://api.datacite.org/dois?client-id=cdl.cdl&page%5Bcursor%5D=MTQzODQzNzE4OTAwMCwxMC4xNTE0NC9wbC1jNDkuMzY1&page%5Bsize%5D=1000
To:
https://api.datacite.org/dois?client-id=cdl.cdl&schema-version=3&page%5Bcursor%5D=MTQzODQzNzE4OTAwMCwxMC4xNTE0NC9wbC1jNDkuMzY1&page%5Bsize%5D=1000
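The manual fix to the next-page URL can be automated. The sketch below re-adds a missing query parameter using only `urllib.parse`; the helper name `with_query_param` is hypothetical. The parameter is appended at the end of the query string rather than after client-id, which should be equivalent for the API.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url, key, value):
    """Return url with key=value added to the query string if key is absent."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if key not in [k for k, _ in pairs]:
        pairs.append((key, value))
    # urlencode re-escapes the brackets in page[cursor] and page[size].
    return urlunsplit(parts._replace(query=urlencode(pairs)))


next_url = ("https://api.datacite.org/dois?client-id=cdl.cdl"
            "&page%5Bcursor%5D=MTQzODQzNzE4OTAwMCwxMC4xNTE0NC9wbC1jNDkuMzY1"
            "&page%5Bsize%5D=1000")
fixed = with_query_param(next_url, "schema-version", "3")
print(fixed)
```

Applying the helper to every "next" link before following it keeps the filter in place across all pages, and calling it on a URL that already has the parameter leaves the URL unchanged.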
Developed scripts/retrieve_datacite_records.py to automatically retrieve DataCite records and produce DOI lists.
• Record files are saved in the campus_id_v3_querytype_page[no].json format. Page size is set to 1000 records.
• DOI lists are saved in the campus_id_v3_querytype.txt format.
Counts of DOIs by campus and by categories (retrieved on June 17, 2024):
(ezid-py38) CDL-jjiang-9m:datacite_records jjiang$ wc -l *.txt
3567 cdl.cdl_v3.txt
10 cdl.cdl_v3_wo_res_type_gen.txt
0 cdl.cdl_v3_wt_contrib_funder.txt
24983 cdl.ucb_v3.txt
8043 cdl.ucb_v3_wo_res_type_gen.txt
0 cdl.ucb_v3_wt_contrib_funder.txt
0 cdl.ucd_v3.txt
0 cdl.ucd_v3_wo_res_type_gen.txt
0 cdl.ucd_v3_wt_contrib_funder.txt
0 cdl.uci_v3.txt
0 cdl.uci_v3_wo_res_type_gen.txt
0 cdl.uci_v3_wt_contrib_funder.txt
0 cdl.ucla_v3.txt
0 cdl.ucla_v3_wo_res_type_gen.txt
0 cdl.ucla_v3_wt_contrib_funder.txt
0 cdl.ucm_v3.txt
0 cdl.ucm_v3_wo_res_type_gen.txt
0 cdl.ucm_v3_wt_contrib_funder.txt
0 cdl.ucr_v3.txt
0 cdl.ucr_v3_wo_res_type_gen.txt
0 cdl.ucr_v3_wt_contrib_funder.txt
3843 cdl.ucsb_v3.txt
26 cdl.ucsb_v3_wo_res_type_gen.txt
0 cdl.ucsb_v3_wt_contrib_funder.txt
0 cdl.ucsc_v3.txt
0 cdl.ucsc_v3_wo_res_type_gen.txt
0 cdl.ucsc_v3_wt_contrib_funder.txt
632 cdl.ucsd_v3.txt
530 cdl.ucsd_v3_wo_res_type_gen.txt
0 cdl.ucsd_v3_wt_contrib_funder.txt
0 cdl.ucsf_v3.txt
0 cdl.ucsf_v3_wo_res_type_gen.txt
0 cdl.ucsf_v3_wt_contrib_funder.txt
3567 cdl.cdl_v3.txt
24983 cdl.ucb_v3.txt
0 cdl.ucd_v3.txt
0 cdl.uci_v3.txt
0 cdl.ucla_v3.txt
0 cdl.ucm_v3.txt
0 cdl.ucr_v3.txt
3843 cdl.ucsb_v3.txt
0 cdl.ucsc_v3.txt
632 cdl.ucsd_v3.txt
0 cdl.ucsf_v3.txt
33025 total
10 cdl.cdl_v3_wo_res_type_gen.txt
8043 cdl.ucb_v3_wo_res_type_gen.txt
0 cdl.ucd_v3_wo_res_type_gen.txt
0 cdl.uci_v3_wo_res_type_gen.txt
0 cdl.ucla_v3_wo_res_type_gen.txt
0 cdl.ucm_v3_wo_res_type_gen.txt
0 cdl.ucr_v3_wo_res_type_gen.txt
26 cdl.ucsb_v3_wo_res_type_gen.txt
0 cdl.ucsc_v3_wo_res_type_gen.txt
530 cdl.ucsd_v3_wo_res_type_gen.txt
0 cdl.ucsf_v3_wo_res_type_gen.txt
8609 total
0 cdl.cdl_v3_wt_contrib_funder.txt
0 cdl.ucb_v3_wt_contrib_funder.txt
0 cdl.ucd_v3_wt_contrib_funder.txt
0 cdl.uci_v3_wt_contrib_funder.txt
0 cdl.ucla_v3_wt_contrib_funder.txt
0 cdl.ucm_v3_wt_contrib_funder.txt
0 cdl.ucr_v3_wt_contrib_funder.txt
0 cdl.ucsb_v3_wt_contrib_funder.txt
0 cdl.ucsc_v3_wt_contrib_funder.txt
0 cdl.ucsd_v3_wt_contrib_funder.txt
0 cdl.ucsf_v3_wt_contrib_funder.txt
0 total
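DOI list files like the ones counted above, with one DOI per line, could be derived from saved result pages roughly as follows. This is a sketch, not the code in scripts/retrieve_datacite_records.py: the function names are hypothetical, and it assumes each page is a JSON document whose `data` array items carry the DOI in their `id` field.

```python
import json


def dois_from_page(page_text):
    """Extract DOI strings from one saved API results page (JSON text)."""
    page = json.loads(page_text)
    # Assumption: every item in "data" has the DOI under "id".
    return [item["id"] for item in page.get("data", [])]


def doi_list_text(pages):
    """Join the DOIs of all pages, one per line, so `wc -l` gives the count."""
    lines = []
    for page_text in pages:
        lines.extend(dois_from_page(page_text))
    return "".join(doi + "\n" for doi in lines)


# Two tiny made-up pages standing in for the saved page files.
page_1 = json.dumps({"data": [{"id": "10.5072/fk2a"}, {"id": "10.5072/fk2b"}]})
page_2 = json.dumps({"data": [{"id": "10.5072/fk2c"}]})
print(doi_list_text([page_1, page_2]), end="")
```

An empty result set produces an empty string, which matches the 0-line files in the listing.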
Noting the change in v3 record counts since January 2024:
repo_id | 2024-01 | 2024-06-17 | change |
---|---|---|---|
cdl.cdl | 3851 | 3567 | -284 |
cdl.ucb | 24524 | 24983 | 459 |
cdl.ucsb | 3856 | 3843 | -13 |
cdl.ucsd | 632 | 632 | 0 |
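The change column can be recomputed from the two snapshots, which is handy when the counts are refreshed later. A small sketch using the figures from the table; the function name `count_changes` is hypothetical.

```python
# v3 record counts per repository, taken from the table above.
jan_2024 = {"cdl.cdl": 3851, "cdl.ucb": 24524, "cdl.ucsb": 3856, "cdl.ucsd": 632}
jun_2024 = {"cdl.cdl": 3567, "cdl.ucb": 24983, "cdl.ucsb": 3843, "cdl.ucsd": 632}


def count_changes(before, after):
    """Return after minus before for every repository in `before`."""
    return {repo: after[repo] - before[repo] for repo in before}


changes = count_changes(jan_2024, jun_2024)
for repo, delta in changes.items():
    print(f"{repo} | {jan_2024[repo]} | {jun_2024[repo]} | {delta:+d} |")
print("net change:", sum(changes.values()))
```

Across these four repositories the net change is an increase of 162 v3 records, driven entirely by cdl.ucb.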
Retrieved v2.2 records using scripts/retrieve_datacite_records.py (with some modifications).
11171 cdl.cdl_v22.txt
49 cdl.ucb_v22.txt
0 cdl.ucd_v22.txt
1 cdl.uci_v22.txt
0 cdl.ucla_v22.txt
0 cdl.ucm_v22.txt
0 cdl.ucr_v22.txt
753 cdl.ucsb_v22.txt
0 cdl.ucsc_v22.txt
23 cdl.ucsd_v22.txt
0 cdl.ucsf_v22.txt
11997 total
11169 cdl.cdl_v22_wo_res_type_gen.txt
0 cdl.ucb_v22_wo_res_type_gen.txt
0 cdl.ucd_v22_wo_res_type_gen.txt
0 cdl.uci_v22_wo_res_type_gen.txt
0 cdl.ucla_v22_wo_res_type_gen.txt
0 cdl.ucm_v22_wo_res_type_gen.txt
0 cdl.ucr_v22_wo_res_type_gen.txt
98 cdl.ucsb_v22_wo_res_type_gen.txt
0 cdl.ucsc_v22_wo_res_type_gen.txt
0 cdl.ucsd_v22_wo_res_type_gen.txt
0 cdl.ucsf_v22_wo_res_type_gen.txt
11267 total
0 cdl.cdl_v22_wt_contrib_funder.txt
0 cdl.ucb_v22_wt_contrib_funder.txt
0 cdl.ucd_v22_wt_contrib_funder.txt
0 cdl.uci_v22_wt_contrib_funder.txt
0 cdl.ucla_v22_wt_contrib_funder.txt
0 cdl.ucm_v22_wt_contrib_funder.txt
0 cdl.ucr_v22_wt_contrib_funder.txt
0 cdl.ucsb_v22_wt_contrib_funder.txt
0 cdl.ucsc_v22_wt_contrib_funder.txt
0 cdl.ucsd_v22_wt_contrib_funder.txt
0 cdl.ucsf_v22_wt_contrib_funder.txt
0 total
As of January 2025, DataCite will require that DOIs be registered and updated with schema version 4.0 or newer. See the initial announcement here: https://datacite.org/blog/deprecating-schema-3/
To keep up with DataCite policies and practices, EZID's DataCite configuration needs to be updated so that DataCite DOIs can no longer be registered or updated with schema versions older than 4.0. This affects DOIs created via the API, UI, and XML deposits. Users will need to be informed about the change in advance and provided with guidance about upgrading.
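One possible way to enforce such a restriction on XML deposits is to read the schema version from the namespace of the record's root element. The sketch below is an assumption about how a check could look, not EZID's implementation: the function names are hypothetical, and it relies on DataCite kernel namespaces having the form http://datacite.org/schema/kernel-N.

```python
import re
import xml.etree.ElementTree as ET

# Matches a root tag such as {http://datacite.org/schema/kernel-4}resource.
KERNEL_TAG = re.compile(
    r"^\{http://datacite\.org/schema/kernel-(\d+)(?:\.\d+)?\}resource$")


def schema_major_version(xml_text):
    """Return the major schema version from the root namespace, or None."""
    root = ET.fromstring(xml_text)
    match = KERNEL_TAG.match(root.tag)
    return int(match.group(1)) if match else None


def is_accepted(xml_text, minimum=4):
    """True only for records in schema version `minimum` or newer."""
    version = schema_major_version(xml_text)
    return version is not None and version >= minimum


v3 = '<resource xmlns="http://datacite.org/schema/kernel-3"/>'
v4 = '<resource xmlns="http://datacite.org/schema/kernel-4"/>'
print(is_accepted(v3), is_accepted(v4))
```

Records without a recognizable kernel namespace are rejected as well, which errs on the side of not sending unknown formats to DataCite.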
Steps