CDLUC3 / ezid-service


Batch fix invalid DataCite xml metadata #250

Closed jsjiang closed 1 year ago

jsjiang commented 1 year ago

Over 40K DataCite records contain invalid XML metadata due to a program bug. We need to develop a process to batch-fix the metadata in both the EZID and DataCite systems.

jsjiang commented 1 year ago

Related issue: https://github.com/CDLUC3/ezid/issues/378

jsjiang commented 1 year ago

Procedure

Query to select the affected records (returns 40K+ rows):

uc3-ezidui02x2-stg:/home/jjiang/ezid/fix_datacite_xml> cat select_datacite_metadata.sql

select id, identifier, owner_id, ownergroup_id, metadata,
       TRIM(BOTH '"' FROM JSON_EXTRACT(metadata, "$.datacite")) as datacite_xml
from ezidapp_identifier
where identifier like 'doi%'
  and metadata like '%kernel-\%s%';
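The same selection rule can be applied outside the database, for example to spot-check an exported row. The following is a minimal sketch, assuming the bug left the literal text `kernel-%s` (presumably an unfilled format placeholder) in the stored metadata; the names `PLACEHOLDER` and `needs_fix` are hypothetical and not part of EZID. Unlike MySQL's `LIKE`, which may be case-insensitive depending on collation, this check is case-sensitive.

```python
# Literal text the SQL filter looks for; '\%' in the LIKE pattern escapes the percent sign.
PLACEHOLDER = "kernel-%s"


def needs_fix(identifier: str, metadata: str) -> bool:
    """Mirror the SQL filter: a DOI whose metadata contains the literal 'kernel-%s'."""
    return identifier.startswith("doi") and PLACEHOLDER in metadata


# Example: an affected DOI record versus a correctly formatted one.
print(needs_fix("doi:10.7941/D1TP7N", "http://datacite.org/schema/kernel-%s"))
print(needs_fix("doi:10.7941/D1TP7N", "http://datacite.org/schema/kernel-4"))
```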

Output file:

c3-ezidui02x2-stg:/home/jjiang/ezid/fix_datacite_xml/data_files> wc -l datacite_records_to_fix_prd.tsv
40295 datacite_records_to_fix_prd.tsv

import base64

import requests


def update_datacite_xml(id, datacite_xml, base_url, passwd):
    # Modify-identifier endpoint for this identifier
    url = f"{base_url}/id/{id}"

    # metadata in name: value format
    #datacite = f"datacite: {datacite_xml}"
    headers = {
        "Content-Type": "text/plain; charset=UTF-8",
        "Authorization": "Basic " + base64.b64encode(f"admin:{passwd}".encode('utf-8')).decode('utf-8'),
    }
    try:
        #r = requests.post(url=url, data=datacite, headers=headers)
        # no need to send data for this metadata fix;
        # the data element to be fixed is in the resource tag, which is created by datacite.py
        r = requests.post(url=url, headers=headers)
        #r.raise_for_status()
        return r.text
    except Exception as e:
        # on failure the error is printed and the function returns None
        print(e)
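One way to drive a function like this over the exported file is sketched below. It is an illustration only: it assumes the TSV has a header row containing an `identifier` column (matching the SQL column names), and the helpers `read_tsv` and `run_batch` are hypothetical. The update call is passed in as a callable so the loop can be dry-run without touching EZID.

```python
import csv
import io
from typing import Callable, Iterable, Iterator, Tuple


def read_tsv(handle) -> Iterator[dict]:
    """Yield one dict per row; assumes the first line holds the column names."""
    return csv.DictReader(handle, delimiter="\t")


def run_batch(rows: Iterable[dict],
              update: Callable[[str], str]) -> Iterator[Tuple[str, str]]:
    """Call update(identifier) for every row and yield (identifier, response text)."""
    for row in rows:
        identifier = row["identifier"]
        try:
            yield identifier, update(identifier)
        except Exception as e:
            # Keep going so one bad record does not stop a 40K-record run.
            yield identifier, f"error: {e}"


# Dry run against an in-memory sample with a stand-in update function.
sample = io.StringIO("id\tidentifier\n1\tdoi:10.7941/D1TP7N\n2\tdoi:10.7941/D1V63M\n")
results = list(run_batch(read_tsv(sample), lambda ident: f"success: {ident}"))
for identifier, response in results:
    print(identifier, response)
```

For a real run, the stand-in lambda would be replaced with something like `lambda ident: update_datacite_xml(ident, None, base_url, passwd)`, where `base_url` and `passwd` are supplied by the operator.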

Last 5 records:

success: doi:10.7941/D1TP7N - Reserved - Updated in EZID - not showing on DataCite
success: doi:10.7941/D1V63M - Public - Updated in EZID - showing correctly on DataCite
success: doi:10.7941/D1WK8M - public - Updated in EZID - showing correctly on DataCite
success: doi:10.7941/D1ZD0G - reserved - Updated in EZID - not showing on DataCite
success: doi:10.7941/D1ZS7X - public - Updated in EZID - showing correctly on DataCite
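The response text begins with a status word followed by the identifier, as in the `success: doi:...` lines above. A small sketch for splitting that first line when tallying results; the function name `parse_ezid_status` and the `"unknown"` fallback are assumptions, not part of the original script.

```python
from typing import Tuple


def parse_ezid_status(response_text: str) -> Tuple[str, str]:
    """Split the first response line into (status, detail) at the first colon."""
    lines = response_text.splitlines()
    first_line = lines[0] if lines else ""
    status, sep, detail = first_line.partition(":")
    if not sep:
        # No colon found: the line does not look like a status line.
        return "unknown", first_line.strip()
    return status.strip(), detail.strip()


print(parse_ezid_status("success: doi:10.7941/D1TP7N"))
```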


jsjiang commented 1 year ago

Note: the ezid-client-tools/batch-register3.py script can be used to reprocess the records without updating existing metadata.