ORCID / ORCID-Source

ORCID Open Source Project
https://orcid.org

Invalid URLs and handles break JSON-LD #6542

Open rdmpage opened 2 years ago

rdmpage commented 2 years ago

There are cases where ORCID URLs and Handles are not valid URIs, which breaks attempts to parse the JSON-LD as RDF. This happens in about 10-20 records in a sample of 5000 that I am working with. Not super common, but enough to break things.

URLs sometimes lack the http prefix, e.g. the personal page for https://orcid.org/0000-0003-1802-2649. This breaks the RDF, but also the ORCID web page: the personal page for Andrey I. Khalaim is given as https://orcid.org/www.zin.ru/labs/insects/hymenopt/personalia/khalaim/ instead of https://www.zin.ru/labs/insects/hymenopt/personalia/khalaim/

Ideally a simple regular expression to check users have actually input a URL would catch these.
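
As a rough sketch of the kind of input check suggested above (Python, standard library only): the function name `normalise_url` is hypothetical, and defaulting to `https://` when no scheme is given is an assumption, not something ORCID is known to do.

```python
from typing import Optional
from urllib.parse import urlsplit


def normalise_url(value: str) -> Optional[str]:
    """Return an absolute http(s) URL, or None if the input cannot be one."""
    candidate = value.strip()
    # Reject empty input and anything containing whitespace.
    if not candidate or any(ch.isspace() for ch in candidate):
        return None
    # Assumption: a value with no scheme (e.g. "www.example.org/page")
    # was meant as a web address, so a scheme is prepended.
    if "://" not in candidate:
        candidate = "https://" + candidate
    try:
        parts = urlsplit(candidate)
    except ValueError:
        return None
    if parts.scheme not in ("http", "https"):
        return None
    # Require a host name with at least one dot.
    host = parts.hostname or ""
    if "." not in host:
        return None
    return candidate
```

With this check, the scheme-less personal page above would be stored as an absolute URL rather than being resolved relative to orcid.org, and free text would be rejected at input time.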

For Handles there are some very bad examples at https://orcid.org/0000-0003-2573-1371 such as:

2018 | Dissertation/Thesis
SOURCE-WORK-ID: cv-prod-id-513032
HANDLE: Cecchetti, Arianna. "Effects of tourism operations on the bahavioural patterns of dolphin populations off the Azores with particular emphasis on the common dolphin (Delphinus delphis)". 2018. 112 p.. (Dissertação de Mestrado em Biologia). Ponta Delgada: U
HANDLE: http://hdl.handle.net/10400.3/4982
OTHER-ID: 101606494
CONTRIBUTORS: Cecchetti, Arianna

Note that the first Handle is http://hdl.handle.net/cecchetti,%20arianna.%20%22effects%20of%20tourism%20operations%20on%20the%20bahavioural%20patterns%20of%20dolphin%20populations%20off%20the%20azores%20with%20particular%20emphasis%20on%20the%20common%20dolphin%20(delphinus%20delphis)%22.%202018.%20112%20p..%20(disserta%C3%A7%C3%A3o%20de%20mestrado%20em%20biologia).%20ponta%20delgada:%20u

This is probably a trivial error in the user-supplied content, but ideally this would be caught on input. I realise that dealing with user-supplied content can be a bit of a nightmare.
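
A minimal sketch of what catching this on input could look like, in Python. The pattern is an assumption: it accepts only Handles of the form `<prefix>/<suffix>` with a numeric dotted prefix (like `10400.3`) and a suffix without whitespace, which is stricter than the Handle System itself requires. The name `handle_to_uri` is hypothetical.

```python
import re
from typing import Optional

# Assumption: numeric dotted prefix, a slash, then a suffix with no whitespace.
HANDLE_PATTERN = re.compile(r"^\d+(?:\.\d+)*/\S+$")
RESOLVER_PREFIXES = ("http://hdl.handle.net/", "https://hdl.handle.net/")


def handle_to_uri(value: str) -> Optional[str]:
    """Return a resolver URI for a plausible Handle, or None otherwise."""
    candidate = value.strip()
    # Accept either a bare Handle or one already wrapped in a resolver URL.
    for prefix in RESOLVER_PREFIXES:
        if candidate.lower().startswith(prefix):
            candidate = candidate[len(prefix):]
            break
    if not HANDLE_PATTERN.match(candidate):
        return None
    # The resolver form used in the record above.
    return "http://hdl.handle.net/" + candidate
```

Under these assumptions the second HANDLE value in the record above passes, while the citation string pasted into the first HANDLE field is rejected instead of being turned into a URI.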

rdmpage commented 2 years ago

Further examples: for 0000-0003-2861-949X we have DOIs that are broken, e.g.:

[Screenshot, 2022-10-08 10:19:21]

Note the | in the middle. These DOIs break any attempt to parse JSON-LD from orcid.org.

TomDemeranville commented 1 year ago

That example has sadly been added by a member, and we see this behaviour from several of our clients. We do normalise many of our identifiers in API 3.0, but don't do this for everything. This one has probably got past our parser because it has two DOIs in it. Argh.
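
For illustration only, a hedged Python sketch of how a value holding two DOIs joined by `|` might be split and checked. This is not ORCID's actual parser. The DOI pattern covers only the common `10.<registrant>/<suffix>` shape, treating `|` as a separator is an assumption (a DOI suffix could in principle contain one), and `split_dois` is a hypothetical name.

```python
import re
from typing import List

# Assumption: the common "10.<4-9 digits>/<suffix>" shape, no whitespace.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def split_dois(value: str) -> List[str]:
    """Split a field on '|' and return a doi.org URI for each plausible DOI."""
    dois = []
    for part in value.split("|"):
        candidate = part.strip()
        lowered = candidate.lower()
        # Strip a resolver or "doi:" prefix if one is present.
        for prefix in DOI_PREFIXES:
            if lowered.startswith(prefix):
                candidate = candidate[len(prefix):].strip()
                break
        if DOI_PATTERN.match(candidate):
            dois.append("https://doi.org/" + candidate)
    return dois
```

Called on a made-up value such as `"10.1234/example.1 | 10.1234/example.2"`, this yields two separate URIs, and parts that do not look like DOIs are dropped rather than exported.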

rdmpage commented 1 year ago

Further to the list of woes with ORCID JSON-LD, note that sameAs should be a list of one or more URIs, but ORCID often includes simple strings such as numbers. These are not valid RDF.

This can be slightly confusing, because in the JSON-LD output sameAs appears as a list of strings (e.g., "http://some.url"). But it is a list of URIs, not strings. If you look at the context at https://schema.org/docs/jsonldcontext.json you will see sameAs defined as:

"sameAs": {
  "@id": "schema:sameAs",
  "@type": "@id"
},

This may seem a small point, but it breaks any use of sameAs in SPARQL queries, because properly constructed queries expect the values of sameAs to be URIs, not literals.
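
One way a consumer (or the exporter) could guard against this is to drop any sameAs value that is not an absolute URI before the document is parsed as RDF. A minimal Python sketch follows. The function names are hypothetical, and accepting only http(s) URIs is a simplifying assumption, since other IRI schemes are also valid in RDF.

```python
from urllib.parse import urlsplit


def is_absolute_uri(value) -> bool:
    """True for a string that parses as an http(s) URI with a host."""
    if not isinstance(value, str) or any(ch.isspace() for ch in value):
        return False
    try:
        parts = urlsplit(value)
    except ValueError:
        return False
    # Assumption: only http and https URIs are accepted here.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def clean_same_as(document: dict) -> dict:
    """Return a copy of a JSON-LD object whose sameAs holds only URIs."""
    values = document.get("sameAs", [])
    # JSON-LD allows a single value in place of a list.
    if not isinstance(values, list):
        values = [values]
    cleaned = dict(document)
    cleaned["sameAs"] = [v for v in values if is_absolute_uri(v)]
    return cleaned
```

For a made-up object like `{"@type": "Person", "sameAs": ["http://some.url", "12345"]}`, only `"http://some.url"` survives, so every remaining value can safely be treated as a URI.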

It would be great if ORCID were to actually use the RDF it exports ("dog-fooding"), because if it did it would rapidly discover that its RDF output has problems. This is a pity because this is potentially a fabulous resource.