All work links to Open Library are broken due to wrong casing

alexshpilkin commented 1 year ago

Open Library IDs are case-insensitive in that their casing does not bear information, but the server requires the ID to be passed in uppercase: https://openlibrary.org/b/OL38581116M is a book, while https://openlibrary.org/b/ol38581116m is a 404. However, all (?) Open Library URLs that appear on ORCID web pages seem to be in lowercase, thus 404: see e.g. https://orcid.org/0000-0003-1199-7080.

This seems to happen because when the URL is generated from an ol-type work ID by the resolver service, it is passed (among other things) through org.orcid.core.utils.v3.identifiers.normalizers.CaseSensitiveNormalizer, which is under the impression that a case-insensitive identifier means the URL can be harmlessly lowercased:

https://github.com/ORCID/ORCID-Source/blob/70964ce41a373ef71b992f0acd6fba9030b351a2/orcid-core/src/main/java/org/orcid/core/utils/v3/identifiers/normalizers/CaseSensitiveNormalizer.java#L26-L30

Open Library has a different opinion of what the normal form of the ID (and therefore URL) should be.

As far as possible solutions are concerned, I see three options: make the case sensitivity flag a tristate (uppercase, lowercase, preserve); add a further normalization step that fixes the casing for OL links; or declare Open Library IDs to be case sensitive. All of these seem a bit meh. None of them deals with the fact that there are plenty of wrong URLs already stored in the database (they are stored, right?).
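To make the tristate option concrete, here is a minimal sketch in Java. The enum name CasePolicy and its apply method are hypothetical and do not exist in ORCID-Source; the sketch only illustrates replacing a boolean case-sensitivity flag with three explicit behaviours.

```java
import java.util.Locale;

// Hypothetical tri-state replacement for a boolean "case sensitive" flag.
// None of these names come from the ORCID-Source code base.
enum CasePolicy {
    UPPER,    // identifier type whose canonical form is uppercase
    LOWER,    // identifier type whose canonical form is lowercase
    PRESERVE; // leave the stored casing untouched

    String apply(String value) {
        switch (this) {
            case UPPER:
                // Locale.ROOT avoids locale-specific surprises (e.g. the Turkish dotless i).
                return value.toUpperCase(Locale.ROOT);
            case LOWER:
                return value.toLowerCase(Locale.ROOT);
            default:
                return value;
        }
    }
}
```

Under this sketch an ol-type identifier would be configured with UPPER (or PRESERVE, if the stored casing is to be trusted), while identifier types that are currently lowercased would keep LOWER.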

wjrsimpson commented 1 year ago

Thanks for your thoughts @alexshpilkin. I am struggling to find the specification for Open Library IDs. Do you have a link?

alexshpilkin commented 1 year ago

@wjrsimpson That’s a fair question. While you can tell that there are broken Open Library links at the ORCID website and that the Open Library website as it is now is case sensitive by simply poking at them, I don’t actually know that Open Library IDs are supposed to be case-insensitive; that was just me trusting your implementation and my own experience. So maybe I shouldn’t have said that with such confidence.

I’ve looked around the OL developer and librarian docs and, surprisingly, I can’t find much about how OLIDs are supposed to work. The best I’ve seen is a brief mention in the “Understanding Identifiers” section of the librarians-in-training guide.

There are some schemas and schema-adjacent things in the Open Library code, though. First, the official client library has a JSON schema for the API, which contains a (case-sensitive) regular expression for work_key, ^/works/OL[0-9]+W$, and similar ones for author_key (with A) and edition_key (with M). Second, the database schema for the backend contains code implying that an OLID in general is OL[0-9]+[A-Z].

Finally, the Wikidata definition for this identifier type says OL[1-9]\d*[AMW], but doesn’t link to an official reference either.

That’s all I could find, unfortunately, but if you want an official word on this I guess asking the Open Library maintainers is also an option.
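If the patterns above are taken at face value, a lowercased OLID can be recovered mechanically, because uppercasing loses nothing when the canonical form is all uppercase. The following Java sketch assumes the Wikidata pattern quoted above is a fair description of valid OLIDs, which is not officially confirmed; the class and method names are hypothetical.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper; not part of ORCID-Source or any Open Library client.
final class OlidSketch {
    // Pattern quoted in this thread from the Wikidata property definition.
    // Assumption: it describes every valid OLID (authors, editions, works).
    private static final Pattern OLID = Pattern.compile("OL[1-9]\\d*[AMW]");

    private OlidSketch() {}

    // Uppercases the input and returns it only if it then looks like an OLID.
    static Optional<String> canonicalize(String raw) {
        String candidate = raw.trim().toUpperCase(Locale.ROOT);
        return OLID.matcher(candidate).matches()
                ? Optional.of(candidate)
                : Optional.empty();
    }
}
```

For example, OlidSketch.canonicalize("ol38581116m") yields OL38581116M, while a string that does not fit the pattern yields an empty Optional and could be left untouched rather than guessed at.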

wjrsimpson commented 1 year ago

@alexshpilkin Thanks for the additional info.

@TomDemeranville Do you happen to know?

TomDemeranville commented 1 year ago

I think we can pretty easily fix this to not alter the case we have in the database for OL identifiers. We still won't be able to guarantee they're correct, but we will be able to preserve the case.

I've raised a bug here: https://trello.com/c/z3efGnyn/781-preserve-case-when-normalising-open-library-identifiers

alexshpilkin commented 1 year ago

@TomDemeranville From my outside point of view that sounds like a boring but workable solution. One thing I’m concerned about is existing links: am I right that they are persisted in the database as normalized (currently lowercased) links? If so, do you plan to fix up the links stored before the normalization is fixed?

ORCID / ORCID-Source

All work links to Open Library are broken due to wrong casing #6739