caltechlibrary / irdmtools

A Go and Python package for working with InvenioRDM repositories.
https://caltechlibrary.github.io/irdmtools

doi2rdm: Detect and re-encode escaped characters #54

Closed tmorrell closed 11 months ago

tmorrell commented 11 months ago

Some publishers put escaped characters in their metadata (for example, short-container-title in https://api.crossref.org/works/10.1214/10-AIHP373). We should detect these and convert them to UTF-8.

rsdoiel commented 11 months ago

This has been cropping up more and more. Sorta like the old double encoding of HTML's encoded ampersands and things. Probably want to solve this in the crossrefapi package.

rsdoiel commented 11 months ago

I was looking at this double encoding issue. I wonder if they UTF-8 code point encoded just '<' and '>' or if they did more UTF-8 characters. If they did just the angle brackets, that might have been done internally in their systems to deal with issues of moving between XML and HTML entities. In the record with DOI 10.1214/10-AIHP373 there are other characters I would expect to be encoded (e.g. the accented e in "Annales de l'Institut Henri Poincaré, Probabilités et Statistiques").

On the other hand, they could be JSON encoding with some other character encoding besides UTF-8 (e.g. an extended Windows ASCII) rather than conforming to the JSON spec, which specifies UTF-8 encoding explicitly. Need more data.

In this reference line you see a good example of characters that aren't UTF-8 code point encoded but could be (the ö in "König", the en dash in the page range), while the angle brackets are code point encoded.

"[14] W. König. Orthogonal polynomial ensembles in probability theory. \u003ci\u003eProbab. Surv.\u003c/i\u003e \u003cb\u003e2\u003c/b\u003e (2005) 385–447."
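For what it's worth, \u003c and \u003e are ordinary JSON \uXXXX escapes for '<' and '>', so a conforming JSON decoder should hand back plain angle brackets. Go's own encoding/json emits exactly these escapes for '<', '>' and '&' by default while leaving characters like ö alone, which matches the pattern above, though whether the upstream system does this for the same reason is only a guess. A minimal sketch, assuming the sequences arrive as single-backslash JSON escapes (if they were double-escaped, the decoder would leave the literal text \u003c in place); decodeJSONString is a hypothetical helper name, not something from irdmtools or crossrefapi:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeJSONString is a hypothetical helper (not part of irdmtools).
// It parses a JSON string literal, including its surrounding double
// quotes, and returns the decoded Go string.
func decodeJSONString(literal string) (string, error) {
	var s string
	if err := json.Unmarshal([]byte(literal), &s); err != nil {
		return "", err
	}
	return s, nil
}

func main() {
	// Raw JSON text as it would appear on the wire, abbreviated from the
	// reference line quoted above. The backticks keep the backslashes
	// literal, so Go itself does not interpret the \u003c sequences.
	raw := `"\u003ci\u003eProbab. Surv.\u003c/i\u003e \u003cb\u003e2\u003c/b\u003e (2005) 385–447."`

	s, err := decodeJSONString(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	// The escapes are resolved by the decoder; the en dash was never
	// escaped and passes through unchanged.
	fmt.Println(s)
}
```

If the decoded value already contains '<i>' and '</i>', then the escapes are only a wire-level detail and the thing to watch is what happens when the value is written back out.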

rsdoiel commented 11 months ago

This is fixed in release v0.0.53 with the normalization that comes from using custom JSON encoders/decoders.