gbv / cocoda

A web-based tool for creating mappings between knowledge organization systems.
https://coli-conc.gbv.de/cocoda/
MIT License

Apply Unicode normalization to avoid trouble with Umlauts #403

Closed by annakasprzik 5 years ago

annakasprzik commented 5 years ago

Search terms containing Umlauts are truncated at the base letter ("a", "o", or "u"), see screenshot Antipa

annakasprzik commented 5 years ago

Also, search terms with Umlaut are not highlighted when there is an exact string match.

stefandesu commented 5 years ago

After further investigation, I've found that the Umlauts the GND API returns are in decomposed form. A precomposed ä (a single code point, U+00E4) is percent-encoded as %C3%A4, while the ä from GND is encoded as a%CC%88, that is, a plain a followed by a separate combining diaeresis (U+0308). That's also why the term is cut off at the a: the letter is a separate character from the dots, even though the two are displayed together as ä.

I'll think about a way to deal with this, probably some kind of Unicode normalization.
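
To make the difference between the two encodings concrete, here is a minimal Python sketch (not Cocoda code; the variable names are only illustrative). It percent-encodes both forms of ä and shows that NFC normalization turns the decomposed form into the precomposed one.

```python
import unicodedata
from urllib.parse import quote

precomposed = "\u00e4"   # ä as a single code point (U+00E4)
decomposed = "a\u0308"   # plain a followed by COMBINING DIAERESIS (U+0308)

# Percent-encoding of the UTF-8 bytes reveals the difference.
print(quote(precomposed))   # %C3%A4
print(quote(decomposed))    # a%CC%88

# The strings look identical on screen but do not compare equal...
print(precomposed == decomposed)   # False

# ...until the decomposed form is normalized to NFC.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```

Because string comparison works on code points, the same mismatch would also explain why an exact match is not highlighted.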

nichtich commented 5 years ago

All incoming JSON data should be normalized to NFC, as specified in the JSKOS spec.
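
One way to apply this could look like the following Python sketch. It is an illustration under assumptions, not the fix that was actually implemented: the helper name `normalize_nfc` and the sample record are hypothetical. It walks the parsed JSON and applies NFC to every string, including object keys. Normalizing after parsing, rather than normalizing the raw text, also covers characters written as `\u` escapes.

```python
import json
import unicodedata


def normalize_nfc(value):
    """Recursively apply NFC to every string (keys included) in parsed JSON."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, list):
        return [normalize_nfc(item) for item in value]
    if isinstance(value, dict):
        return {normalize_nfc(k): normalize_nfc(v) for k, v in value.items()}
    # Numbers, booleans and null pass through unchanged.
    return value


# Hypothetical incoming record with a decomposed ä (a + U+0308) in the label.
raw = '{"prefLabel": {"de": "Ma\\u0308rchen"}}'
concept = normalize_nfc(json.loads(raw))

# The label now uses the precomposed ä (U+00E4).
print(concept["prefLabel"]["de"] == "M\u00e4rchen")   # True
```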