acdh-oeaw / arche-core

MIT License

Provide unicode-invariant search #7

Open zozlak opened 4 years ago

zozlak commented 4 years ago

Currently arche-core stores strings exactly as they are provided. It would probably be better to normalize them first (to any Unicode normalization form) to make searches more reliable.
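
To illustrate the problem, here is a minimal Python sketch (not arche-core code; the `normalize` helper is hypothetical and NFC is picked arbitrarily, since any form would do). The same visible character can be encoded in two ways, so a plain comparison fails unless both sides are normalized:

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent


def normalize(value: str, form: str = "NFC") -> str:
    """Hypothetical helper: bring a string to one Unicode normalization form."""
    return unicodedata.normalize(form, value)


# The raw strings look identical but compare as different.
print(composed == decomposed)                        # False
# After normalizing both sides, the comparison succeeds.
print(normalize(composed) == normalize(decomposed))  # True
```

Applying the same normalization both when storing a value and when preparing a search term would make the stored and the searched forms agree.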

Using NFKD might be the most desirable choice, as it would allow an easy implementation of searches that are invariant to language-specific characters: by applying NFKD and then dropping all characters with code points > 127 we get a pure ASCII equivalent of the string (although this only works for alphabets close to English). Of course, this should be considered a separate feature.
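
A rough Python sketch of that idea, assuming a hypothetical `ascii_fold` helper (again, not something arche-core provides):

```python
import unicodedata


def ascii_fold(value: str) -> str:
    """Hypothetical helper: decompose with NFKD, then drop every non-ASCII character."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if ord(ch) <= 127)


print(ascii_fold("Österreich"))  # Osterreich
print(ascii_fold("Łódź"))        # odz
```

The second call shows the limitation mentioned above: `Ł` has no Unicode decomposition, so it is dropped entirely instead of becoming `L`.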

https://redmine.acdh.oeaw.ac.at/issues/16809