unicode normalization

Issue raised by @benc. May want to support Unicode normalization for logical paths and object ids. For example, ṩ could be represented as \u1E69, \u0073\u0323\u0307, or \u0073\u0307\u0323. Without normalization, it would be difficult to find an object or file that uses such a character without knowing which representation was used.
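To make the problem concrete, here is a small sketch using `java.text.Normalizer` (linked below). The class and method names are made up for illustration, and NFC is only one possible choice of form; the issue does not settle on one.

```java
import java.text.Normalizer;

class NormalizationDemo {
    // The three canonically equivalent spellings of "ṩ" from the issue text.
    static final String PRECOMPOSED = "\u1E69";
    static final String DOT_BELOW_FIRST = "\u0073\u0323\u0307";
    static final String DOT_ABOVE_FIRST = "\u0073\u0307\u0323";

    // Assumption: NFC is used as the comparison form; NFD would work equally well.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // A plain String comparison sees different code point sequences.
        System.out.println(PRECOMPOSED.equals(DOT_BELOW_FIRST)); // false
        System.out.println(PRECOMPOSED.equals(DOT_ABOVE_FIRST)); // false

        // After normalization all three spellings compare equal.
        System.out.println(nfc(PRECOMPOSED).equals(nfc(DOT_BELOW_FIRST))); // true
        System.out.println(nfc(PRECOMPOSED).equals(nfc(DOT_ABOVE_FIRST))); // true
    }
}
```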

For logical paths, strings could be normalized for comparison purposes only and written to the inventory in the same form as they were received.
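One way this could look, as a rough sketch only: keep a lookup keyed by the normalized form while storing the path exactly as received. `LogicalPathIndex` and its methods are hypothetical and not part of ocfl-java; NFC is assumed as the comparison form.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class LogicalPathIndex {
    // Key: NFC form, used only for comparison. Value: the path as it was received.
    private final Map<String, String> byNormalized = new HashMap<>();

    private static String key(String path) {
        return Normalizer.normalize(path, Normalizer.Form.NFC);
    }

    /** Records a path; returns false if a canonically equivalent path already exists. */
    boolean add(String path) {
        return byNormalized.putIfAbsent(key(path), path) == null;
    }

    /** Returns the stored spelling of a path, whichever representation the caller used. */
    Optional<String> find(String path) {
        return Optional.ofNullable(byNormalized.get(key(path)));
    }

    public static void main(String[] args) {
        LogicalPathIndex index = new LogicalPathIndex();
        String received = "dir/\u0073\u0307\u0323.txt"; // decomposed spelling
        index.add(received);

        // Looking up with the precomposed spelling still finds the entry,
        // and the value that would go into the inventory is unchanged.
        Optional<String> hit = index.find("dir/\u1E69.txt");
        System.out.println(hit.isPresent());                          // true
        System.out.println(hit.map(received::equals).orElse(false));  // true
    }
}
```

A side effect of this approach is that two canonically equivalent paths can no longer coexist in the same version, which `add` reports by returning false.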

Object ids would need to be normalized as part of the storage layout.
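As an illustration of what "part of the storage layout" might mean, the sketch below normalizes the id before deriving a storage path from it. The hashed, two-level directory scheme and the name `NormalizedIdLayout` are assumptions made for this example, not the layout ocfl-java actually uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.Normalizer;

class NormalizedIdLayout {
    /** Maps an object id to a storage path after normalizing it to NFC (assumed form). */
    static String storagePath(String objectId) {
        String normalized = Normalizer.normalize(objectId, Normalizer.Form.NFC);
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            // Hypothetical layout: two 2-character directory levels, then the full digest.
            return hex.substring(0, 2) + "/" + hex.substring(2, 4) + "/" + hex;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Equivalent ids resolve to the same location regardless of representation.
        String a = storagePath("obj-\u1E69");
        String b = storagePath("obj-\u0073\u0307\u0323");
        System.out.println(a.equals(b)); // true
    }
}
```

Without the normalization step, the two ids above would hash to different directories and be treated as separate objects.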

It's unclear how much of a problem this is in practice. A user who consistently uses the same representation for their strings will not run into it.

https://www.unicode.org/reports/tr15/
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/text/Normalizer.html

OCFL / ocfl-java

unicode normalization #25