OCFL / ocfl-java

A Java OCFL implementation
MIT License

unicode normalization #25

Closed pwinckles closed 3 years ago

pwinckles commented 3 years ago

Issue raised by @benc. We may want to support Unicode normalization for logical paths and object ids. For example, ṩ could be represented as \u1E69, \u0073\u0323\u0307, or \u0073\u0307\u0323. Without normalization, it would be difficult to find an object or file that uses such a character without knowing which representation was used.
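To make the problem concrete, here is a minimal sketch using only `java.text.Normalizer` from the JDK. The class and method names are made up for illustration and are not part of ocfl-java. It shows that the three representations differ as raw strings but compare equal once each is normalized to NFC.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of ocfl-java.
final class NormalizationDemo {
    // Three encodings of the same visible character, taken from the issue text.
    static final String PRECOMPOSED = "\u1E69";
    static final String DOT_BELOW_FIRST = "\u0073\u0323\u0307";
    static final String DOT_ABOVE_FIRST = "\u0073\u0307\u0323";

    // Normalize to NFC (canonical decomposition, then canonical composition).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Raw comparison: the strings hold different code point sequences.
        System.out.println(PRECOMPOSED.equals(DOT_BELOW_FIRST)); // prints false
        // After NFC normalization, all three collapse to the same string.
        System.out.println(nfc(PRECOMPOSED).equals(nfc(DOT_BELOW_FIRST))); // prints true
        System.out.println(nfc(PRECOMPOSED).equals(nfc(DOT_ABOVE_FIRST))); // prints true
    }
}
```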

For logical paths, strings could be normalized for comparison purposes only and written to the inventory in the same form as they were received.
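One way the compare-only idea could look, as a rough sketch: keep an index keyed by the NFC form while storing each path exactly as it was received. `LogicalPathIndex` is a hypothetical helper invented for this example. It does not reflect how ocfl-java tracks logical paths internally.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of ocfl-java.
final class LogicalPathIndex {
    // Key: NFC form, used only for lookup. Value: the path exactly as received.
    private final Map<String, String> byNormalized = new HashMap<>();

    private static String key(String path) {
        return Normalizer.normalize(path, Normalizer.Form.NFC);
    }

    // Records a path; returns false if a canonically equivalent path already exists.
    boolean add(String path) {
        return byNormalized.putIfAbsent(key(path), path) == null;
    }

    // Returns the originally received form of an equivalent path, or null if absent.
    String find(String path) {
        return byNormalized.get(key(path));
    }
}
```

With this shape, a lookup for `dir/\u1E69.txt` would find a file that was added as `dir/\u0073\u0323\u0307.txt`, and the value returned is the untouched original, which is what would be written to the inventory.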

Object ids would need to be normalized as part of the storage layout.
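To illustrate why the storage layout is affected, the sketch below assumes a layout that derives the object's storage path from a SHA-256 digest of its id. That assumption and the `ObjectIdDigest` class are for illustration only. Equivalent ids in different representations produce different digests, and so different paths, unless the id is normalized before hashing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.Normalizer;

// Hypothetical helper, not part of ocfl-java.
final class ObjectIdDigest {
    // Hex-encoded SHA-256 of the UTF-8 bytes of the given string.
    static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Normalize the id to NFC first so equivalent ids map to the same digest.
    static String normalizedDigest(String objectId) {
        return sha256Hex(Normalizer.normalize(objectId, Normalizer.Form.NFC));
    }
}
```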

It's unclear how much of an issue this is. If a user consistently encodes their strings, they will not experience any issues.

https://www.unicode.org/reports/tr15/
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/text/Normalizer.html

pwinckles commented 3 years ago

For now, ocfl-java is not going to apply Unicode normalization to strings. If users need normalization, they should apply it before passing the strings to ocfl-java.
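For callers who take that route, a small guard at the application boundary might be enough. The sketch below is only a suggestion: `CallerSideNormalizer` is a hypothetical name, and NFC is an assumed choice of form. It returns the input unchanged when it is already normalized, so consistently encoded strings are not copied.

```java
import java.text.Normalizer;

// Hypothetical caller-side helper, not part of ocfl-java.
final class CallerSideNormalizer {
    // Returns the NFC form of s, or s itself if it is already in NFC.
    static String toNfc(String s) {
        if (Normalizer.isNormalized(s, Normalizer.Form.NFC)) {
            return s;
        }
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }
}
```

Every object id and logical path would then pass through `toNfc` before being handed to the library, so the same form is used when writing and when looking things up.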