nibe closed this issue 5 years ago
Nice contribution, thanks! I wasn't familiar with normalization prior to looking it up just now, but this seems like a straightforward addition.
...Thinking about this more, I wonder if this functionality really belongs in a library about characters. It seems to be more about Unicode and internationalization, which is related to, but not exactly, the purpose of this library. Technically, `Normalizer.normalize` takes a string and returns a string, which also makes me think maybe this isn't the best fit to include in `djy.char`.
You can normalize individual characters, which allows for functions like `(equivalent c1 c2)`. The string variant is more useful, though.
I'm using `normalize` prior to `char-seq` to deal with equivalent strings:
```clojure
(def s1 "à la carte")
(def s2 "à la carte")

(char/char-seq s1) ;; => (\à \space \l \a \space \c \a \r \t \e)
(char/char-seq s2) ;; => (\a \̀ \space \l \a \space \c \a \r \t \e)
(char/char-seq (char/normalize s2 :nfc)) ;; => (\à \space \l \a \space \c \a \r \t \e)
```
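For reference, normalization on the JVM bottoms out in the JDK's `java.text.Normalizer`, which I assume is what the library wraps. A minimal Java sketch of the same equivalence (the string literals here just mirror the example above):

```java
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String composed   = "\u00E0 la carte";  // "à la carte" with precomposed à (U+00E0)
        String decomposed = "a\u0300 la carte"; // 'a' followed by COMBINING GRAVE ACCENT (U+0300)

        // Canonically equivalent, but not equal as raw char sequences.
        System.out.println(composed.equals(decomposed)); // false

        // NFC composes 'a' + U+0300 back into U+00E0, so the strings compare equal.
        String recomposed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(recomposed.equals(composed)); // true
    }
}
```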
I agree that `normalize` might be a better fit for a Unicode string library.
I don't see any existing Unicode utility libraries for Clojure, so perhaps you could be the one to write the library! :)
If you don't mind, I'm going to revert this contribution. I appreciate the PR, nonetheless! It was a good opportunity for me to learn about normalization.
No problem, go ahead. I've created a local library for now. A wrapper for ICU4J would be nice but that goes way beyond my needs.
Here's something related. Do you consider this a bug?
```clojure
;; NFD:
(char/upper-case "â") ;; => \A
;; NFC:
(char/upper-case "â") ;; => \Â
```
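A hedged guess at what's happening under the hood, sketched in plain Java (assuming the char-level upper-case ultimately calls `Character.toUpperCase` on the string's first char):

```java
import java.text.Normalizer;

public class UpperCaseDemo {
    public static void main(String[] args) {
        String nfd = Normalizer.normalize("\u00E2", Normalizer.Form.NFD); // "a" + U+0302, two chars
        String nfc = Normalizer.normalize("\u00E2", Normalizer.Form.NFC); // single char "â" (U+00E2)

        // Under NFD the first char is plain 'a', so upper-casing it yields 'A';
        // under NFC the first char is 'â', whose upper-case is 'Â'.
        System.out.println(Character.toUpperCase(nfd.charAt(0))); // A
        System.out.println(Character.toUpperCase(nfc.charAt(0))); // Â
    }
}
```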
Hmm, that's an interesting question. I guess it comes down to whether, for the NFD version, `code-point-of` should return the code point for `a` or `â`.
Right now, the implementation of `code-point-of` for a String is `(.codePointAt s 0)`, but there might be a way to make that logic smarter, and recognize when multiple code points compose together to form a character with a different logical code point.
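To make the two candidate behaviors concrete, here's a small Java sketch of `codePointAt` on the composed and decomposed forms, plus one possible "smarter" variant that NFC-normalizes before reading the code point (the normalize-first approach is only an illustration, not the library's actual implementation):

```java
import java.text.Normalizer;

public class CodePointDemo {
    public static void main(String[] args) {
        String composed   = "\u00E0";  // à as a single code point
        String decomposed = "a\u0300"; // 'a' + combining grave accent, two code points

        System.out.println(composed.codePointAt(0));   // 224 (U+00E0)
        System.out.println(decomposed.codePointAt(0)); // 97  (U+0061, just 'a')

        // A possible smarter behavior: compose first, then read the code point.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.codePointAt(0));        // 224 (U+00E0)
    }
}
```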
I'm leaning towards considering it a bug, and specifying the behavior of `code-point-of` to be that it composes sequences like `a` followed by a combining grave accent and returns the code point of the composed character, i.e. `à`.
I haven't fully thought through the implications of this, but it feels like probably what most users of this library would expect. What are your thoughts?
Normalization using NFC would fix it, but then the code point could change to that of an equivalent character, which would be confusing as well:

```clojure
(char/code-point-of (char/normalize (char/char' 8486) :nfc)) ;; => 937
```
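Concretely, U+2126 OHM SIGN (8486) is a canonical singleton: NFC maps it to U+03A9 GREEK CAPITAL LETTER OMEGA (937), so the code point changes even though the two characters are canonically equivalent. A Java sketch of the same effect:

```java
import java.text.Normalizer;

public class OhmDemo {
    public static void main(String[] args) {
        String ohm = "\u2126"; // Ω OHM SIGN, code point 8486

        // NFC replaces the singleton OHM SIGN with GREEK CAPITAL LETTER OMEGA.
        String nfc = Normalizer.normalize(ohm, Normalizer.Form.NFC);

        System.out.println(ohm.codePointAt(0)); // 8486 (U+2126)
        System.out.println(nfc.codePointAt(0)); // 937  (U+03A9)
    }
}
```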
I guess we can leave it as it is. It's just something to be aware of and mentioning it in the description might help.
I agree, it would be good to make sure this is especially clear.
If it helps at all, we do mention this in the namespace docstring, but perhaps it wouldn't hurt to add a similar call-out in the `code-point-of` docstring?
I forgot about that part of the namespace docstring. It's good enough I think.
Thanks for creating and still maintaining this library!
`(normalize s form)` normalizes the given character or string using the normalization form `:nfc`, `:nfkc`, `:nfd`, or `:nfkd`, which helps in dealing with equivalent code points.
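For context, a minimal Java sketch of how such a wrapper might map those keyword-style form names onto `java.text.Normalizer.Form` (this `normalize` helper and its form-name handling are my assumptions, not the library's actual code):

```java
import java.text.Normalizer;

public class NormalizeWrapper {
    // Hypothetical helper: accepts "nfc", "nfkc", "nfd", or "nfkd".
    static String normalize(String s, String form) {
        // Normalizer.Form's enum constants are NFC, NFKC, NFD, NFKD.
        Normalizer.Form f = Normalizer.Form.valueOf(form.toUpperCase());
        return Normalizer.normalize(s, f);
    }

    public static void main(String[] args) {
        System.out.println(normalize("a\u0300", "nfc").codePointAt(0)); // 224 (U+00E0)
        System.out.println(normalize("\u00E0", "nfd").length());       // 2
    }
}
```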