nibe closed this issue 5 years ago
Nice contribution, thanks! I wasn't familiar with normalization prior to looking it up just now, but this seems like a straightforward addition.
...Thinking about this more, I wonder if this functionality really belongs in a library about characters. It seems to be more about Unicode and internationalization, which is related to, but not exactly, the purpose of this library. Technically, `Normalizer.normalize` takes a string and returns a string, which also makes me think maybe this isn't the best fit to include in `djy.char`.
You can normalize individual characters, which allows for functions like `(equivalent c1 c2)`. The string variant is more useful, though.
I'm using `normalize` prior to `char-seq` to deal with equivalent strings:
```clojure
(def s1 "à la carte")
(def s2 "à la carte")

(char/char-seq s1) ;; => (\à \space \l \a \space \c \a \r \t \e)
(char/char-seq s2) ;; => (\a \̀ \space \l \a \space \c \a \r \t \e)
(char/char-seq (char/normalize s2 :nfc)) ;; => (\à \space \l \a \space \c \a \r \t \e)
```
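For reference, normalization on the JVM bottoms out in the JDK's `java.text.Normalizer`, which I assume is what the library wraps. A minimal Java sketch of the same equivalence (the string literals here just mirror the example above):

```java
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String composed   = "\u00E0 la carte";  // "à la carte" with precomposed à (U+00E0)
        String decomposed = "a\u0300 la carte"; // 'a' followed by COMBINING GRAVE ACCENT (U+0300)

        // Canonically equivalent, but not equal as raw char sequences.
        System.out.println(composed.equals(decomposed)); // false

        // NFC composes 'a' + U+0300 back into U+00E0, so the strings compare equal.
        String recomposed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(recomposed.equals(composed)); // true
    }
}
```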
I agree that `normalize` might be a better fit for a Unicode string library.
I don't see any existing Unicode utility libraries for Clojure, so perhaps you could be the one to write the library! :)
If you don't mind, I'm going to revert this contribution. I appreciate the PR, nonetheless! It was a good opportunity for me to learn about normalization.
No problem, go ahead. I've created a local library for now. A wrapper for ICU4J would be nice but that goes way beyond my needs.
Here's something related. Do you consider this a bug?
```clojure
;; NFD:
(char/upper-case "â") ;; => \A
;; NFC:
(char/upper-case "â") ;; => \Â
```
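A hedged guess at what's happening under the hood, sketched in plain Java (assuming the char-level upper-case ultimately calls `Character.toUpperCase` on the string's first char):

```java
import java.text.Normalizer;

public class UpperCaseDemo {
    public static void main(String[] args) {
        String nfd = Normalizer.normalize("\u00E2", Normalizer.Form.NFD); // "a" + U+0302, two chars
        String nfc = Normalizer.normalize("\u00E2", Normalizer.Form.NFC); // single char "â" (U+00E2)

        // Under NFD the first char is plain 'a', so upper-casing it yields 'A';
        // under NFC the first char is 'â', whose upper-case is 'Â'.
        System.out.println(Character.toUpperCase(nfd.charAt(0))); // A
        System.out.println(Character.toUpperCase(nfc.charAt(0))); // Â
    }
}
```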
Hmm, that's an interesting question. I guess it comes down to whether, for the NFD version, `code-point-of` should return the code point for `a` or `â`.
Right now, the implementation of `code-point-of` for a String is `(.codePointAt s 0)`, but there might be a way to make that logic smarter, and recognize when multiple code points compose together to form a character with a different logical code point.
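To make the two candidate behaviors concrete, here's a small Java sketch of `codePointAt` on the composed and decomposed forms, plus one possible "smarter" variant that NFC-normalizes before reading the code point (the normalize-first approach is only an illustration, not the library's actual implementation):

```java
import java.text.Normalizer;

public class CodePointDemo {
    public static void main(String[] args) {
        String composed   = "\u00E0";  // à as a single code point
        String decomposed = "a\u0300"; // 'a' + combining grave accent, two code points

        System.out.println(composed.codePointAt(0));   // 224 (U+00E0)
        System.out.println(decomposed.codePointAt(0)); // 97  (U+0061, just 'a')

        // A possible smarter behavior: compose first, then read the code point.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.codePointAt(0));        // 224 (U+00E0)
    }
}
```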
I'm leaning towards considering it a bug, and specifying the behavior of `code-point-of` to be that it composes sequences like `a` followed by a combining grave accent and returns the code point of the composed character, i.e. `à`.
I haven't fully thought through the implications of this, but it feels like probably what most users of this library would expect. What are your thoughts?
Normalization using NFC would fix it, but then the code point could change to that of an equivalent character, which would be confusing as well:

```clojure
(char/code-point-of (char/normalize (char/char' 8486) :nfc)) ;; => 937
```
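Concretely, U+2126 OHM SIGN (8486) is a canonical singleton: NFC maps it to U+03A9 GREEK CAPITAL LETTER OMEGA (937), so the code point changes even though the two characters are canonically equivalent. A Java sketch of the same effect:

```java
import java.text.Normalizer;

public class OhmDemo {
    public static void main(String[] args) {
        String ohm = "\u2126"; // Ω OHM SIGN, code point 8486

        // NFC replaces the singleton OHM SIGN with GREEK CAPITAL LETTER OMEGA.
        String nfc = Normalizer.normalize(ohm, Normalizer.Form.NFC);

        System.out.println(ohm.codePointAt(0)); // 8486 (U+2126)
        System.out.println(nfc.codePointAt(0)); // 937  (U+03A9)
    }
}
```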
I guess we can leave it as it is. It's just something to be aware of and mentioning it in the description might help.
I agree, it would be good to make sure this is especially clear.
If it helps at all, we do mention this in the namespace docstring, but perhaps it wouldn't hurt to add a similar call-out in the `code-point-of` docstring?
I forgot about that part of the namespace docstring. It's good enough I think.
Thanks for creating and still maintaining this library!
`(normalize s form)` normalizes the given character or string using the normalization form `:nfc`, `:nfkc`, `:nfd`, or `:nfkd`, which helps in dealing with equivalent code points.
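For context, a minimal Java sketch of how such a wrapper might map those keyword-style form names onto `java.text.Normalizer.Form` (this `normalize` helper and its form-name handling are my assumptions, not the library's actual code):

```java
import java.text.Normalizer;

public class NormalizeWrapper {
    // Hypothetical helper: accepts "nfc", "nfkc", "nfd", or "nfkd".
    static String normalize(String s, String form) {
        // Normalizer.Form's enum constants are NFC, NFKC, NFD, NFKD.
        Normalizer.Form f = Normalizer.Form.valueOf(form.toUpperCase());
        return Normalizer.normalize(s, f);
    }

    public static void main(String[] args) {
        System.out.println(normalize("a\u0300", "nfc").codePointAt(0)); // 224 (U+00E0)
        System.out.println(normalize("\u00E0", "nfd").length());       // 2
    }
}
```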