LinguList closed 7 years ago
cool, thanks a lot!
Looks like this will be:
NFC, NFKC, NFD, and NFKD
as per Python unicodedata
Yeah, that should probably be checked, i.e.:
assert form is None or form in ['NFC', 'NFKC', ...]
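For illustration, a minimal sketch of how such a check could sit in front of the standard library call. The helper name `normalize` and the constant `VALID_FORMS` are hypothetical and not part of the tokenizer; only `unicodedata.normalize` is a real API, and the four form names are the ones it accepts.

```python
import unicodedata

# The four forms accepted by unicodedata.normalize.
VALID_FORMS = ("NFC", "NFKC", "NFD", "NFKD")


def normalize(text, form=None):
    """Hypothetical helper: return text unchanged if form is None,
    otherwise normalize it to the requested Unicode form."""
    assert form is None or form in VALID_FORMS, (
        "unknown normalization form: %r" % (form,)
    )
    if form is None:
        return text
    return unicodedata.normalize(form, text)
```

With this, `normalize("e\u0301", "NFC")` composes the base letter and combining accent into a single code point, while `form=None` leaves the input untouched.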
ok, i'll write a test
I can easily write a wrapper to get NFC-normalized output, which is useful for my purposes since LingPy uses NFC. But I think it would not be difficult to include this in the tokenizer itself: when initializing the instance, one could pass a keyword specifying the normalization, or one could allow a keyword on t.transform that controls its output. Or would you prefer that I handle it on my side?
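To make the two options concrete, here is a rough sketch of the wrapper variant. Everything in it is an assumption for illustration: the class `NormalizingTokenizer`, its `form` keyword, and the `tokenize` callable are hypothetical stand-ins, not the tokenizer's actual API. It accepts a default form at initialization and lets a per-call keyword on `transform` override it.

```python
import unicodedata


class NormalizingTokenizer:
    """Hypothetical wrapper: applies Unicode normalization to each token
    produced by an arbitrary tokenize callable."""

    def __init__(self, tokenize, form=None):
        self.tokenize = tokenize  # any callable: str -> list of str
        self.form = form          # default form, e.g. "NFC", or None

    def transform(self, text, form=None):
        # A per-call keyword takes precedence over the instance default.
        form = form if form is not None else self.form
        tokens = self.tokenize(text)
        if form is None:
            return tokens
        return [unicodedata.normalize(form, token) for token in tokens]


# str.split stands in for a real tokenizer here.
t = NormalizingTokenizer(str.split, form="NFC")
tokens = t.transform("e\u0301 a")
```

Here `tokens` holds the composed form of the first token, and `t.transform("e\u0301 a", form="NFD")` would return the decomposed form instead.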