cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Allow for NFC normalization in output #2

Closed LinguList closed 7 years ago

LinguList commented 7 years ago

I can easily write a wrapper to get NFC normalization in the output, as this is useful for my purposes and LingPy uses NFC. But I suggest it would not be difficult to include this in the tokenizer itself: when initializing the instance, one could pass a keyword specifying the normalization, or one could allow for a keyword controlling the output of t.transform. Or would you prefer me to handle it on my side?
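A wrapper along these lines would only be a few lines; a minimal sketch, assuming the tokenizer is called via t.transform (the wrapper name here is made up):

```python
import unicodedata

from segments.tokenizer import Tokenizer

t = Tokenizer()

def transform_nfc(string, *args, **kwargs):
    # Hypothetical wrapper: tokenize as usual, then NFC-normalize the
    # result so downstream tools such as LingPy get NFC input.
    return unicodedata.normalize('NFC', t.transform(string, *args, **kwargs))
```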

xrotwang commented 7 years ago

See https://github.com/bambooforest/segments/blob/fdb6b7ef98b8e013139c61334d6b53f6aa0920a7/segments/tokenizer.py#L161

LinguList commented 7 years ago

cool, thanks a lot!

bambooforest commented 7 years ago

Looks like this will be: NFC, NFKC, NFD, and NFKD, as per Python's unicodedata.
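For illustration, here is what the four forms do to a decomposed string with the standard library's unicodedata:

```python
import unicodedata

# 'a' followed by U+030A COMBINING RING ABOVE: NFC/NFKC compose it to 'å',
# while NFD/NFKD keep the decomposed sequence.
decomposed = 'a\u030a'
for form in ('NFC', 'NFKC', 'NFD', 'NFKD'):
    print(form, [hex(ord(c)) for c in unicodedata.normalize(form, decomposed)])
```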

xrotwang commented 7 years ago

Yeah, that should probably be checked, i.e.:

assert form is None or form in ['NFC', 'NFKC', ...]

bambooforest commented 7 years ago

ok, i'll write a test
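Such a test might look roughly like the sketch below; the form keyword on transform is an assumption for illustration (whatever parameter the linked tokenizer.py actually uses applies):

```python
import unicodedata

from segments.tokenizer import Tokenizer

def test_normalization():
    t = Tokenizer()
    decomposed = 'a\u030a'  # 'a' + combining ring above
    # Assumed interface: a `form` keyword selecting the normalization form.
    nfc_out = t.transform(decomposed, form='NFC')
    nfd_out = t.transform(decomposed, form='NFD')
    # Output should already be normalized to the requested form.
    assert nfc_out == unicodedata.normalize('NFC', nfc_out)
    assert nfd_out == unicodedata.normalize('NFD', nfd_out)
```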