cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Add "keep" strategy for error handling #45

Closed Anaphory closed 3 years ago

Anaphory commented 4 years ago

As mentioned in #44, I just had a use case for a keep strategy, in addition to ignore, replace and strict, in https://github.com/cldf/segments/blob/master/src/segments/errors.py and the tokenizers: passing unrecognised material through unchanged so that I can do more in-depth error handling afterwards. I will implement it and open a pull request.

Anaphory commented 4 years ago

(I can trigger the desired behaviour by passing lambda c: c to one of the error-handling keyword arguments I don't otherwise use, in my case errors_ignore=. That is, however, in no way the intended behaviour, and I find these three particular keyword arguments weird anyway.)

https://github.com/cldf/segments/blob/369e36dfb91cf62044c46c24880642dc9885a811/src/segments/tokenizer.py#L114-L116
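To make the workaround concrete, here is a minimal, self-contained sketch. ToyTokenizer and its strict, replace and ignore helpers are hypothetical stand-ins written for illustration. They are not the segments implementation, and only the keyword names errors_strict, errors_replace and errors_ignore are taken from the comment above. The sketch assumes that each strategy name maps to a callable which receives the offending character.

```python
def strict(char):
    # Hypothetical default: refuse any character outside the inventory.
    raise ValueError(f"invalid grapheme: {char!r}")


def replace(char):
    # Hypothetical default: substitute the Unicode replacement character.
    return "\ufffd"


def ignore(char):
    # Hypothetical default: drop the character.
    return ""


class ToyTokenizer:
    """Illustrative stand-in only; NOT the segments Tokenizer."""

    def __init__(self, graphemes, errors_strict=strict,
                 errors_replace=replace, errors_ignore=ignore):
        self.graphemes = set(graphemes)
        # Each strategy name is bound to a (possibly user-supplied) callable.
        self._errors = {
            "strict": errors_strict,
            "replace": errors_replace,
            "ignore": errors_ignore,
        }

    def __call__(self, text, errors="replace"):
        handler = self._errors[errors]
        out = [c if c in self.graphemes else handler(c) for c in text]
        # Empty strings (dropped characters) are filtered out before joining.
        return " ".join(s for s in out if s)


default = ToyTokenizer("abc")
# The workaround: rebind the otherwise unused "ignore" slot to the identity.
keeping = ToyTokenizer("abc", errors_ignore=lambda c: c)

dropped = default("abxc", errors="ignore")
kept = keeping("abxc", errors="ignore")
```

In this toy version, dropped loses the unknown x while kept retains it, which is the "keep" behaviour requested in this issue.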

xrotwang commented 4 years ago

I'm not sure what exactly you think is "weird". The API is designed following similar functionality in Python, see https://docs.python.org/3/library/stdtypes.html#bytes.decode
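For reference, the Python behaviour being pointed to looks roughly like this. The sketch uses only the standard bytes.decode method and its strict, ignore and replace error modes. The sample bytes and the try_strict helper are illustrative choices, not something from the thread.

```python
# UTF-8 bytes for a word containing one non-ASCII character.
data = "na\u00efve".encode("utf-8")  # b'na\xc3\xafve'


def try_strict(raw):
    # "strict" raises on the first byte the codec cannot decode.
    try:
        return raw.decode("ascii", errors="strict")
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


strict_result = try_strict(data)
# "ignore" silently drops the undecodable bytes.
ignore_result = data.decode("ascii", errors="ignore")
# "replace" substitutes U+FFFD for each undecodable byte.
replace_result = data.decode("ascii", errors="replace")
```

None of the three built-in modes shown here passes the offending input through unchanged, which is the gap the proposed keep strategy is meant to fill.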

xrotwang commented 4 years ago

Again following the example of Python's codecs module, you can pass your own callables as error handlers. And it is absolutely intended that you use this mechanism to trigger the behaviour you want to see. I agree, though, that this may not be well documented, in particular for cases like the one described here, where basically the same error handler is applied twice. But even considering the complication of this two-step process, I don't find it particularly strange to "abuse" the ignore strategy in the first step, to make sure replace does the right thing in the second step.
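The codecs precedent mentioned above can be sketched as follows. codecs.register_error is the standard-library hook for custom error handlers. The handler name "keep_sketch" and the choice to reinterpret undecodable bytes as Latin-1 are assumptions made for illustration, not anything proposed in this thread.

```python
import codecs


def keep_handler(exc):
    # A custom handler receives the exception and returns a tuple of
    # (replacement text, position at which decoding should resume).
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Assumption for this sketch: keep the bytes by reading them as Latin-1.
        return bad.decode("latin-1"), exc.end
    raise exc


# Register the handler under a hypothetical name, then select it by that name.
codecs.register_error("keep_sketch", keep_handler)

kept = b"na\xefve".decode("ascii", errors="keep_sketch")
```

Here kept preserves the byte that plain ASCII decoding would reject, analogous to passing a custom callable to one of the tokenizer's error-handling keyword arguments.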