cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Tokenizer.transform ignores error handling argument #44

Open Anaphory opened 4 years ago

Anaphory commented 4 years ago

The signature of Tokenizer.transform suggests that the function takes an argument describing how to deal with undefined segments (https://github.com/cldf/segments/blob/369e36dfb91cf62044c46c24880642dc9885a811/src/segments/tokenizer.py#L229), but the actual fallback (https://github.com/cldf/segments/blob/369e36dfb91cf62044c46c24880642dc9885a811/src/segments/tokenizer.py#L262) is always the replace strategy.

(I came here looking for a keep strategy that would bounce the errors back to me for inspection instead of transforming them; even ignore just replaces them with nothing. But that's a different issue.)
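
For illustration, here is roughly the workaround I have in mind: since the Tokenizer constructor accepts an errors_replace callable, passing the identity function should amount to a keep strategy. A minimal, untested sketch with a made-up profile (if I read the code correctly):

>>> from segments import Tokenizer, Profile
>>> # identity callable: undefined graphemes are bounced back unchanged
>>> keep = Tokenizer(profile=Profile({'Grapheme': 'a', 'IPA': 'b'}),
...                  errors_replace=lambda c: c)
>>> keep('ax')
'a x'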

LinguList commented 4 years ago

I agree that this is not very transparent, but I recommend you have a look at pylexibank first, where we use segments for exactly this behavior; it will show you a concrete tweak for how you can in fact trigger it.

LinguList commented 4 years ago

That is to say, it is not impossible; it is just not as transparent and easy as the parameter suggests.

Anaphory commented 4 years ago

I am able to trigger the behaviour I want. I just think that the current implementation has a bug.

LinguList commented 4 years ago

If you play with this, please also submit a fix to pylexibank, so that we don't break compatibility there. I agree that it would be better to refactor the current code.

xrotwang commented 4 years ago

@Anaphory I don't think this is necessarily a bug. The problem is that transforming - rather than just segmenting - is a two-step process, and both steps may require error handling. The error argument to Tokenizer.transform is honored by passing it to the first step: https://github.com/cldf/segments/blob/369e36dfb91cf62044c46c24880642dc9885a811/src/segments/tokenizer.py#L254

Now for the second step, mapping the output of the segmentation to a different symbol set according to the profile, it's not entirely clear what should be done. The default implementation of the replace strategy seems somewhat appropriate: Replace whatever isn't in the profile with the replacement marker. This is the rationale for using self._errors['replace'] in line 262.
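
To make the two steps concrete, here is a rough stand-alone sketch of the flow (simplified, not the actual implementation; the profile here is just a plain dict mapping graphemes to column values):

REPLACEMENT = '\ufffd'

def transform_sketch(profile, word, column, error=lambda g: REPLACEMENT):
    # Step 1: segmentation; the caller's error handler is honored here.
    graphemes = [g if g in profile else error(g) for g in word]
    # Step 2: map each grapheme to the requested column; anything the
    # profile does not know falls back to the fixed replace strategy.
    return ' '.join(
        profile[g][column] if g in profile else REPLACEMENT for g in graphemes)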

I agree, though, that this is not particularly transparent, and also think that we might need two sets of error handling for the two processing steps, since using just one set may be somewhat surprising:

>>> from segments import Tokenizer, Profile
>>> t = Tokenizer(profile=Profile({'Grapheme': 'a', 'IPA': 'b'}), errors_replace=lambda s: '<{0}>'.format(s))
>>> t('ab', column='IPA')
'b <<b>>'

(What happens above: the undefined b is wrapped once by errors_replace during segmentation, and then wrapped again during mapping because '<b>' is not in the profile.) So, while I don't think the current code has a bug, I'm open to an enhancement that allows a second set of error handling functions for the mapping step.
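
Something along these lines, where form_errors_replace is a purely hypothetical parameter name for the mapping step, not an existing option:

>>> t = Tokenizer(
...     profile=Profile({'Grapheme': 'a', 'IPA': 'b'}),
...     errors_replace=lambda s: '<{0}>'.format(s),        # step 1: segmentation
...     form_errors_replace=lambda s: '?{0}?'.format(s))   # step 2: mapping (hypothetical)
>>> t('ab', column='IPA')
'b ?<b>?'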

bambooforest commented 4 years ago

In an email thread we mentioned that we could have a default OP for strict IPA as a data sanity check (default for ipa=True), so the user doesn’t have to create an OP.
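
The intended usage would then be something like this sketch (not current behavior): a Tokenizer without a user-supplied profile would, for ipa=True, tokenize against the built-in strict-IPA profile and flag anything that is not valid IPA.

>>> from segments import Tokenizer
>>> t = Tokenizer()  # no user-supplied orthography profile
>>> t('tʰɔ̃x', ipa=True)  # would tokenize against the default strict-IPA profile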

And since the guys are in the process of assembling orthography profiles from lexibank for potential reuse, we could integrate segments with the collection of profiles, which would also be useful for tokenization and transformation.