**Open** · Anaphory opened this issue 4 years ago
I agree, I also think this is not that transparent, but I recommend having a look at pylexibank first, where we use segments for exactly this behavior; that will show you concretely how to trigger it. That is to say, it is not impossible, it is just not as transparent and easy as the parameter suggests.
I am able to trigger the behaviour I want. I just think that the current implementation has a bug.
If you play with this, please also submit a fix to pylexibank so that we don't break compatibility there. I agree that it would be better to refactor the current code.
@Anaphory I don't think this is necessarily a bug. The problem is that when you transform (rather than just segment), this is a two-step process, and both steps may require error handling. The `error` argument to `Tokenizer.transform` is honored by passing it to the first step: https://github.com/cldf/segments/blob/369e36dfb91cf62044c46c24880642dc9885a811/src/segments/tokenizer.py#L254
Now for the second step, mapping the output of the segmentation to a different symbol set according to the profile, it's not entirely clear what should be done. The default implementation of the `replace` strategy seems somewhat appropriate: replace whatever isn't in the profile with the replacement marker. This is the rationale for using `self._errors['replace']` in line 262.
I agree, though, that this is not particularly transparent, and also think that we might need two sets of error handling for the two processing steps, since using just one set may be somewhat surprising:
```python
>>> from segments import Tokenizer, Profile
>>> t = Tokenizer(profile=Profile({'Grapheme': 'a', 'IPA': 'b'}),
...               errors_replace=lambda s: '<{0}>'.format(s))
>>> t('ab', column='IPA')
'b <<b>>'
```
So, while I don't think the current code has a bug, I'm open to an enhancement that allows a second set of error handling functions for the mapping step.
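Until then, here is one way to tell the two steps apart with the current single hook. It relies on the fact (visible in the `'b <<b>>'` output above) that what the mapping step fails on is the handler's own output from the segmentation step, so the handler can detect its own wrapper; a minimal, admittedly heuristic sketch:

```python
from segments import Tokenizer, Profile

def errors(grapheme):
    # The same handler currently fires in both processing steps. A grapheme
    # the segmentation step could not match arrives here bare (e.g. 'b');
    # if the mapping step then fails on the wrapped token, the handler sees
    # it again, already wrapped, and can mark it differently. Checking for
    # the '<...>' wrapper is of course only a heuristic.
    if grapheme.startswith('<') and grapheme.endswith('>'):
        return '?{0}?'.format(grapheme[1:-1])  # mapping-step error
    return '<{0}>'.format(grapheme)            # segmentation-step error

t = Tokenizer(profile=Profile({'Grapheme': 'a', 'IPA': 'b'}),
              errors_replace=errors)
print(t('ab', column='IPA'))  # -> 'b ?b?'
```

A dedicated second set of handlers would make this workaround unnecessary, which is exactly the enhancement proposed above.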
In an email thread we mentioned that we could ship a default orthography profile for strict IPA as a data sanity check (the default for `ipa=True`), so the user doesn't have to create an orthography profile themselves.
And since orthography profiles from lexibank are currently being assembled for potential reuse, we could integrate segments with that collection of profiles, which would also be useful for tokenization and transformation.
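A rough sketch of what such a default could look like; this is not an existing feature, the inventory below is a tiny illustrative subset (a real default would need the full IPA grapheme set), and it assumes `Profile` accepts multiple grapheme spec dicts positionally, extending the single-dict construction shown above:

```python
from segments import Tokenizer, Profile

# Sanity-check profile: every grapheme maps to itself, so tokenization
# succeeds only for strict-IPA input and flags everything else.
STRICT_IPA = ['p', 'b', 't', 'd', 'k', 'g', 'tʃ', 'a', 'e', 'i', 'o', 'u']

profile = Profile(*({'Grapheme': g, 'IPA': g} for g in STRICT_IPA))
t = Tokenizer(profile=profile, errors_replace=lambda s: '<{0}>'.format(s))

print(t('tʃapu'))  # -> 'tʃ a p u'  (all graphemes are strict IPA)
print(t('capu'))   # -> '<c> a p u' (orthographic 'c' gets flagged)
```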
The signature of `Tokenizer.transform` suggests that the function can take an argument describing how to deal with undefined segments (https://github.com/cldf/segments/blob/369e36dfb91cf62044c46c24880642dc9885a811/src/segments/tokenizer.py#L229), but the actual fallback (https://github.com/cldf/segments/blob/369e36dfb91cf62044c46c24880642dc9885a811/src/segments/tokenizer.py#L262) is always the `replace` strategy.
(I came here looking for a `keep` strategy that would let me inspect the errors by bouncing them back to me instead of transforming them; even `ignore` just replaces them with nothing, but that's a different issue.)
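For what it's worth, until a real `keep` strategy exists, something close is already reachable through the `errors_replace` hook shown above, by making it the identity function so that unknown graphemes are bounced back verbatim:

```python
from segments import Tokenizer, Profile

# Emulating a 'keep' strategy with the existing hook: the identity function
# passes unknown graphemes through unchanged (at both processing steps), so
# they stay visible in the output instead of being masked or dropped.
t = Tokenizer(profile=Profile({'Grapheme': 'a', 'IPA': 'b'}),
              errors_replace=lambda s: s)

print(t('aXa', column='IPA'))  # -> 'b X b' ('X' is kept, not replaced)
```

The caveat is that kept tokens are indistinguishable from legitimately mapped ones when the symbols collide, which is why a first-class `keep` strategy would still be nicer.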