Closed xrotwang closed 7 years ago
@LinguList I think this behaviour was introduced in your PR https://github.com/bambooforest/segments/commit/ab017b3dee4d20abe35f1d2e0271925faa1893e6 Any ideas?
I think a couple of tests would go a long way towards explaining the intent of the functionality.
mea culpa. The behaviour before was: just don't tokenize things if there are unknown things. This is unwanted behaviour, because it's extremely difficult to debug. So I introduced a customizable character that would replace missing characters, e.g., a question mark.
Now, I think I messed things up in terms of explicitness, since -- sorry -- I wanted to have a working solution that I could use to create my orthography profiles.
Looking at it right now, I think one could replace the `_missing` by `missing`, but what I ALSO wanted to have as a specific behaviour (and what is also criticizable) is the possibility to mark the missing graphemes in some user-defined way, like "<...>".
I'll have a closer look in the afternoon / after department meeting. To summarize the wanted behavior for the moment:
Is that understandable so far? For me, in practice, it helped a lot; otherwise, I would've had huge problems creating any of the orthography profiles I made, as this can be very tedious and time-consuming.
Ok. I see the usefulness of this behaviour; I just wanted to double-check that the default shouldn't be to silently make up invalid tokens. So perhaps `missing` should always be a callable, defaulting to raising an exception?
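To illustrate the proposal, here is a minimal sketch of a `missing` callable, with hypothetical names (`MissingGraphemeError`, `raise_on_missing`, the toy `tokenize` function), not the actual segments API: the tokenizer calls `missing` for every grapheme not found in the profile, and the default handler raises instead of guessing.

```python
# Hypothetical sketch of a `missing` callable (not the segments API).

class MissingGraphemeError(ValueError):
    pass

def raise_on_missing(grapheme):
    # proposed default: fail loudly on unknown graphemes
    raise MissingGraphemeError('grapheme not in profile: %r' % grapheme)

def tokenize(word, profile, missing=raise_on_missing):
    """Character-by-character lookup against a toy profile dict."""
    return ' '.join(profile[ch] if ch in profile else missing(ch)
                    for ch in word)

profile = {'t': 't', 'O': 'ɔ', 'x': 'x'}
print(tokenize('tOx', profile))   # known graphemes tokenize normally
try:
    tokenize('tOx@', profile)     # '@' is not in the profile
except MissingGraphemeError as e:
    print('error:', e)
```

A caller who prefers replacement over failure could pass e.g. `missing=lambda c: '<%s>' % c` instead.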
Yep, I guess, that is useful.
@LinguList If you could turn your description of the desired behaviour into a couple of tests, that would also be a good way to document it.
Alright, will do so!
in the book, page 86, we state in the formal orthography specification:
B9. Leftover characters, i.e. characters that are not matched by the profile, should be reported to the user as errors. Typically, the unmatched characters are replaced in the tokenization by a user-specified symbol string.
Honestly, I don't remember what the default behavior of segments was before, but let's let the user specify the symbol moving forward, with the default being, I guess, "?" (because I think it's useful and standard to reserve "#" for word boundaries).
Yes, it was a question mark, I remember, but here is my problem: if it is always a question mark, we may confuse a real question mark with an error question mark. Furthermore, we would like to see what went wrong, that is, which character was not captured. For this reason, I introduced that crappy implementation of the new behaviour, which models something like:

`t_hOxt@`

both as

`t_h O x t <@>`

in the graphemes, but also as:

`tʰ ɔ x t <@>`

in the transformed form. Here, we conveniently see that the word was nicely converted, but that my SAMPA profile lacks a specification for "@". Given that the marking of errors can easily be ambiguous, since there may be datasets that HAVE <> for other purposes, we further need to allow the user to change the default marking (for the moment, <> works fine; normal brackets would be problematic, but something like "?:@" would work as well, etc.).
You see the point? Despite the ugly code, in practice this greatly facilitated the production of the profiles I have written so far, as I could immediately see what I had been missing.
I just looked up how python handles encoding/decoding errors. It seems as if this scheme could also handle all of our use cases, and would be the easiest in terms of documentation.
So `Tokenizer.__call__` would gain a keyword argument `errors`, which accepts a string specifying the desired behaviour, defaulting to `'replace'`. Following the Python practice of using U+FFFD REPLACEMENT CHARACTER to signal replacement also seems like a good idea, I think. In any case, we could also copy the error handler registration mechanism to allow even more control (but replacing U+FFFD REPLACEMENT CHARACTER with some custom character after tokenization doesn't seem like too much effort either).
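For reference, Python's standard library already implements exactly this scheme in the `codecs` module: an `errors` string selects a handler, the built-in `'replace'` handler inserts U+FFFD, and `codecs.register_error` registers custom handlers by name. A minimal demonstration (the `'angle'` handler name is made up for this example):

```python
import codecs

# The built-in 'replace' handler substitutes U+FFFD for bad bytes.
print(b'a\xffb'.decode('ascii', errors='replace'))  # 'a\ufffdb'

def angle_brackets(exc):
    # custom handler: wrap the offending bytes in <...>, mimicking the
    # replacement scheme discussed above; must return (str, position)
    bad = exc.object[exc.start:exc.end]
    return ('<%s>' % bad.hex(), exc.end)

codecs.register_error('angle', angle_brackets)
print(b'a\xffb'.decode('ascii', errors='angle'))    # 'a<ff>b'
```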
@LinguList your `<...>` replacement scheme could be implemented by registering a custom error handler for `'replace'` mode, e.g.:

```python
t = Tokenizer()
t.register_error('replace', lambda c: '<{0}>'.format(c))
t('t_hOxt@')
```
Excellent. As long as we can have this behavior, all is fine with me.
@LinguList @bambooforest I'll put together a PR for you to review.
Super, thanks in advance, and sorry for the bad code I submitted in 2016: I was under pressure and needed things to run, but I'll try to improve on these things and (try to) communicate discussions of desired behavior before coding it in the future.
sounds like an excellent solution @xrotwang -- the replacement character even contains a question mark, haha.
http://www.fileformat.info/info/unicode/char/fffd/index.htm
and it has some decent font coverage:
http://www.fileformat.info/info/unicode/char/fffd/fontsupport.htm
Just came across one complication: when implementing a test for @LinguList's `<...>` replacement scheme, I happened to use a character which isn't in `test.prf`, but is in `test.rules`, so to my confusion, I got:

```python
>>> t('habe')
'<i> a b <e>'
```
One could argue that rules shouldn't be applied to the results of the replacement error handler, but that would require some knowledge of the implementation of this handler. Since the use case for custom replacement handlers seems to be solely the initial creation of orthography profiles, I'd guess no rules will be defined at that point. So I'd say custom error handlers and their effects are the sole responsibility of the calling code, not of the tokenizer.
In general, when working on my SEA language orthography profiles, I tried to avoid rules as long as possible, preferring to customize output by spelling out whole words that are exceptions rather than providing a small rule. This increases redundancy, but it is easier to control: rules follow a specific ordering, while redundant full-word segmentations for erroneous transcriptions do not require any order, since the algorithm starts with the longest chunk anyway. I have had good experience with this in practice, and I'd say it should be a recommendation for all orthography profiles: try to avoid the use of rules.
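The longest-chunk behaviour relied on above can be sketched as a greedy longest-match tokenizer (an assumption about the algorithm as described, not segments' actual implementation): because the longest profile entry wins at each position, a full-word exception like `habe` automatically takes precedence over its single-character parts, with no rule ordering involved.

```python
# Greedy longest-match sketch (assumed behaviour, not the segments code).
# A full-word profile entry outranks its parts without any ordering.

def tokenize(word, profile):
    out, i = [], 0
    maxlen = max(len(k) for k in profile)
    while i < len(word):
        # try the longest possible chunk first, then shrink
        for n in range(min(maxlen, len(word) - i), 0, -1):
            chunk = word[i:i + n]
            if chunk in profile:
                out.append(profile[chunk])
                i += n
                break
        else:
            out.append('<%s>' % word[i])  # mark unmatched characters
            i += 1
    return ' '.join(out)

profile = {'h': 'h', 'a': 'a', 'b': 'b', 'e': 'e', 'habe': 'ha.be'}
print(tokenize('habe', profile))  # full-word entry wins: 'ha.be'
print(tokenize('hab', profile))   # falls back to single graphemes
```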
I'm somewhat confused by the way the `missing`, `_missing` and `exception` keyword arguments of `Tokenizer.transform` are supposed to work. Locally, I fixed a bug where a callable was passed to `_search_graphemes` as `_missing`, but now graphemes missing in the profile are just silently tokenized by default - which I think is wrong. If a profile exists, missing tokens should raise an exception by default, right?