cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Confusing tokenization behaviour for incomplete orthography profiles #7

Closed xrotwang closed 7 years ago

xrotwang commented 7 years ago

I'm somewhat confused by the way the missing, _missing and exception keyword arguments of Tokenizer.transform are supposed to work. Locally, I fixed a bug where a callable was passed to _search_graphemes as _missing, but now graphemes missing in the profile are just silently tokenized by default - which I think is wrong. If a profile exists, missing tokens should raise an exception by default, right?

xrotwang commented 7 years ago

@LinguList I think this behaviour was introduced in your PR https://github.com/bambooforest/segments/commit/ab017b3dee4d20abe35f1d2e0271925faa1893e6 Any ideas?

xrotwang commented 7 years ago

I think a couple of tests would go a long way towards explaining the intent of the functionality.

LinguList commented 7 years ago

Mea culpa. The behaviour before was: just don't tokenize things if there are unknown things. This is unwanted behaviour, because it's extremely difficult to debug. So I introduced a customizable character that would replace missing characters, e.g., a question mark.

Now, I think I messed things up in terms of explicitness, since -- soooryyy -- I wanted to have a working solution that I could use to create my orthoprofiles.

Looking at it right now, I think one could replace _missing by "missing", but what I ALSO wanted as a specific behaviour (and what is also criticizable) is the possibility to mark the missing graphemes in some user-defined way, like "<m>" for a missing "m". So I introduced an ugly lambda function in "transform".

LinguList commented 7 years ago

I'll have a closer look in the afternoon / after department meeting. To summarize the wanted behavior for the moment:

Is that understandable so far? For me, in practice, it helped a lot, and otherwise I would've had huge problems in creating any of the orthoprofiles I created, as this can be very tedious and time consuming.

xrotwang commented 7 years ago

Ok. I see the usefulness of this behaviour; just wanted to double-check that the default shouldn't be silently making up invalid tokens. So perhaps missing should always be a callable, defaulting to raising an exception?
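Something along these lines, perhaps (a hypothetical sketch, not segments' actual implementation; the grapheme matching is reduced to a greedy longest-match loop):

def tokenize(word, profile, missing=None):
    # Greedy longest-match against a grapheme-to-IPA mapping.
    # `missing` is called for characters the profile does not cover;
    # by default we raise instead of silently passing them through.
    if missing is None:
        def missing(char):
            raise ValueError('no profile entry for {0!r}'.format(char))
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest chunk first
            if word[i:j] in profile:
                tokens.append(profile[word[i:j]])
                i = j
                break
        else:
            tokens.append(missing(word[i]))
            i += 1
    return tokens

With the default, tokenize('t_hOxt@', {'t_h': 'tʰ', 'O': 'ɔ', 'x': 'x', 't': 't'}) raises on the "@"; passing a different callable produces marked-up output instead.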

LinguList commented 7 years ago

Yep, I guess that is useful.

xrotwang commented 7 years ago

@LinguList If you could turn your description of the desired behaviour into a couple of tests, that would also be a good way to document it.

LinguList commented 7 years ago

Alright, will do so!

bambooforest commented 7 years ago

in the book, page 86, we state in the formal orthography specification:

B9. Leftover characters, i.e. characters that are not matched by the profile, should be reported to the user as errors. Typically, the unmatched characters are replaced in the tokenization by a user-specified symbol-string.

honestly, i don't remember what the default behavior of segments was before, but let's let the user specify the symbol moving forward, with default being i guess "?" (because i think it's useful and standard to reserve "#" for word boundaries)

LinguList commented 7 years ago

Yes, it was a question mark, I remember, but here is my problem: if it is always a question mark, we may confuse a real question mark with an error question mark. Furthermore, we would like to see what went wrong, that is, which character was not captured, and for this reason I introduced that crappy implementation of the new behaviour that would model something like:

t_hOxt@

both as

t_h O x t <@>

in the graphemes, but also as:

tʰ ɔ x t <@>

in the transformed form. Here, we conveniently see that the word was nicely converted, but that my sampa-profile lacks a specification for "@". Since the marking of errors can easily be ambiguous (there may be datasets that HAVE <> for other purposes), we further need to allow the user to change the default marking: for the moment, <> works fine, but normal brackets would be problematic, and something like "?:@" would work as well, etc.

You see the point? Despite the ugly code, in practice, this greatly facilitated the production of the profiles I wrote so far, as I could immediately see what I had been missing.
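With the sketch from the comment above, this user-defined marking is just a different missing callable (hypothetical code again):

profile = {'t_h': 'tʰ', 'O': 'ɔ', 'x': 'x', 't': 't'}
tokenize('t_hOxt@', profile, missing=lambda c: '<{0}>'.format(c))
# -> ['tʰ', 'ɔ', 'x', 't', '<@>']
tokenize('t_hOxt@', profile, missing=lambda c: '?:' + c)
# -> ['tʰ', 'ɔ', 'x', 't', '?:@']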

xrotwang commented 7 years ago

I just looked up how Python handles encoding/decoding errors. It seems this scheme could also handle all of our use cases, and it would be the easiest in terms of documentation.

So Tokenizer.__call__ would gain a keyword argument errors which accepts a string specifying the desired behaviour, defaulting to 'replace'. Following the Python practice of using U+FFFD REPLACEMENT CHARACTER to signal replacement also seems like a good idea, I think. In any case, we can also copy the error handler registration mechanism to allow even more control (but replacing U+FFFD REPLACEMENT CHARACTER with some custom character after tokenization doesn't seem too much effort either).
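For reference, this is what Python's own scheme looks like (standard library only, nothing segments-specific; the 'angle' handler name is made up):

import codecs

# the built-in 'replace' handler substitutes U+FFFD:
b't_hOxt\xff'.decode('utf-8', errors='replace')
# -> 't_hOxt\ufffd'

# custom handlers are registered by name and receive the UnicodeDecodeError:
def angle_brackets(err):
    bad = err.object[err.start:err.end].decode('latin-1')
    return '<{0}>'.format(bad), err.end

codecs.register_error('angle', angle_brackets)
b't_hOxt\xff'.decode('utf-8', errors='angle')
# -> 't_hOxt<ÿ>'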

xrotwang commented 7 years ago

@LinguList your <...> replacement scheme could be implemented by registering a custom error handler for replacement mode, e.g.:

from segments import Tokenizer

t = Tokenizer()
# wrap any unmatched character in angle brackets on 'replace':
t.register_error('replace', lambda c: '<{0}>'.format(c))
t('t_hOxt@')

LinguList commented 7 years ago

excellent. As long as we can have this behavior, all is fine with me.

xrotwang commented 7 years ago

@LinguList @bambooforest I'll put together a PR for you to review.

LinguList commented 7 years ago

Super, thanks in advance, and sorry for the bad code I submitted in 2016: I was under pressure and needed things to run. But I'll try to improve on these things and (try to) discuss desired behavior before coding it in the future.

xrotwang commented 7 years ago

https://www.youtube.com/watch?v=8jQlubar9A8

bambooforest commented 7 years ago

sounds like an excellent solution @xrotwang -- the replacement character even contains a question mark, haha.

http://www.fileformat.info/info/unicode/char/fffd/index.htm

and has some decent font coverage

http://www.fileformat.info/info/unicode/char/fffd/fontsupport.htm

xrotwang commented 7 years ago

Just came across one complication: When implementing a test for @LinguList's <...> replacement scheme, I happened to use a character which isn't in test.prf, but is in test.rules, so to my confusion, I got

>>> t('habe')
'<i> a b <e>'

One could argue that rules shouldn't be applied to the results of the replacement error handler. But that would require the tokenizer to have some knowledge of how this handler is implemented. And since the use case for custom replacement handlers seems to be solely the initial creation of orthography profiles, I'd guess there won't be any rules defined by then. So I'd say custom error handlers and their effects are the sole responsibility of the calling code, not of the tokenizer.

LinguList commented 7 years ago

In general, when working on my SEA language orthoprofiles, I tried to avoid rules as long as possible, preferring to customize output by spelling out whole words which are exceptions, rather than adding a small rule. This increases redundancy, but it is easier to control: rules follow a specific ordering, while redundant full-word entries for erroneous transcriptions don't require any order, as the algorithm starts with the longest chunk anyway. I have had good experience with this in practice, and I'd say it should be a recommendation for all orthoprofiles: try to avoid the usage of rules.
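To illustrate (a made-up snippet in the tab-separated profile format segments reads; the IPA mapping for "t_hOxt@" is invented): the last row spells out a whole exceptional word, and since matching starts with the longest chunk, it takes precedence over the single-character rows without any ordering concerns.

Grapheme	IPA
t_h	tʰ
O	ɔ
x	x
t	t
t_hOxt@	tʰ ɔ x t ə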