cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Handling of longer-than-one graphemes in case of errors #27

Closed tresoldi closed 6 years ago

tresoldi commented 6 years ago

There is a behavior of the tokenizer that I believe is a bug, but it might be known or even intended (it ends up helping to catch missing graphemes) -- or I might be missing something obvious (maybe I am building my profiles in a wrong way that happens to work in most cases?).

Suppose we have a profile with only three graphemes, t, o, and oː, mapped to their respective IPA values; the profile is loaded the standard way (passing the filename to __init__, as pylexibank does internally). If we __call__ the tokenizer on a well-formed string such as toːt, it will correctly return t oː t. However, if we introduce an error with a string such as toːt=, it will return t o � t �, failing to match oː. This is ultimately due to the way missing entries are handled: transform() asks for the graphemes(), which are built by calling self.op.tree.parse() (tokenizer.py:326); if parsing fails, all characters missing from the profile are marked as unknown (tokenizer.py:331). The error is that it should not look for unknown characters, but for unknown graphemes, which can be composed of more than one character (with the usual caveat that longer graphemes must be tried first).
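For illustration, a minimal reproduction of the reported behavior (the profile filename is hypothetical; it is assumed to map t, o, and oː to their IPA values):

>>> from segments import Tokenizer
>>> t = Tokenizer('profile.tsv')  # hypothetical profile file
>>> t('toːt')
't oː t'
>>> t('toːt=')
't o � t �'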

There would be a few different ways for solving this (if it is supposed to be solved, that is), but I didn't want to proceed on my own without discussing it, especially because any change to segments might unintentionally break a few other operations down the line.

LinguList commented 6 years ago

this has also been torturing me for some time, which is why I initially added the patch that would at least convert all single-character segments. But apart from this, I figured that it is by NO means trivial to segment if you have one element that fails! You'd need a network of all possible mappings, which is simply not what you want, and for this reason I left it as is in the segments code (before, it would not even convert known segments).

But a possible simple strategy, once the segmentation fails, would be to re-test segmentation on each variant of the bad word obtained by removing n of its characters. This would not take much time, as we deal with short strings, and it could be an easy way to handle this, even independently of the segments code. In your case, it would correctly identify that the last character is bad, and if you restrict n to two, assuming that everything with more than two bad segments is problematic enough that it should fully fail, this would still maintain high performance and be very easy to handle code-wise.

LinguList commented 6 years ago

But unless you are confident you can find a true solution for the problem (e.g., a Dijkstra-like search for the best segmentation WITH unknown segments), I'd recommend not touching the current code, but rather following the simple patch with element-wise deletion of characters (types, not tokens, of course).
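A minimal sketch of this deletion strategy, assuming a hypothetical parse(word) helper that returns a segmentation or None on failure (not part of the current segments API):

from itertools import combinations

def segment_with_deletions(word, parse, max_bad=2):
    # retry parsing after removing up to max_bad distinct characters
    # (character types, not tokens, as suggested above)
    for n in range(max_bad + 1):
        for bad in combinations(sorted(set(word)), n):
            candidate = ''.join(c for c in word if c not in bad)
            result = parse(candidate)
            if result is not None:
                return result, bad  # bad = the offending character types
    return None, ()  # more than max_bad problems: let the word fail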

tresoldi commented 6 years ago

Indeed, a typical CS problem. I played with the option of some shortest-path algorithm, but maybe we don't need something that powerful?

Just throwing out some ideas. My general solution, presented in the dumbest possible way to make it clear (not even using yield), is something like:

In [20]: def my_split1(word, graphemes):
    ...:     # try the longest graphemes first, so that e.g. 'o:' wins over 'o'
    ...:     keys = sorted(graphemes, key=len, reverse=True)
    ...:     longest = len(keys[0])
    ...:     idx = 0
    ...: 
    ...:     f = []
    ...:     while idx < len(word):
    ...:         for w in range(longest, 0, -1):
    ...:             if word[idx:idx+w] in keys:
    ...:                 f.append(word[idx:idx+w])
    ...:                 idx += w
    ...:                 break
    ...:             elif w == 1:
    ...:                 # no grapheme matches here: mark the character as unknown
    ...:                 f.append('�')
    ...:                 idx += 1
    ...: 
    ...:     return f
    ...: 

In [21]: profile = {'t':'t', 'o':'o', 'o:': 'oː'}

In [22]: words = ['to:t', 'to:t=']

In [23]: print([my_split1(w, profile.keys()) for w in words])
[['t', 'o:', 't'], ['t', 'o:', 't', '�']]

It works because we always parse/match left to right, and potential overlaps are always resolved in favor of the longest grapheme (like when having both ts and s). If you look at it, it is a simple automaton/transducer.

The most obvious way to improve this is to realize that it works like a regular expression because it is one; in fact, we could use Python's re instead of my crappy code above (defaulting to the dot 'match everything' as the last alternative). This, however, can lead to all sorts of regex problems: either the profiles are written with regular expressions in mind (which is not an option), or we pre-process the graphemes before building the regular expression (doing all sorts of escapes, etc.)

In [35]: import re

In [36]: def my_split2(word, graphemes):
    ...:     # longest graphemes first, so the alternation prefers them;
    ...:     # a final '.' serves as catch-all for anything not in the profile
    ...:     keys = sorted(graphemes, key=len, reverse=True) + ['.']
    ...:     regex = '|'.join(keys)
    ...: 
    ...:     subfields = re.findall(regex, word)
    ...:     # anything matched only by the catch-all is marked unknown
    ...:     subfields = [f if f in keys else '�' for f in subfields]
    ...: 
    ...:     return subfields
    ...: 

In [37]: print([my_split2(w, profile.keys()) for w in words])
[['t', 'o:', 't'], ['t', 'o:', 't', '�']]

Other alternatives could be to just split the word into characters and see whether we can join subsequent ones while matching graphemes, or to do a "true" subfield partition (with a recursive function applied to both the left and right subfields), etc.
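A rough sketch of that recursive partition idea (illustrative only; note that it prioritizes grapheme length over left-to-right position, unlike my_split1):

def recursive_split(word, keys):
    # find a known grapheme anywhere in the word, then segment the
    # substrings to its left and right independently
    if not word:
        return []
    for k in sorted(keys, key=len, reverse=True):
        pos = word.find(k)
        if pos != -1:
            return (recursive_split(word[:pos], keys)
                    + [k]
                    + recursive_split(word[pos + len(k):], keys))
    return ['�'] * len(word)  # no known grapheme left: all unknown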

LinguList commented 6 years ago

Good idea, I think it is worthwhile to add one of the solutions to the OP (orthography profile) code, or to test it a bit first. In any case: this has been bothering me for some time now, and I agree it would be useful to handle it, albeit in a pragmatic way.

tresoldi commented 6 years ago

Feel free to take the code above, or, if you prefer, I could fork it and do it after I finish my dataset.

A generic new function based on the first solution seems optimal, so that it can be used in testing (the best idea would seem to be checking for differences in lexibank); if a better solution/algorithm comes up later, we can just replace the code.

LinguList commented 6 years ago

Maybe let's leave it open for the time being and discuss it later. I don't think I'd find time to work this in before next year, but as it's straightforward, it will be easy to do later, I hope. (I'd also be keen on testing it against my dumb solution of just deleting one char from the string consecutively, until all are tested, to find the bad chars in words with no more than one error.)

tresoldi commented 6 years ago

Ok, agreed.


tresoldi commented 6 years ago

Just a quick note for future development: I had forgotten about re.escape(), which would hopefully solve all the problems I mentioned about using re.
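For instance, a sketch of how my_split2's pattern could be built safely (illustrative, not the package's actual code):

import re

def grapheme_pattern(graphemes):
    # escape each grapheme so profile entries such as '=' or '.' are
    # matched literally; the unescaped '.' stays as the catch-all
    keys = sorted(graphemes, key=len, reverse=True)
    return '|'.join(re.escape(k) for k in keys) + '|.'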

xrotwang commented 6 years ago

@tresoldi looks useful - and nothing is really lost, because we still make sure the result is detectable as an error. Do you want to submit a PR, or should I merge your code into the package?

tresoldi commented 6 years ago

@xrotwang I could handle it, so that we have a proper issue/review/etc. Just merge it if it is somewhat urgent for the refactoring you're working on; I'm not sure I can finish it today (having to deal with the Ausländerbehörde is enough ;) ).

Just to be clear, we are talking about the second solution, with regular expressions and re.escape(), right?

xrotwang commented 6 years ago

@tresoldi yes, the second solution. Thanks!

tresoldi commented 6 years ago

Ok, I'm on it.

xrotwang commented 6 years ago

@tresoldi quick question: the fix you are working on is purely an enhancement of find_missing_characters, right? Because I'm playing around with Tree.parse, making it return the list of matched graphemes and the remainder of the string in case an error occurred; i.e.:

>>> from segments.tree import Tree
>>> t = Tree(['sch', 'b', 'i', 's', 'ch', 'e', 'n'])
>>> t.parse('bischen')
(['b', 'i', 'sch', 'e', 'n'], '')
>>> t.parse('biscvhen')
(['b', 'i', 's'], 'cvhen')
>>> t.parse('bischven')
(['b', 'i', 'sch'], 'ven')
xrotwang commented 6 years ago

@tresoldi actually, I just got around to having Tree.parse handle errors the way we would like, i.e. restarting after an error (and chopping off one character which possibly caused the error):

>>> from segments.tree import Tree
>>> t = Tree(['t', 'o', 'oː'])
>>> t.parse('toːt')
['t', 'o\xcb\x90', 't']
>>> t.parse('toːt=')
['t', 'o\xcb\x90', 't', u'\ufffd']
>>> t.parse('toːt=toːt')
['t', 'o\xcb\x90', 't', u'\ufffd', 't', 'o\xcb\x90', 't']
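The restart logic is roughly as follows (a hedged sketch, not the actual segments.tree implementation, assuming a hypothetical longest-match helper):

def parse_with_restart(word, match_longest):
    # match_longest(word, idx) is assumed to return the longest profile
    # grapheme starting at idx, or None if nothing matches there
    result, idx = [], 0
    while idx < len(word):
        grapheme = match_longest(word, idx)
        if grapheme is None:
            result.append('\ufffd')  # emit the replacement character ...
            idx += 1                 # ... and chop off the offending char
        else:
            result.append(grapheme)
            idx += len(grapheme)
    return result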
tresoldi commented 6 years ago

That is one way to describe it, yes. As the current version returns an empty string when it fails, my solution just does a greedy left-to-right parsing instead of a top-down tree (a "better than nothing" approach).

That said, while your version is useful (we can know where it is failing), I am not sure whether it is better for dataset cleaning than having t.parse('bischven') return ['b', 'i', 'sch', '�', 'e', 'n']... We probably once more have different ideas about how it should work in theory -- but yours is more elegant in the abstract, that's for sure.

tresoldi commented 6 years ago

@tresoldi actually, I just got around to have Tree.parse handle errors the way we would like, i.e. restarting after an error (and chopping off one character which possibly caused the error):

That is great! No need for hacky workarounds with regex! I'll wait for it to be incorporated and kill the PR later. :)

xrotwang commented 6 years ago

Yes, I think that's the way to go. It's a fairly minimal change in segments.tree and the complete test suite still passes.

tresoldi commented 6 years ago

Great, closed the PR!