this has also been torturing me for some time, which is why I initially added the patch that would at least convert all single-character segments. But apart from this, I figured that it is by NO means trivial to segment if you have one element that fails! You'd need a network of all possible mappings, which is simply not what you want, and for this reason I left it as is in the segments code (before, it would not even convert known segments).
But a possible simple strategy, once the segmentation fails, would be to re-test segmentation for each variant of the bad word obtained by removing n of its characters. This would not require much time, as we have short strings, and it could be an easy way to handle this, even independently of the segments code. In your case, it would correctly identify that the last character is bad, and if you restrict n to two, assuming that everything with more than two bad segments is problematic enough and should fully fail, this would still maintain high performance and be very easy to handle code-wise.
But unless you are confident you can find a true solution for the problem (e.g., a Dijkstra-like search for the best segmentation WITH unknown segments), I'd recommend not touching the current code, but rather following the simple patch with element-wise deletion of characters (types, not tokens, of course).
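A rough sketch of that deletion strategy, assuming a hypothetical `segment()` callable that returns `None` when a word cannot be segmented (the names and the interface here are placeholders, not the actual segments API):

```python
from itertools import combinations

def segment_with_deletions(word, segment, max_deletions=2):
    # try the word as is first
    result = segment(word)
    if result is not None:
        return result, ()
    # retry after deleting every combination of up to `max_deletions` characters
    for n in range(1, max_deletions + 1):
        for positions in combinations(range(len(word)), n):
            candidate = ''.join(c for i, c in enumerate(word) if i not in positions)
            result = segment(candidate)
            if result is not None:
                # also report which characters had to be dropped
                return result, tuple(word[i] for i in positions)
    # more than `max_deletions` bad characters: give up
    return None, None
```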
Indeed, a typical CS problem. I played with the option of some shortest-path algorithm, but maybe we don't need something that powerful?
Just throwing out some ideas. My general solution, presented in the dumbest possible way to make it clear (not even using `yield`), is something like:
```python
def my_split1(word, graphemes):
    # longest graphemes first, so the greedy match prefers them
    keys = sorted(graphemes, key=len, reverse=True)
    longest = len(keys[0])
    idx = 0

    f = []
    while idx < len(word):
        for w in range(longest, 0, -1):
            if idx + w <= len(word) and word[idx:idx + w] in keys:
                # longest known grapheme starting at this position
                f.append(word[idx:idx + w])
                idx += w
                break
            elif w == 1:
                # nothing matches here: flag an unknown grapheme and move on
                f.append('�')
                idx += 1

    return f


profile = {'t': 't', 'o': 'o', 'o:': 'oː'}
words = ['to:t', 'to:t=']
print([my_split1(w, profile.keys()) for w in words])
# [['t', 'o:', 't'], ['t', 'o:', 't', '�']]
```
It works because we always parse/match left to right, and potential overlaps are always solved in favor of the leftmost element (like when having `ts` and `s`). If you look at it, it is a simple automaton/transducer.
The most obvious way to improve this is to realize that it works like a regular expression because it is one, and in fact we could use Python's `re` instead of my crappy code above (defaulting to the dot 'match everything' as the last alternative). This, however, can lead to all sorts of regex problems: either the profiles are written with regular expressions in mind (which is not an option), or we treat the graphemes before building the regular expression (doing all sorts of escapes, etc.).
```python
import re

def my_split2(word, graphemes):
    # longest graphemes first; '.' is the catch-all for unknown characters
    keys = sorted(graphemes, key=len, reverse=True) + ['.']
    regex = '|'.join(keys)

    subfields = re.findall(regex, word)
    # anything matched only by the catch-all is flagged as unknown
    subfields = [f if f in keys else '�' for f in subfields]

    return subfields


# using the same profile and words as above
print([my_split2(w, profile.keys()) for w in words])
# [['t', 'o:', 't'], ['t', 'o:', 't', '�']]
```
Other alternatives could be to just split all characters and see if we can join subsequent ones while matching graphemes, doing a "true" subfield partition (with a recursive function applied to both left and right subfields), etc.
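As a very rough illustration of the "join subsequent characters" idea (only a pairwise greedy merge, so it would miss multi-character graphemes whose prefixes are not themselves graphemes):

```python
def my_split3(word, graphemes):
    # start from single characters and greedily merge a character into the
    # previous chunk whenever the merged string is itself a known grapheme
    merged = []
    for ch in word:
        if merged and merged[-1] + ch in graphemes:
            merged[-1] += ch
        else:
            merged.append(ch)
    return [p if p in graphemes else '�' for p in merged]

print([my_split3(w, profile.keys()) for w in words])
# [['t', 'o:', 't'], ['t', 'o:', 't', '�']]
```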
Good idea, I think it is worthwhile adding one of the solutions to the OP code, or to test it first a bit. In any case: this has been bothering me for some time now, and I agree it would be useful to handle it, albeit in a pragmatic way.
Feel free to take the code above, or if you prefer I could fork it and do it after I finish my dataset.
A generic new function based on the first solution seems optimal, so that it can be used in testing (the best idea would seem to be checking differences in lexibank); if later a better solution/algorithm comes up, we just replace the code.
Maybe let's leave it open for the time being and discuss later. Before next year, I think, I won't find time to work this in, but as it's straightforward, it will be easy to do later, I hope (I'd also be keen on testing against my dumb solution of just deleting one char from the string consecutively, until all positions are tested, to catch words with no more than one bad char).
Ok, agreed.
Just a quick note for future development: I had forgotten about `re.escape()`, which hopefully would solve all the problems I mentioned about using `re`.
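For reference, a minimal sketch of what the escaped variant of the regex approach above might look like (just the idea, not the code that ended up in the package):

```python
import re

def my_split2_escaped(word, graphemes):
    # escape each grapheme so that regex metacharacters in a profile
    # (e.g. '+', '(', '|') are treated literally
    keys = sorted(graphemes, key=len, reverse=True)
    regex = '|'.join(re.escape(k) for k in keys) + '|.'
    return [f if f in keys else '�' for f in re.findall(regex, word)]
```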
@tresoldi looks useful - and nothing is really lost, because we still make sure the result is detectable as an error. Do you want to submit a PR, or should I merge your code into the package?
@xrotwang I could handle it, so we have a proper issue/review/etc. Just merge it if it is somewhat urgent for the refactoring you're working on, as I'm not sure I can finish it today (having to deal with the Ausländerbehörde is enough ;) ).
Just to be clear, we are talking about the second solution, with regular expressions and `re.escape()`, right?
@tresoldi yes, the second solution. Thanks!
Ok, I'm on it.
@tresoldi quick question: the fix you are working on is purely an enhancement of `find_missing_characters`, right? Because I'm playing around with `Tree.parse`, making it return the list of matched graphemes and the remainder of the string in case an error occurred; i.e.:
```python
>>> from segments.tree import Tree
>>> t = Tree(['sch', 'b', 'i', 's', 'ch', 'e', 'n'])
>>> t.parse('bischen')
(['b', 'i', 'sch', 'e', 'n'], '')
>>> t.parse('biscvhen')
(['b', 'i', 's'], 'cvhen')
>>> t.parse('bischven')
(['b', 'i', 'sch'], 'ven')
```
@tresoldi actually, I just got around to having `Tree.parse` handle errors the way we would like, i.e. restarting after an error (and chopping off one character which possibly caused the error):
```python
>>> from segments.tree import Tree
>>> t = Tree(['t', 'o', 'oː'])
>>> t.parse('toːt')
['t', 'o\xcb\x90', 't']
>>> t.parse('toːt=')
['t', 'o\xcb\x90', 't', u'\ufffd']
>>> t.parse('toːt=toːt')
['t', 'o\xcb\x90', 't', u'\ufffd', 't', 'o\xcb\x90', 't']
```
It is one way to describe it, yes. As the current version returns an empty string when it fails, my solution is just doing a greedy left-to-right parsing instead of a top-down tree (a "better than nothing" approach).
That said, while your version is useful (we can know where it is failing), I am not sure if it is better for dataset cleaning than having `t.parse('bischven')` return `['b', 'i', 'sch', '�', 'e', 'n']`... We probably once more have different ideas about how it should work in theory -- but yours is more elegant when thinking in the abstract, that's for sure.
> @tresoldi actually, I just got around to having `Tree.parse` handle errors the way we would like, i.e. restarting after an error (and chopping off one character which possibly caused the error):
That is great! No need for hacky work-arounds with regex! I'll wait for it to be incorporated and kill the PR later. :)
Yes, I think that's the way to go. It's a fairly minimal change in `segments.tree`, and the complete test suite still passes.
Great, closed the PR!
There is a behavior of the tokenizer which I believe is a bug, but might be known or even intended (it ends up helping to catch missing graphemes) -- or I might be missing something obvious (maybe I am building my profiles in a wrong way that happens to be right enough to work in most cases?).
Suppose we have a profile with only three graphemes, `t`, `o`, and `oː`, mapped to their respective IPA values; the profile is loaded the standard way (passing the filename to `__init__`, as `pylexibank` does internally). If we `__call__` the tokenizer on a well-formed string such as `toːt`, it will correctly return `t oː t`. However, if we introduce an error into a string such as `toːt=`, it will return `t o � t �`, failing to match `oː`. This is ultimately due to the way missing entries are handled: `transform()` asks for the `graphemes()`, which are built by calling `self.op.tree.parse()` (tokenizer.py:326); if parsing fails, all characters missing from the profile are marked as unknown (tokenizer.py:331). The error is that it should not look for unknown characters, but for unknown graphemes, which can be composed of more than one character (with all the caveats of first looking for the longer graphemes applying).

There would be a few different ways of solving this (if it is supposed to be solved, that is), but I didn't want to proceed on my own without discussing it, especially because any change to `segments` might unintentionally break a few other operations down the line.
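To illustrate the character/grapheme distinction with plain Python (this only mimics the reported behavior, it is not the library's actual code):

```python
profile_graphemes = {'t', 'o', 'oː'}
word = 'toːt='

# current behavior on failure: every *character* not in the profile is flagged,
# so the 'ː' of the two-character grapheme 'oː' is lost
per_character = [c if c in profile_graphemes else '�' for c in word]
print(per_character)   # ['t', 'o', '�', 't', '�']

# desired behavior: greedy matching of *graphemes*, so only '=' is unknown
# expected: ['t', 'oː', 't', '�']
```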