LinguList closed this 7 years ago
Unfortunately, I started streamlining the package in a way that overlaps with your changes: https://github.com/bambooforest/segments/pull/4 Except for the handling of missing graphemes, I should have incorporated all of your changes, though.
I think I can merge your #4 and then re-submit this PR with the modification for the missing graphemes. As for that missing-graphemes code, I was thinking anyway that it is still suboptimal and should rather go into the algorithm that searches for grapheme matches in general. But reporting suboptimal matches directly may also blow up the search space, and apart from running a Dijkstra-like algorithm that searches for the best suboptimal combination among all combinations of ngrams, I don't know how to do this in a "complete" and non-approximate way. So in short: the current solution is pragmatic rather than exact, but I think it is helpful for creating profiles.
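To make the Dijkstra-like idea a bit more concrete, here is a minimal sketch. It is not code from the package: the function name `best_segmentation`, the `missing` marker, and the cost model (0 for a grapheme found in the profile, 1 for each character that has to be marked as missing) are all assumptions made for illustration, and the profile is a hypothetical one.

```python
import heapq


def best_segmentation(word, profile, missing="\ufffd"):
    """Dijkstra-style search over string positions (illustrative sketch).

    Nodes are positions in `word`. An edge of cost 0 consumes a grapheme
    that is in `profile`; an edge of cost 1 consumes one character and
    marks it as missing. Returns (cost, segments) with minimal cost.
    """
    max_len = max((len(g) for g in profile), default=1)
    n = len(word)
    queue = [(0, 0, ())]  # (cost so far, position, segments so far)
    settled = set()       # positions whose best cost is already known
    while queue:
        cost, pos, segs = heapq.heappop(queue)
        if pos == n:
            return cost, list(segs)
        if pos in settled:
            continue
        settled.add(pos)
        # Cost-0 edges: every profile grapheme that starts at this position.
        for length in range(min(max_len, n - pos), 0, -1):
            chunk = word[pos:pos + length]
            if chunk in profile:
                heapq.heappush(queue, (cost, pos + length, segs + (chunk,)))
        # Cost-1 edge: treat the current character as a missing grapheme.
        heapq.heappush(queue, (cost + 1, pos + 1, segs + (missing,)))
    raise AssertionError("unreachable: the end position is always reachable")


# Hypothetical profile, chosen so that longest-match-first would go wrong.
profile = {"ts", "t", "sh", "a"}
print(best_segmentation("tsha", profile))  # (0, ['t', 'sh', 'a'])
print(best_segmentation("axa", profile))   # (1, ['a', '\ufffd', 'a'])
```

Because only one entry per position is ever expanded, the search stays linear in the word length times the longest grapheme, so the blow-up I worry about would only come from reporting all suboptimal alternatives rather than just the best one.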
Okay, no errors or conflicts with the merge I just created.
I introduced a couple of refinements (at least I consider them refinements):
Example:
Note that the behaviour is still not completely as wanted, as seen in out[7], as the mapping is not greedy after the wrong match of (k).
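For readers without the session output at hand, here is a small sketch of the general problem with committing to an early match. It does not reproduce out[7] and is not the package's implementation: `greedy_segment`, the `missing` marker, and the profile are hypothetical and only show how a longest-match-first tokenizer with a missing-grapheme fallback can be derailed by one wrong match.

```python
def greedy_segment(word, profile, missing="\ufffd"):
    """Longest-match-first segmentation with a marker for missing graphemes.

    Illustrative sketch only: at each position the longest grapheme found
    in `profile` is taken; if nothing matches, one character is consumed
    and replaced by `missing`.
    """
    max_len = max((len(g) for g in profile), default=1)
    segments, pos = [], 0
    while pos < len(word):
        for length in range(min(max_len, len(word) - pos), 0, -1):
            chunk = word[pos:pos + length]
            if chunk in profile:
                segments.append(chunk)
                pos += length
                break
        else:
            # No grapheme of any length matched at this position.
            segments.append(missing)
            pos += 1
    return segments


# Hypothetical profile: "ts" is matched first, which strands the "h".
profile = {"ts", "t", "sh", "a"}
print(greedy_segment("tsha", profile))  # ['ts', '\ufffd', 'a']
```

Here the segmentation t + sh + a would have covered the whole word, but once "ts" is taken the remaining "h" can only be reported as missing.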