cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

refined functionality of tokenizer for interaction with other libraries #3

Closed · LinguList closed this 7 years ago

LinguList commented 7 years ago

I introduce a couple of refinements (at least I consider them refinements):

Example:

In [1]: from segments.tokenizer import Tokenizer
In [2]: t = Tokenizer([['graphemes', 'ipa'], ['th', 'T'], ['kh', 'K'], ['a', 'a'], ['aa', 'A']])
In [3]: t.transform('khakha', 'ipa')
Out[3]: 'K a K a'
In [4]: t.transform('khaka', 'ipa')
Out[4]: 'K a <k> a'
In [7]: t.transform('khaakaa', 'ipa', missing=lambda x: '('+x+')')
Out[7]: 'K A (k) a a'
In [9]: t.transform('khaak aapa', 'ipa', missing=lambda x: '('+x+')', separator=' + ')
Out[9]: 'K A (k) + A (p) a'

Note that the behaviour is still not entirely as wanted, as seen in Out[7]: after the wrong match of (k), the mapping is no longer greedy, so the trailing aa comes out as a a instead of A.
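
For comparison, here is a minimal sketch (just an illustration, not the package's actual matching code) of strictly greedy longest-match segmentation over a profile dict; with that strategy the trailing aa in 'khaakaa' would still be matched as a unit even after the failed (k):

def greedy_segment(word, profile, missing=lambda g: '<' + g + '>'):
    # Greedy longest-match segmentation against an orthography profile,
    # where `profile` maps grapheme strings to target (e.g. IPA) values.
    max_len = max(len(g) for g in profile)
    out, i = [], 0
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in profile:
                out.append(profile[chunk])
                i += size
                break
        else:
            # no grapheme matched at this position: emit the character as "missing"
            out.append(missing(word[i]))
            i += 1
    return ' '.join(out)

profile = {'th': 'T', 'kh': 'K', 'a': 'a', 'aa': 'A'}
print(greedy_segment('khaakaa', profile, missing=lambda g: '(' + g + ')'))
# -> 'K A (k) A', i.e. aa is still matched greedily after the failed (k)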

xrotwang commented 7 years ago

Unfortunately, I started doing some streamlining of the package that overlaps with your changes: https://github.com/bambooforest/segments/pull/4. Except for the handling of missing graphemes, though, I should have incorporated all of your changes.

LinguList commented 7 years ago

I think I can merge your #4 and then re-submit this PR with the modification for the missing graphemes. Regarding the missing-graphemes code, I was thinking anyway that it is still suboptimal and should rather go into the algorithm that searches the graphemes for matches in general. But reporting suboptimal matches directly may also blow up the search space, and apart from running a Dijkstra-like algorithm that searches for the best combination among all combinations of n-grams, I don't know how to do this in a "complete", non-approximative way. In short: the current solution is pragmatic rather than exact, but I think it is helpful for creating profiles.
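
To make the Dijkstra-like idea concrete, one could treat every string position as a node and every profile grapheme (or a single unmatched character) as a weighted edge, then take the cheapest path from position 0 to the end of the word. The following is only a rough sketch with made-up costs (one per known grapheme, a heavy penalty per missing character), not anything in the package:

def best_segmentation(word, profile, missing_cost=10):
    # Exhaustive best-path segmentation: best[i] is the minimal cost of
    # segmenting word[:i]; back[i] remembers the last edge used to reach i.
    n = len(word)
    best = [float('inf')] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0
    for i in range(n):
        if best[i] == float('inf'):
            continue
        # a known grapheme costs 1, so fewer, longer matches are preferred
        for grapheme, target in profile.items():
            j = i + len(grapheme)
            if word.startswith(grapheme, i) and best[i] + 1 < best[j]:
                best[j] = best[i] + 1
                back[j] = (i, target)
        # falling back to a single "missing" character is heavily penalised
        if best[i] + missing_cost < best[i + 1]:
            best[i + 1] = best[i] + missing_cost
            back[i + 1] = (i, '<' + word[i] + '>')
    tokens, i = [], n
    while i > 0:
        i, token = back[i]
        tokens.append(token)
    return ' '.join(reversed(tokens))

profile = {'th': 'T', 'kh': 'K', 'a': 'a', 'aa': 'A'}
print(best_segmentation('khaakaa', profile))  # -> 'K A <k> A'

The search stays cheap for word-sized inputs, but the costs are arbitrary, which is why the pragmatic handling of missing graphemes seems good enough for creating profiles.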

LinguList commented 7 years ago

Okay, no errors or conflicts with the merge I just created.