cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

refined functionality of tokenizer for interaction with other libraries #3

Closed · LinguList closed this 7 years ago

LinguList commented 7 years ago

I introduce a couple of refinements (at least I consider them refinements):

Example:

In [1]: from segments.tokenizer import Tokenizer
In [2]: t = Tokenizer([['graphemes', 'ipa'], ['th', 'T'], ['kh', 'K'], ['a', 'a'], ['aa', 'A']])
In [3]: t.transform('khakha', 'ipa')
Out[3]: 'K a K a'
In [4]: t.transform('khaka', 'ipa')
Out[4]: 'K a <k> a'
In [7]: t.transform('khaakaa', 'ipa', missing=lambda x: '('+x+')')
Out[7]: 'K A (k) a a'
In [9]: t.transform('khaak aapa', 'ipa', missing=lambda x: '('+x+')', separator=' + ')
Out[9]: 'K A (k) + A (p) a'

Note that the behaviour is still not entirely as wanted, as seen in Out[7]: after the wrong match of (k), the mapping is no longer greedy, so the trailing aa comes out as a a instead of A.
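
For comparison, here is a minimal sketch (just an illustration, not the package's actual matching code) of strictly greedy longest-match segmentation over a profile dict; with that strategy the trailing aa in 'khaakaa' would still be matched as a unit even after the failed (k):

def greedy_segment(word, profile, missing=lambda g: '<' + g + '>'):
    # Greedy longest-match segmentation against an orthography profile,
    # where `profile` maps grapheme strings to target (e.g. IPA) values.
    max_len = max(len(g) for g in profile)
    out, i = [], 0
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in profile:
                out.append(profile[chunk])
                i += size
                break
        else:
            # no grapheme matched at this position: emit the character as "missing"
            out.append(missing(word[i]))
            i += 1
    return ' '.join(out)

profile = {'th': 'T', 'kh': 'K', 'a': 'a', 'aa': 'A'}
print(greedy_segment('khaakaa', profile, missing=lambda g: '(' + g + ')'))
# -> 'K A (k) A', i.e. aa is still matched greedily after the failed (k)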

xrotwang commented 7 years ago

Unfortunately, I started doing some streamlining of the package that overlaps with your changes: https://github.com/bambooforest/segments/pull/4. Except for the handling of missing graphemes, though, I should have incorporated all of your changes.

LinguList commented 7 years ago

I think I can merge your #4 and then re-submit this PR with the modification for the missing graphemes. Regarding the missing-graphemes code, I was thinking anyway that it is still suboptimal and should rather go into the algorithm that searches the graphemes for matches in general. But reporting suboptimal matches directly may also blow up the search space, and apart from running a Dijkstra-like algorithm that searches for the best combination among all combinations of n-grams, I don't know how to do this in a "complete", non-approximative way. In short: the current solution is pragmatic rather than exact, but I think it is helpful for creating profiles.
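
To make the Dijkstra-like idea concrete, one could treat every string position as a node and every profile grapheme (or a single unmatched character) as a weighted edge, then take the cheapest path from position 0 to the end of the word. The following is only a rough sketch with made-up costs (one per known grapheme, a heavy penalty per missing character), not anything in the package:

def best_segmentation(word, profile, missing_cost=10):
    # Exhaustive best-path segmentation: best[i] is the minimal cost of
    # segmenting word[:i]; back[i] remembers the last edge used to reach i.
    n = len(word)
    best = [float('inf')] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0
    for i in range(n):
        if best[i] == float('inf'):
            continue
        # a known grapheme costs 1, so fewer, longer matches are preferred
        for grapheme, target in profile.items():
            j = i + len(grapheme)
            if word.startswith(grapheme, i) and best[i] + 1 < best[j]:
                best[j] = best[i] + 1
                back[j] = (i, target)
        # falling back to a single "missing" character is heavily penalised
        if best[i] + missing_cost < best[i + 1]:
            best[i + 1] = best[i] + missing_cost
            back[i + 1] = (i, '<' + word[i] + '>')
    tokens, i = [], n
    while i > 0:
        i, token = back[i]
        tokens.append(token)
    return ' '.join(reversed(tokens))

profile = {'th': 'T', 'kh': 'K', 'a': 'a', 'aa': 'A'}
print(best_segmentation('khaakaa', profile))  # -> 'K A <k> A'

The search stays cheap for word-sized inputs, but the costs are arbitrary, which is why the pragmatic handling of missing graphemes seems good enough for creating profiles.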

LinguList commented 7 years ago

Okay, no errors or conflicts with the merge I just created.