cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Fixed bug when tokenizing longer-than-one-char unknown graphemes. #31

Closed tresoldi closed 6 years ago

tresoldi commented 6 years ago

This should solve https://github.com/cldf/segments/issues/27

In [1]: from segments import Profile, Tokenizer

In [2]: prf = Profile({'Grapheme': 't', 'mapping': 't'}, {'Grapheme': 'o', 'mapping': 'o'}, {'Grapheme': 'o:', 'mapping': 'o:'})

In [3]: t = Tokenizer(profile=prf)

In [4]: t('tot')
Out[4]: 't o t'

In [5]: t('to:t')
Out[5]: 't o: t'

In [6]: t('to:t=')
Out[6]: 't o: t �'

Graphemes are re.escaped so no funny stuff should happen.
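To illustrate the idea, here is a minimal, self-contained sketch of longest-match tokenization over `re.escape()`d profile graphemes, with a replacement marker for unmatched characters. This is a simplified model for illustration, not the actual `segments` implementation; `make_tokenizer` is a hypothetical helper.

```python
import re

# Hypothetical sketch: build one regex alternation over the profile's
# graphemes, longest first, with each grapheme re.escape()d so that
# metacharacters such as ':' or '=' are treated literally.
def make_tokenizer(graphemes, replacement="\N{REPLACEMENT CHARACTER}"):
    pattern = re.compile(
        "|".join(re.escape(g) for g in sorted(graphemes, key=len, reverse=True))
    )

    def tokenize(text):
        tokens, pos = [], 0
        while pos < len(text):
            m = pattern.match(text, pos)
            if m:
                # Longest profile grapheme wins, e.g. 'o:' over 'o'.
                tokens.append(m.group())
                pos = m.end()
            else:
                # Unknown character: emit the replacement marker.
                tokens.append(replacement)
                pos += 1
        return " ".join(tokens)

    return tokenize

t = make_tokenizer(["t", "o", "o:"])
t("to:t")   # -> 't o: t'
t("to:t=")  # -> 't o: t \N{REPLACEMENT CHARACTER}'
```

Sorting the alternation by length before joining is what makes `o:` take precedence over `o`, mirroring the session above.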

tresoldi commented 6 years ago

It is triggering a test assertion; I'll investigate.

tresoldi commented 6 years ago

So, this PR causes a number of assertions to fail in test_tokenizer.py/test_errors(), first of all this one:

    def test_errors():
        t = Tokenizer(_test_path('test.prf'), errors_replace=lambda c: '<{0}>'.format(c))
>       assert t('habe') == '<i> a b <e>'
E       AssertionError: assert '� a b �' == '<i> a b <e>'
E         - � a b �
E         + <i> a b <e>

This and the other failing assertions come down to the way self._errors is handled, which may now be outdated or superfluous: the current code should guarantee that we get either a fully parsed tree (as returned by self.op.tree.parse()) or a list of all valid graphemes plus, indistinctly, the replacement marker for every unmatched position.
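In the same spirit as the failing test, a per-character error callback could be threaded into the simplified longest-match sketch like this. Again, this is a hypothetical illustration of the errors_replace behavior, not the library's actual code; graphemes and inputs here are made up.

```python
import re

# Hypothetical sketch: unmatched characters are passed to a user-supplied
# errors_replace callback instead of being mapped to a fixed marker.
def tokenize(text, graphemes, errors_replace=lambda c: "\N{REPLACEMENT CHARACTER}"):
    pattern = re.compile(
        "|".join(re.escape(g) for g in sorted(graphemes, key=len, reverse=True))
    )
    tokens, pos = [], 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            # Delegate the unknown character to the callback.
            tokens.append(errors_replace(text[pos]))
            pos += 1
    return " ".join(tokens)

# A profile with no 'h' and no 'e', using the same style of callback
# as the test: wrap the offending character in angle brackets.
tokenize("habe", ["a", "b"], errors_replace=lambda c: "<{0}>".format(c))
# -> '<h> a b <e>'
```

The open question is whether this per-character callback still has a place once unmatched positions are already represented uniformly in the parse result.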

My idea is to:

Given the current test, for the string habe and a profile with no "h" or "e", we would have:

What is your opinion?