cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Fixed bug when tokenizing longer-than-one-char unknown graphemes. #31

Closed tresoldi closed 6 years ago

tresoldi commented 6 years ago

This should solve https://github.com/cldf/segments/issues/27

In [1]: from segments import Profile, Tokenizer

In [2]: prf = Profile({'Grapheme': 't', 'mapping': 't'}, {'Grapheme': 'o', 'mapping': 'o'}, {'Grapheme': 'o:', 'mapping': 'o:'})

In [3]: t = Tokenizer(profile=prf)

In [4]: t('tot')
Out[4]: 't o t'

In [5]: t('to:t')
Out[5]: 't o: t'

In [6]: t('to:t=')
Out[6]: 't o: t �'

Graphemes are re.escaped so no funny stuff should happen.
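To illustrate the idea, here is a minimal, self-contained sketch of longest-match tokenization over `re.escape()`d profile graphemes, with a replacement marker for unmatched characters. This is a simplified model for illustration, not the actual `segments` implementation; `make_tokenizer` is a hypothetical helper.

```python
import re

# Hypothetical sketch: build one regex alternation over the profile's
# graphemes, longest first, with each grapheme re.escape()d so that
# metacharacters such as ':' or '=' are treated literally.
def make_tokenizer(graphemes, replacement="\N{REPLACEMENT CHARACTER}"):
    pattern = re.compile(
        "|".join(re.escape(g) for g in sorted(graphemes, key=len, reverse=True))
    )

    def tokenize(text):
        tokens, pos = [], 0
        while pos < len(text):
            m = pattern.match(text, pos)
            if m:
                # Longest profile grapheme wins, e.g. 'o:' over 'o'.
                tokens.append(m.group())
                pos = m.end()
            else:
                # Unknown character: emit the replacement marker.
                tokens.append(replacement)
                pos += 1
        return " ".join(tokens)

    return tokenize

t = make_tokenizer(["t", "o", "o:"])
t("to:t")   # -> 't o: t'
t("to:t=")  # -> 't o: t \N{REPLACEMENT CHARACTER}'
```

Sorting the alternation by length before joining is what makes `o:` take precedence over `o`, mirroring the session above.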

tresoldi commented 6 years ago

It is triggering a test assertion; I'll investigate.

tresoldi commented 6 years ago

So, this PR causes a number of assertions to fail in test_tokenizer.py/test_errors(), first of all this one:

    def test_errors():
        t = Tokenizer(_test_path('test.prf'), errors_replace=lambda c: '<{0}>'.format(c))
>       assert t('habe') == '<i> a b <e>'
E       AssertionError: assert '� a b �' == '<i> a b <e>'
E         - � a b �
E         + <i> a b <e>

This and the other failing assertions come down to the way self._errors is handled, which may now be outdated or superfluous: the current code should guarantee that we get either a fully parsed tree (as returned by self.op.tree.parse()) or a list of all valid graphemes plus, indistinctly, the replacement marker for every unmatched position.
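In the same spirit as the failing test, a per-character error callback could be threaded into the simplified longest-match sketch like this. Again, this is a hypothetical illustration of the errors_replace behavior, not the library's actual code; graphemes and inputs here are made up.

```python
import re

# Hypothetical sketch: unmatched characters are passed to a user-supplied
# errors_replace callback instead of being mapped to a fixed marker.
def tokenize(text, graphemes, errors_replace=lambda c: "\N{REPLACEMENT CHARACTER}"):
    pattern = re.compile(
        "|".join(re.escape(g) for g in sorted(graphemes, key=len, reverse=True))
    )
    tokens, pos = [], 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            # Delegate the unknown character to the callback.
            tokens.append(errors_replace(text[pos]))
            pos += 1
    return " ".join(tokens)

# A profile with no 'h' and no 'e', using the same style of callback
# as the test: wrap the offending character in angle brackets.
tokenize("habe", ["a", "b"], errors_replace=lambda c: "<{0}>".format(c))
# -> '<h> a b <e>'
```

The open question is whether this per-character callback still has a place once unmatched positions are already represented uniformly in the parse result.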

My idea is to:

Given the current test, for the string habe and a profile with no "h" or "e", we would have:

What is your opinion?