Closed tresoldi closed 6 years ago
It is triggering a test assertion; I'll investigate.
So, this PR causes a number of assertions to fail in test_tokenizer.py/test_errors(), first of all this one:
    def test_errors():
        t = Tokenizer(_test_path('test.prf'), errors_replace=lambda c: '<{0}>'.format(c))
>       assert t('habe') == '<i> a b <e>'
E       AssertionError: assert '� a b �' == '<i> a b <e>'
E         - � a b �
E         + <i> a b <e>
This and other assertions fail because of the way self._errors is handled, which might now be outdated or superfluous: the current code should guarantee that we have either a fully parsed tree (as returned by self.op.tree.parse()) or a list of all valid graphemes plus an undifferentiated � for every unmatched position.
My idea is to:
- change error replace from a dictionary to a string, so that the user can specify something different from � (but all unmatched positions would still be reported by one and only one string, with no more "try to guess the grapheme by its first character")
- raise an exception (ValueError) when self.op.tree.parse() fails
- implement error ignore by (under the hood) replacing � with an empty string.
Given the current test, for the string habe and a profile with neither "h" nor "e", we would have:
- � a b � by default
- 0 a b 0 with error replace and 0 as the value (e.g.)
- ValueError with error strict
- a b with error ignore
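To make the proposal concrete, here is a minimal sketch of the three modes applied to an already tokenized list. The helper name resolve_errors, its errors and replacement parameters, and the UNMATCHED constant are hypothetical and only illustrate the idea; they are not the current segments API.

```python
UNMATCHED = "\ufffd"  # the single placeholder used for every unmatched position


def resolve_errors(segments, errors="replace", replacement=UNMATCHED):
    """Hypothetical helper: apply one error mode to a list of segments.

    `segments` is assumed to hold valid graphemes plus UNMATCHED for each
    position the profile could not match.
    """
    if errors == "strict":
        # Strict mode: refuse any input with unmatched positions.
        if UNMATCHED in segments:
            raise ValueError("input contains unmatched positions")
    elif errors == "ignore":
        # Ignore mode is replace mode with an empty replacement string.
        replacement = ""
    elif errors != "replace":
        raise ValueError("unknown error mode: {0}".format(errors))

    resolved = [replacement if s == UNMATCHED else s for s in segments]
    # Drop empty strings so that ignored positions leave no stray spaces.
    return " ".join(s for s in resolved if s)


# "habe" against a profile with neither "h" nor "e"
habe = [UNMATCHED, "a", "b", UNMATCHED]
by_default = resolve_errors(habe)
with_zero = resolve_errors(habe, replacement="0")
ignored = resolve_errors(habe, errors="ignore")
```

Under these assumptions the default keeps the placeholder, a replacement of "0" yields 0 a b 0, ignore yields a b, and strict raises ValueError.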
What is your opinion?
This should solve https://github.com/cldf/segments/issues/27
Graphemes are re.escape'd, so no funny stuff should happen.
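A small illustration of why the escaping matters, assuming graphemes end up in a regular-expression alternation. The helper build_grapheme_pattern and the sample graphemes are hypothetical, and this is not necessarily how the PR builds its matcher.

```python
import re


def build_grapheme_pattern(graphemes):
    """Hypothetical helper: compile an alternation of literal graphemes."""
    # Longest graphemes first, so multi-character graphemes win over prefixes.
    ordered = sorted(graphemes, key=len, reverse=True)
    # re.escape makes characters such as "." or "(" match only themselves.
    return re.compile("|".join(re.escape(g) for g in ordered))


pattern = build_grapheme_pattern(["a", "b", "a.", "(c)"])
literal_matches = pattern.findall("a.b(c)a")
plain_matches = pattern.findall("ab")
```

Without the escaping, a grapheme like a. would be read as "a followed by any character" and would swallow the b in ab.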