datamade / probableparsing

Common methods for probable parsers
MIT License
5 stars 7 forks source link

Unicode objects with non-ascii characters throw UnicodeEncodingError exception (instead of RepeatedLabelError) #2

Open mlissner opened 6 years ago

mlissner commented 6 years ago

This code is only a couple dozen lines long, so how is it possible it has a bug? Unicode, that's how.

When this error is thrown and the input is a unicode object containing non-ascii characters, this code throws something like this:

Error
Traceback (most recent call last):
  File "/home/mlissner/Programming/intellij/courtlistener/cl/lib/tests.py", line 554, in test_normalize_atty_contact
    result = normalize_attorney_contact(pair['q'])
  File "/home/mlissner/Programming/intellij/courtlistener/cl/lib/pacer.py", line 439, in normalize_attorney_contact
    tag_mapping=mapping,
  File "/home/mlissner/.virtualenvs/courtlistener/local/lib/python2.7/site-packages/usaddress/__init__.py", line 178, in tag
    label)
  File "/home/mlissner/.virtualenvs/courtlistener/local/lib/python2.7/site-packages/probableparsing/__init__.py", line 22, in __init__
    repo_url=self.REPO_URL)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 55: ordinal not in range(128)

This is unfortunate because my calling code is designed to catch RepeatedLabelErrors not UnicodeEncodeErrors (and ideally it'd stay that way).

I guess the solution is to assume that people are going to use unicode within their code instead of strings, and to tweak the code by adding a u before the definition of MESSAGE and DOCS_MESSAGE. When I do that, it fixes the bug.

What I haven't thought through is what happens if people are using strings in the parser instead of using unicode objects. In that case, do things get better or worse?

fgregg commented 6 years ago

You are right that the fix is to make sure that the right fix is to make sure that MESSAGE and DOCS_MESSAGE are unicode.

Would you mind making that PR to https://github.com/datamade/probableparsing