This code is only a couple dozen lines long, so how is it possible it has a bug? Unicode, that's how.
When this error is thrown and the input is a unicode object containing non-ascii characters, this code throws something like this:
Error
Traceback (most recent call last):
File "/home/mlissner/Programming/intellij/courtlistener/cl/lib/tests.py", line 554, in test_normalize_atty_contact
result = normalize_attorney_contact(pair['q'])
File "/home/mlissner/Programming/intellij/courtlistener/cl/lib/pacer.py", line 439, in normalize_attorney_contact
tag_mapping=mapping,
File "/home/mlissner/.virtualenvs/courtlistener/local/lib/python2.7/site-packages/usaddress/__init__.py", line 178, in tag
label)
File "/home/mlissner/.virtualenvs/courtlistener/local/lib/python2.7/site-packages/probableparsing/__init__.py", line 22, in __init__
repo_url=self.REPO_URL)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 55: ordinal not in range(128)
This is unfortunate because my calling code is designed to catch RepeatedLabelErrors not UnicodeEncodeErrors (and ideally it'd stay that way).
I guess the solution is to assume that people are going to use unicode within their code instead of strings, and to tweak the code by adding a u before the definition of MESSAGE and DOCS_MESSAGE. When I do that, it fixes the bug.
What I haven't thought through is what happens if people are using strings in the parser instead of using unicode objects. In that case, do things get better or worse?
This code is only a couple dozen lines long, so how is it possible it has a bug? Unicode, that's how.
When this error is thrown and the input is a unicode object containing non-ascii characters, this code throws something like this:
This is unfortunate because my calling code is designed to catch RepeatedLabelErrors not UnicodeEncodeErrors (and ideally it'd stay that way).
I guess the solution is to assume that people are going to use unicode within their code instead of strings, and to tweak the code by adding a
u
before the definition ofMESSAGE
andDOCS_MESSAGE
. When I do that, it fixes the bug.What I haven't thought through is what happens if people are using strings in the parser instead of using unicode objects. In that case, do things get better or worse?