Unicode data

stevenicholls99 commented 6 years ago

I am trying to build a parser for JP, so the training data is saved as UTF-8. However, parserator throws a UnicodeDecodeError. Is there anything I can do to work around this? newaddr.csv is attached (saved as .txt): newaddr.txt

parserator label training/newaddr.csv training/newaddr.xml usaddress
Traceback (most recent call last):
  File "c:\python27\lib\runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "c:\python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "c:\Python27\scripts\parserator.exe\__main__.py", line 9, in <module>
  File "c:\python27\lib\site-packages\parserator\main.py", line 58, in dispatch
    args.func(args)
  File "c:\python27\lib\site-packages\parserator\main.py", line 79, in label
    manual_labeling.label(module, infile_path, outfile_path)
  File "c:\python27\lib\site-packages\parserator\manual_labeling.py", line 207, in label
    strings = set(row[0] for row in reader)
  File "c:\python27\lib\site-packages\parserator\manual_labeling.py", line 207, in <genexpr>
    strings = set(row[0] for row in reader)
  File "c:\python27\lib\site-packages\backports\csv.py", line 394, in next
    lineobj = next(self.input_iter)
  File "c:\python27\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4: character maps to <undefined>
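For anyone reading the traceback: the last frames show the file being decoded with cp1252 (the Windows default) rather than UTF-8. The snippet below is a minimal sketch that reproduces the same kind of failure in isolation. The sample character is an assumption chosen for illustration (its UTF-8 encoding happens to contain the byte 0x9d); the actual contents of newaddr.csv are not shown in this thread.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: this character's UTF-8 encoding contains the byte 0x9d,
# which has no mapping in Python's cp1252 codec.
data = u"東".encode("utf-8")  # three bytes: 0xe6 0x9d 0xb1

try:
    data.decode("cp1252")  # what happens when the Windows default is used
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    bad_byte = data[exc.start:exc.end]  # the byte the codec rejected

# Decoding with the encoding the file was actually saved in succeeds.
decoded = data.decode("utf-8")
```

Here the error comes from the codec used to read the bytes, not from the CSV parsing itself, so the fix has to be applied where the file is opened.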

jeancochrane commented 6 years ago

This isn't a part of the codebase that I'm super familiar with, but you might be able to hack together a solution by modifying the list comprehension in parserator that's failing. I'd try something like:

strings = set(row[0].decode('utf-8') for row in reader)

fgregg commented 6 years ago

The more correct thing to do here would be to optionally use backports.csv when we are running on Python 2. https://pypi.python.org/pypi/backports.csv
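A rough sketch of that idea, not the code from the eventual fix: pick backports.csv on Python 2, the standard csv module on Python 3, and open the file with an explicit UTF-8 encoding so the platform default (cp1252 in the traceback above) is never used. The helper name read_first_column and its encoding parameter are hypothetical, introduced only for illustration.

```python
# -*- coding: utf-8 -*-
import io
import sys

if sys.version_info[0] < 3:
    # Assumption: the backports.csv package is installed on Python 2.
    from backports import csv
else:
    import csv


def read_first_column(path, encoding="utf-8"):
    """Return the set of first-column values from a CSV file.

    Hypothetical helper: the file is opened as text with an explicit
    encoding instead of relying on the platform default.
    """
    with io.open(path, "r", encoding=encoding, newline="") as infile:
        reader = csv.reader(infile)
        # Skip empty rows so row[0] cannot raise IndexError.
        return set(row[0] for row in reader if row)
```

Usage would look something like strings = read_first_column("training/newaddr.csv"), assuming the file really is UTF-8.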

fgregg commented 6 years ago

closed by https://github.com/datamade/parserator/commit/f61465e82867f94daf5e781d8f3b498b0713d061

datamade / usaddress

Unicode data #215