atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Problem with parsing PDF table with unicode characters #322

Closed OlegGavrilov closed 4 years ago

OlegGavrilov commented 5 years ago

Hello! Sorry for reporting a minor issue, but when I tried to parse a table containing Unicode characters using the Excalibur front-end, I got this error:

ERROR:root:'ascii' codec can't encode character u'\xf6' in position 376: ordinal not in range(128)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/excalibur/tasks.py", line 123, in extract
    tables.export(f_datapath, f=f, compress=True)
  File "/usr/local/lib/python2.7/dist-packages/camelot/core.py", line 701, in export
    self._write_file(f=f, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/camelot/core.py", line 659, in _write_file
    to_format(filepath)
  File "/usr/local/lib/python2.7/dist-packages/camelot/core.py", line 594, in to_html
    f.write(html_string)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 376: ordinal not in range(128)

I fixed that by adding .encode('utf-8') at core.py:594, the f.write(html_string) call shown in the traceback.
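
For anyone hitting the same traceback, here is a minimal standalone sketch of the failure and of one way around it. It is not Camelot's actual code: `write_html` is a hypothetical helper, and the sample `html_string` is made up for illustration. It assumes the root cause is a unicode string being written through an ASCII codec. Instead of calling .encode('utf-8') on the string, this variant opens the file with an explicit encoding via `io.open`, which behaves the same on Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Made-up sample cell content containing u'\xf6', the character from the traceback.
html_string = u"<table><tr><td>K\xf6ln</td></tr></table>"

# Reproduce the root cause: the ASCII codec cannot represent u'\xf6'.
try:
    html_string.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True


def write_html(path, text):
    """Hypothetical helper: write unicode text to path as UTF-8."""
    # io.open takes an explicit encoding on both Python 2 and 3,
    # so the write never falls back to the ASCII codec.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)


path = os.path.join(tempfile.mkdtemp(), "table.html")
write_html(path, html_string)

# Read the file back as raw bytes to check what was actually written.
with io.open(path, "rb") as f:
    raw = f.read()
```

Both approaches put the same UTF-8 bytes on disk. Setting the encoding when the file is opened keeps the string as unicode up to the write call, which avoids mixing bytes and text elsewhere in the code.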

I don't know if this is a good fix, but I hope it helps someone.

Thanks for the amazing project!

ngenovictor commented 5 years ago

Got the same error too and the change also worked out for me.

vinayak-mehta commented 4 years ago

Closing because there's no PDF to reproduce this issue.