cbrunet / python-poppler

Python binding to Poppler-cpp pdf library
GNU General Public License v2.0

Non UTF-8 character fonts cause `UnicodeDecodeError` #50

Open zionsofer opened 2 years ago

zionsofer commented 2 years ago

I'm trying to parse a PDF that contains Chinese characters. The text is extracted okay, but when I try to access fonts, I get the following error:

>>> box.get_font_name()  # Assume the box is extracted from some page, this box contains Chinese characters
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<PATH>/lib/python3.7/site-packages/poppler/utilities.py", line 90, in wrapped
    return fct(*args, **kwargs)
  File "<PATH>/lib/python3.7/site-packages/poppler/page.py", line 64, in get_font_name
    return self._text_box.get_font_name(i)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 7: invalid start byte

Trying to iterate fonts through the document itself results in the same error.

Environment: Python 3.7.4, Poppler 21.12.0 (compiled from source). This happens on both macOS and Ubuntu.

I have seen other Poppler bindings, such as this one, that handle those errors (by using the replace error handler when decoding the string). Unfortunately, it uses deprecated internal APIs and cannot be used with a newer version of Poppler (even when building from source).

If there were a way to supply the required encoding, or even to suppress or ignore those errors, it would be very beneficial. I have seen a comment on another ticket saying that we could request that the encoding/decoding be exposed in the cpp backend.
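To illustrate what "suppress/ignore" would mean here, below is a minimal sketch in plain Python of decoding raw font-name bytes with a lenient error handler. It is not python-poppler API: the binding currently decodes the name internally, so the raw bytes are not available to user code. The helper `decode_font_name` and the sample bytes (a subset prefix followed by a GBK-encoded name) are hypothetical and chosen only to reproduce the same kind of error.

```python
def decode_font_name(raw: bytes, errors: str = "replace") -> str:
    """Decode font-name bytes as UTF-8 without raising on invalid bytes.

    Hypothetical helper: python-poppler does not expose raw bytes today.
    """
    return raw.decode("utf-8", errors=errors)


# Hypothetical sample: a subset prefix plus a Chinese font name encoded as GBK.
raw_name = b"ABCDEF+" + "黑体".encode("gbk")

try:
    raw_name.decode("utf-8")  # strict decoding, as in the traceback above
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc)

# Invalid bytes are replaced with U+FFFD instead of raising.
print(decode_font_name(raw_name))

# Invalid bytes are silently dropped instead of raising.
print(decode_font_name(raw_name, errors="ignore"))
```

Either handler loses the original name, so this only avoids the exception; recovering the real name would still require knowing the encoding.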

cbrunet commented 2 years ago

poppler-cpp returns the font name as a std::string, not as a ustring. Therefore, I think the bug must be resolved upstream, unless we use some heuristics to guess the encoding, which would probably be fragile.
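For context, an encoding-guessing heuristic of the kind mentioned above might look like the following sketch. It is not part of python-poppler; the function `guess_decode`, the candidate list, and the sample bytes are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Assumed candidate list; the order decides which encoding wins.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030", "big5", "shift_jis")


def guess_decode(
    raw: bytes, candidates: Sequence[str] = CANDIDATE_ENCODINGS
) -> Tuple[str, Optional[str]]:
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Falls back to UTF-8 with replacement characters and encoding None.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace"), None


# Hypothetical sample: a subset prefix plus a GB18030-encoded Chinese name.
raw_name = b"ABCDEF+" + "黑体".encode("gb18030")
text, encoding = guess_decode(raw_name)
print(text, encoding)
```

The fragility is visible in the candidate order: a successful decode only shows that the bytes are valid in that encoding, not that it is the one the PDF producer used. Multi-byte CJK encodings overlap heavily, so a name written in one of them can plausibly decode without error under another and yield the wrong characters.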