Non UTF-8 character fonts cause `UnicodeDecodeError`

I'm trying to parse a PDF that contains Chinese characters. Text extraction works, but accessing the font names raises the following error:

>>> box.get_font_name()  # box was extracted from a page and contains Chinese characters
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<PATH>/lib/python3.7/site-packages/poppler/utilities.py", line 90, in wrapped
    return fct(*args, **kwargs)
  File "<PATH>/lib/python3.7/site-packages/poppler/page.py", line 64, in get_font_name
    return self._text_box.get_font_name(i)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 7: invalid start byte

Iterating over the fonts through the document itself raises the same error.

Environment: Python 3.7.4, Poppler 21.12.0 (compiled from source). Happens on both macOS and Ubuntu.

I have seen other poppler bindings, such as this one, that handle these errors (by using the `replace` error handler when decoding the string). Unfortunately, it relies on deprecated internal APIs and cannot be used with a newer version of poppler, even when building from source.
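For reference, the difference between strict decoding and the `replace` error handler can be reproduced in plain Python, without poppler. The byte string below is made up for illustration and is not the actual font name from my PDF; I only chose it so that byte `0xba` sits at position 7, as in the traceback.

```python
# Hypothetical raw font name: 7 ASCII bytes followed by non-UTF-8 bytes.
raw = b"ABCDEF+\xba\xda\xcc\xe5"

# Strict decoding (the default) raises on the first invalid byte.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc)

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
lenient = raw.decode("utf-8", errors="replace")
print(lenient)
```

The ASCII prefix survives and the undecodable bytes become replacement characters, which would be good enough for my use case.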

If there were a way to supply the required encoding, or even to suppress or ignore those errors, it would be very beneficial. A comment on another ticket suggests that we could request that the encoding/decoding be exposed in the C++ backend.
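To make the request more concrete, here is a rough sketch in plain Python of the behaviour I have in mind. `decode_font_name` and its parameters are hypothetical names of my own and are not part of any existing API as far as I know. The sketch also assumes the raw bytes of the font name could be made available to the Python side.

```python
from typing import Sequence


def decode_font_name(
    raw: bytes,
    encodings: Sequence[str] = ("utf-8",),
    errors: str = "replace",
) -> str:
    """Hypothetical helper: try each encoding strictly, then fall back."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # No candidate encoding fit: decode as UTF-8 with the chosen error handler.
    return raw.decode("utf-8", errors=errors)


# Usage with a made-up font name encoded as GBK.
raw_name = "SimHei-黑体".encode("gbk")
print(decode_font_name(raw_name, encodings=("utf-8", "gbk")))
```

With only the default `("utf-8",)` the call would still return a string, just with replacement characters in place of the undecodable bytes.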

cbrunet / python-poppler