Open divergentdave opened 8 years ago
Ugh.
The real way to do all of this is probably to use pdfminer's font modules, which seem to have a pretty complete implementation of encodings and glyph->unicode mappings. The problem is that the pdfrw data structures have to be mapped to whatever pdfminer can load things from. Also, pdfminer is Py2-only, and there's a separate fork for Py3.
Or, we could possibly pull out just the character encoding tables from pdfminer.
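A minimal sketch of what a standalone table could look like. This is not pdfminer's code. It builds a 256-entry byte-to-character table from Python's built-in `cp1252` codec and fills the bytes that codec leaves undefined with a bullet (U+2022). That fill is an assumption, based on the PDF specification's note that unused WinAnsiEncoding codes map to the bullet character. The names `build_win_ansi_table`, `WIN_ANSI_TABLE` and `decode_win_ansi` are hypothetical, and other differences between WinAnsiEncoding and CP-1252 are not handled.

```python
# Hypothetical standalone table; not taken from pdfminer or pdfrw.
BULLET = "\u2022"


def build_win_ansi_table():
    """Return a 256-entry list mapping each byte value to a character."""
    table = []
    for code in range(256):
        try:
            table.append(bytes([code]).decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: bytes that Python's cp1252 codec leaves undefined
            # (0x81, 0x8D, 0x8F, 0x90, 0x9D) are treated as a bullet.
            table.append(BULLET)
    return table


WIN_ANSI_TABLE = build_win_ansi_table()


def decode_win_ansi(data):
    """Decode a bytes object one byte at a time using the table."""
    return "".join(WIN_ANSI_TABLE[b] for b in data)


# ascii() keeps the output printable on any console.
print(ascii(decode_win_ansi(b"\x81 C")))  # prints '\u2022 C'
```

A plain lookup table keeps this independent of both pdfminer and the Py2/Py3 split, at the cost of having to check the table against the spec by hand.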
From the PDF standard:
I fed this document in and got an encoding error that traced back to `b'\x81 C'.decode("cp1252", "replace")`. There's a bullet point in the corresponding position in the document. It appears that WinAnsiEncoding is a superset of CP-1252, because the Wikipedia article says: