JoshData / pdf-redactor

A general purpose PDF text-layer redaction tool for Python 2/3.
Creative Commons Zero v1.0 Universal

WinAnsiEncoding quirks #6

Open divergentdave opened 7 years ago

divergentdave commented 7 years ago

From the PDF standard:

In WinAnsiEncoding, all unused codes greater than 40 map to the bullet character. However, only code 225 is specifically assigned to the bullet character; other codes are subject to future reassignment.

I fed this document in and got an encoding error that traced back to b'\x81 C'.decode("cp1252", "replace"). There's a bullet point in the corresponding position in the document. It appears that WinAnsiEncoding is a superset of CP-1252, which leaves a few positions unassigned. The Wikipedia article says:

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar maps these to the corresponding C1 control codes.
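For reference, the gap can be reproduced without any PDF at all. The snippet below is only a sketch, assuming Python 3 and its built-in cp1252 codec: it probes every byte value and then decodes the same bytes as in the traceback. The variable names are illustrative and not part of pdf-redactor.

```python
# Find every byte value that Python's cp1252 codec refuses to decode.
undefined = []
for code in range(256):
    try:
        bytes([code]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(code)

# These are the positions the quote above describes as unused.
print([hex(code) for code in undefined])

# With errors="replace" the undefined byte becomes U+FFFD (the replacement
# character), not the bullet that WinAnsiEncoding calls for.
replaced = b"\x81 C".decode("cp1252", "replace")
print(repr(replaced))
```

So the "replace" handler avoids an exception at decode time, but the resulting U+FFFD is not the character the PDF intends at that position.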

JoshData commented 7 years ago

Ugh.

The real way to do all of this is probably to use pdfminer's font modules, which seem to have a pretty complete implementation of encodings and glyph->unicode mappings. The problem is that the pdfrw data structures would have to be mapped to whatever pdfminer can load things from. Also, pdfminer is Py2-only and there's a separate fork for Py3.

Or, we could possibly pull out just the character encoding tables from pdfminer.
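As a rough illustration of the table idea, here is a minimal sketch that does not use pdfminer's tables at all. It assumes Python 3, starts from the built-in cp1252 codec, and fills the positions that codec leaves undefined with the bullet character. The names WINANSI_TABLE, build_winansi_table and decode_winansi are hypothetical. Any other differences between WinAnsiEncoding and CP-1252, and any per-font overrides, are not handled here.

```python
BULLET = "\u2022"


def build_winansi_table():
    """Build a 256-entry byte -> character table (hypothetical helper).

    Assumption: cp1252 is correct wherever it is defined, and every
    position it leaves undefined should decode to the bullet.
    """
    table = []
    for code in range(256):
        try:
            table.append(bytes([code]).decode("cp1252"))
        except UnicodeDecodeError:
            table.append(BULLET)
    return table


WINANSI_TABLE = build_winansi_table()


def decode_winansi(data):
    """Decode a bytes object one byte at a time using the lookup table."""
    return "".join(WINANSI_TABLE[byte] for byte in data)


print(repr(decode_winansi(b"\x81 C")))
```

A flat lookup table like this never raises on any byte value, which is the property the decode call in the traceback was missing.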