WinAnsiEncoding quirks

JoshData / pdf-redactor

A general purpose PDF text-layer redaction tool for Python 2/3.

Creative Commons Zero v1.0 Universal

183 stars, 61 forks

From the PDF standard:

In WinAnsiEncoding, all unused codes greater than 40 map to the bullet character. However, only code 225 is specifically assigned to the bullet character; other codes are subject to future reassignment.

I fed this document in and got an encoding error that traced back to b'\x81 C'.decode("cp1252", "replace"). There's a bullet point at the corresponding position in the document. WinAnsiEncoding therefore appears to be a superset of CP-1252, which leaves byte 0x81 unassigned; the Wikipedia article says:

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar maps these to the corresponding C1 control codes.
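One possible workaround, sketched here under stated assumptions rather than taken from pdf-redactor itself: remap the five byte values that CP-1252 leaves unused to 0x95 (octal 225, the code the PDF standard assigns to the bullet) before decoding. The names `CP1252_UNUSED` and `decode_winansi` are hypothetical, the sketch assumes Python 3 (`bytes.maketrans`), and it covers only the five positions named in the Wikipedia quote, not every code the standard may treat as unused.

```python
# Byte values that CP-1252 leaves unassigned (per the Wikipedia quote above).
CP1252_UNUSED = bytes([0x81, 0x8D, 0x8F, 0x90, 0x9D])

# Translation table sending each unused byte to 0x95 (octal 225),
# which CP-1252 decodes as U+2022 BULLET.
_UNUSED_TO_BULLET = bytes.maketrans(CP1252_UNUSED, b"\x95" * len(CP1252_UNUSED))


def decode_winansi(data: bytes) -> str:
    """Decode WinAnsiEncoding bytes, treating unused codes as bullets.

    Hypothetical helper: assumes the remaining bytes decode cleanly
    as CP-1252.
    """
    return data.translate(_UNUSED_TO_BULLET).decode("cp1252")
```

With this helper, `decode_winansi(b"\x81 C")` yields a bullet followed by `" C"`, whereas a strict `b"\x81 C".decode("cp1252")` raises `UnicodeDecodeError` in Python 3.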

WinAnsiEncoding quirks #6