JoshData / pdf-redactor

A general purpose PDF text-layer redaction tool for Python 2/3.
Creative Commons Zero v1.0 Universal

deleted letters from PDF even if those letters were present in the source document #27

Open jakubsiast opened 3 years ago

jakubsiast commented 3 years ago

I used pdf-redactor to change some text in a PDF file, but in part of the PDF I lost all the 'n' characters. The affected text was not the text I intended to change. In "class TextToken", the "str(self)" function treats it as unchanged text, i.e., it passes the condition "if self.value == self.original_value:". Nevertheless, the output has changed. I tracked the problem down to the call "PdfString.from_bytes(...)" at line 379 of pdf_redactor.py:

        # If unchanged, return the raw original value without decoding/encoding.
        return PdfString.from_bytes(self.raw_original_value)

By forcing the encoding of the unchanged TextToken to 'hex', I managed to fix the issue:

        return PdfString.from_bytes(self.raw_original_value, bytes_encoding='hex')

This simple change helped in my case, but I do not know whether it holds in general. Could you try it and, if it works, push the fix to your code?
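
For anyone unfamiliar with the 'hex' form mentioned above, here is a minimal standalone sketch of what a PDF hexadecimal string looks like and why it round-trips bytes such as 'n', backslashes, and parentheses without any escaping. It uses only the standard library and does not reproduce pdfrw's internals; the helper names `to_pdf_hex_string` and `from_pdf_hex_string` are hypothetical, made up for this illustration. Whether escaping is actually what drops the 'n' characters in my file is an assumption I have not verified.

```python
import binascii


def to_pdf_hex_string(raw: bytes) -> str:
    """Encode raw bytes as a PDF hexadecimal string, e.g. b'n' -> '<6E>'."""
    return "<" + binascii.hexlify(raw).decode("ascii").upper() + ">"


def from_pdf_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string such as '<6E>' back to raw bytes."""
    # Drop the angle brackets and any whitespace between the hex digits.
    digits = "".join(token.strip()[1:-1].split())
    # Per the PDF spec, a missing final digit is assumed to be 0.
    if len(digits) % 2:
        digits += "0"
    return binascii.unhexlify(digits)


# Bytes that would need escaping in a literal "(...)" string survive unchanged.
raw = b"no\\n (paren) \n"
encoded = to_pdf_hex_string(raw)
assert from_pdf_hex_string(encoded) == raw
```

Every byte becomes two hex digits, so nothing in the payload can be mistaken for an escape sequence or a delimiter. The cost is that the string is twice as long as the raw bytes.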