Open IvanIFChen opened 6 years ago
Interesting!
PDF text handling is very complicated, and this may be a bug in how pdfrw understands the character encoding mechanism in PDFs. Or the PDF could have been generated invalidly (but I wouldn't know how to figure that out). In PDFs, the rules that map font glyphs (e.g. characters) to a binary representation in the file are pretty ridiculous.
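To illustrate why this mapping is fragile, here is a toy sketch, not a diagnosis of this bug. The two tables are hypothetical stand-ins for per-font code-to-character maps; real PDFs define these through font encodings and related structures. The point is only that the same bytes can decode to different letters depending on which map is applied.

```python
# Hypothetical per-font tables mapping character codes to letters.
# These are made up for illustration and are not read from any real PDF.
FONT_A = {0x01: "e", 0x02: "i", 0x03: "t"}
FONT_B = {0x01: "i", 0x02: "e", 0x03: "t"}


def decode(codes: bytes, table: dict) -> str:
    """Decode a string of character codes using one font's table."""
    return "".join(table[code] for code in codes)


codes = bytes([0x03, 0x02, 0x01])
text_a = decode(codes, FONT_A)  # "tie"
text_b = decode(codes, FONT_B)  # "tei": same bytes, e and i swapped
```

If a tool rewrites string bytes under the wrong assumption about the map, a swap like this is one plausible outcome.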
Thanks for the reply.
Do you mind just helping me figure out why the sample file isn't working? The same issue also happens when running the python3 example.py < tests/test-ssns.pdf > document-redacted.pdf script shown in the README.
I am using:
Python 3.7.0
defusedxml==0.5.0
pdfrw==0.4
Update (after I opened the file I uploaded in Chrome): The character swapping issue is... GONE?! Below are examples of opening the same PDF file in Preview, Adobe Acrobat Reader DC, and Chrome.
Thoughts?
I had the same problem. It's related to the make_mutable_streams function and the hexadecimal text. I don't know how to solve it, though.
Found a way to fix this... the from_bytes function in PdfString has an option to force it to encode in hex. Passing 'hex' to the function seems to fix the problem.
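For context on what "encode in hex" means here: PDF string objects can be written either as literal strings in parentheses or as hexadecimal strings in angle brackets. The sketch below uses only the standard library and does not call pdfrw; the helper names to_literal_string and to_hex_string are hypothetical and only show the two forms side by side.

```python
def to_literal_string(raw: bytes) -> bytes:
    """Write bytes as a PDF literal string, e.g. (Hello)."""
    out = bytearray(b"(")
    for byte in raw:
        if byte in b"\\()":
            # Backslash and parentheses must be escaped.
            out += b"\\" + bytes([byte])
        elif 32 <= byte < 127:
            out.append(byte)
        else:
            # Non-printable bytes become three-digit octal escapes.
            out += b"\\%03o" % byte
    out += b")"
    return bytes(out)


def to_hex_string(raw: bytes) -> bytes:
    """Write bytes as a PDF hexadecimal string, e.g. <48656C6C6F>."""
    return b"<" + raw.hex().upper().encode("ascii") + b">"


literal = to_literal_string(b"a(b)")
hexed = to_hex_string(b"Hello")
```

The hex form needs no escaping at all, which may be why forcing it sidesteps some round-trip problems, though that is a guess rather than something confirmed in this thread.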
Huh well if that fixes it, and if it doesn't break anything for anyone else, let's just make that change! Would you mind opening a pull request?
I think it does break something else! It looks like it's removing some text either before or after a redacted text, I am looking into this issue.
Nope, nvm! It was a different issue: the text layer treats every line break as nonexistent, so my email regex treats the last word of the previous line as part of the email, and likewise the first word of the next line (e.g. Foo Barfoobar@gmail.com123-123-1234 when they're on separate lines in the PDF).
I fixed this issue by appending a unicode space (u' ') to the last TextToken of each PdfArray during the first step (building the text layer). It works in most cases but isn't perfect.
The 'hex' change fixes the issue across my sample of 30+ PDFs. Will create a PR.
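The line-break workaround described above can be sketched roughly as follows. This is a simplified stand-in that uses plain strings instead of the real TextToken and PdfArray objects; build_text_layer, pad_line_ends, and the email regex are assumptions for illustration, not the project's actual code.

```python
import re

# A simple email pattern, assumed for this example only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def build_text_layer(lines, pad_line_ends=False):
    """Concatenate token text; optionally append a space to each line's last token."""
    parts = []
    for tokens in lines:
        tokens = list(tokens)
        if pad_line_ends and tokens:
            tokens[-1] = tokens[-1] + " "
        parts.extend(tokens)
    return "".join(parts)


# Each inner list stands for the tokens of one line in the PDF.
lines = [["Foo ", "Bar"], ["foobar@gmail.com"], ["123-123-1234"]]

joined = build_text_layer(lines)
padded = build_text_layer(lines, pad_line_ends=True)

matches_joined = EMAIL_RE.findall(joined)  # picks up "Bar" from the previous line
matches_padded = EMAIL_RE.findall(padded)  # only the email itself
```

Without the padding, the match swallows the last word of the previous line; with it, the regex sees a word boundary at each line break.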
It looks like this fix never got merged to master? It doesn't seem to work anyway; I still have random characters being replaced.
I have noticed that for some PDFs, the redaction will swap random characters, here's an example:
Original pdf:
Phone number and email redacted:
A regex that doesn't match anything:
The content_filters successfully redacts the phone number and email; however, it's also changing all the e's to i's for some reason. I have done this with various PDFs. It looks like this swapping issue doesn't happen every time, and when it does, it isn't consistent (the e's or i's can be any letter). Any ideas? I am happy to provide more examples.