JoshData / pdf-redactor

A general purpose PDF text-layer redaction tool for Python 2/3.
Creative Commons Zero v1.0 Universal

Redaction swaps random characters #13

Open IvanIFChen opened 5 years ago

IvanIFChen commented 5 years ago

I have noticed that for some PDFs, redaction swaps seemingly random characters. Here's an example:

Original PDF:

[screenshot]

Phone number and email redacted:

[screenshot]

A regex that doesn't match anything:

[screenshot]

The content_filters successfully redact the phone number and email; however, they also change every "e" to an "i" for some reason. I have tried this on various PDFs. The swapping doesn't happen every time, and when it does happen it isn't consistent (the swapped letters are not always "e" and "i"; they can be any letters).

Any ideas? I am happy to provide more examples.

JoshData commented 5 years ago

Interesting!

PDF text handling is very complicated, and this may be a bug in how pdfrw understands the character encoding mechanism in PDFs. Or the PDF could have been generated invalidly (but I wouldn't know how to figure that out). In PDFs, the rules for mapping font glyphs (e.g. characters) to a binary representation in the file are pretty ridiculous.
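To make the point about encodings concrete, here is a toy sketch in Python. It is not pdfrw or pdf-redactor code, and both code tables are invented for illustration. It shows how text that is read with one code-to-glyph table and written back with a different one can come out with letters swapped, which is one hypothetical way the symptom above could arise.

```python
# Toy illustration only: these tables are made up and do not come from a real PDF font.
# A PDF font may define its own mapping from byte codes to glyphs.
FONT_TABLE = {0x01: "t", 0x02: "e", 0x03: "s", 0x04: "i"}

# A (hypothetical) table that a tool wrongly assumes when writing text back.
ASSUMED_TABLE = {0x01: "t", 0x02: "i", 0x03: "s", 0x04: "e"}


def decode(codes, table):
    """Turn byte codes into text using a code-to-glyph table."""
    return "".join(table[c] for c in codes)


def encode(text, table):
    """Turn text into byte codes using the inverse of a code-to-glyph table."""
    inverse = {glyph: code for code, glyph in table.items()}
    return bytes(inverse[ch] for ch in text)


original_codes = bytes([0x01, 0x02, 0x03, 0x01])   # spells "test" in FONT_TABLE
text = decode(original_codes, FONT_TABLE)          # read correctly
rewritten_codes = encode(text, ASSUMED_TABLE)      # written back with the wrong table
displayed = decode(rewritten_codes, FONT_TABLE)    # the viewer still uses the font's table

print(text, "->", displayed)
```

In this toy case the text is read as "test" but displayed as "tist" after the round trip, because the codes for "e" and "i" differ between the two tables.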

IvanIFChen commented 5 years ago

Thanks for the reply.

Would you mind helping me figure out why the sample file isn't working? The same issue also happens when I run the command shown in the README: python3 example.py < tests/test-ssns.pdf > document-redacted.pdf

document-redacted.pdf

I am using:

Python 3.7.0
defusedxml==0.5.0
pdfrw==0.4

Update (after I opened the file I uploaded in Chrome): The character swapping issue is... GONE?! Below are examples of opening the same PDF file in Preview, Adobe Acrobat Reader DC, and Chrome.

[three screenshots]

Thoughts?

maybefreedom commented 5 years ago

I had the same problem. It's related to the make_mutable_streams function and the hexadecimal text. I don't know how to solve it, though.

IvanIFChen commented 5 years ago

Found a way to fix this... the from_bytes function in PdfString has an option to force it to encode in hex. Passing 'hex' to the function seems to fix the problem.
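For context on what "encode in hex" means: PDF has two syntaxes for string objects, a literal form in parentheses and a hexadecimal form in angle brackets. The sketch below uses only the Python standard library to write the same bytes both ways. It is an illustration under that assumption, not pdfrw's implementation, and the function names to_hex_string and to_literal_string are hypothetical.

```python
import binascii


def to_hex_string(raw: bytes) -> str:
    """Write bytes as a PDF hexadecimal string: two hex digits per byte, in <...>."""
    return "<" + binascii.hexlify(raw).decode("ascii") + ">"


def to_literal_string(raw: bytes) -> str:
    """Write bytes as a PDF literal string: (...), with backslash escapes."""
    out = []
    for b in raw:
        ch = chr(b)
        if ch in "()\\":
            out.append("\\" + ch)      # escape delimiters and the backslash itself
        elif 32 <= b < 127:
            out.append(ch)             # printable ASCII is written as-is
        else:
            out.append("\\%03o" % b)   # everything else as a three-digit octal escape
    return "(" + "".join(out) + ")"


raw = b"e(\x02)"
print(to_hex_string(raw))
print(to_literal_string(raw))
```

The hex form writes every byte the same way regardless of its value, so no byte needs escaping; the literal form has to treat parentheses, backslashes, and non-printable bytes specially.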

JoshData commented 5 years ago

Huh well if that fixes it, and if it doesn't break anything for anyone else, let's just make that change! Would you mind opening a pull request?

IvanIFChen commented 5 years ago

I think it does break something else! It looks like it's removing some text either before or after the redacted text. I am looking into this.

IvanIFChen commented 5 years ago

Nope, nvm! It was a separate issue: the text layer treats every line break as nonexistent, so with an email regex the last word of the previous line is matched as part of the email, and likewise the first word of the next line (e.g. Foo Barfoobar@gmail.com123-123-1234 when they're on separate lines in the PDF).

I fixed this by appending a Unicode space (u' ') to the last TextToken of each PdfArray during the first step (building the text layer). It works in most cases but isn't perfect.

The 'hex' change fixes the issue across my sample of 30+ PDFs. Will create a PR.
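The line-break effect described above can be reproduced with plain strings. This is a minimal sketch using only Python's re module; the EMAIL pattern is a simplified, assumed regex rather than the one used in the report, and it works on a list of strings rather than on pdf-redactor's actual text tokens.

```python
import re

# Simplified email pattern, assumed for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Three visual lines from a PDF, as in the example above.
lines = ["Foo Bar", "foobar@gmail.com", "123-123-1234"]

# Text layer built with no separator between lines: neighbouring words fuse with the email.
joined_without_breaks = "".join(lines)
bad = EMAIL.search(joined_without_breaks).group()

# Text layer built with a space between lines: the match stops at the line boundaries.
joined_with_spaces = " ".join(lines)
good = EMAIL.search(joined_with_spaces).group()

print(repr(bad))
print(repr(good))
```

Without a separator the match swallows "Bar" from the previous line and the phone number from the next one; with a space inserted at each line end, only foobar@gmail.com is matched.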

mkunz7 commented 3 years ago

It looks like this fix never got merged to master? It doesn't seem to work anyway: I still have random characters being replaced.