firstlookmedia / pdf-redact-tools

a set of tools to help with securely redacting and stripping metadata from documents before publishing

Convert PDFs to black and white to remove printer dots #23

Closed gszathmari closed 7 years ago

gszathmari commented 7 years ago

Adding optional switch to convert the document to black and white to remove printer dots

gszathmari commented 7 years ago

The following two commands can help reveal the yellow, almost invisible printer dots:

convert -channel RG -fx 0 page-0.png blue.png
convert -fx b page-0.png grey.png
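To illustrate why looking at the blue channel makes yellow dots stand out, here is a minimal Python sketch (standard library only) of the same per-pixel arithmetic on hand-made RGB tuples. The pixel values and function names are made up for the example and are not part of PDF Redact Tools.

```python
# Sketch only: mimics the per-pixel effect of the two convert commands on
# plain (r, g, b) tuples. Pixel values and function names are illustrative.

def blue_only(pixel):
    """Like `-channel RG -fx 0`: zero red and green, keep blue."""
    r, g, b = pixel
    return (0, 0, b)

def blue_as_grey(pixel):
    """Like `-fx b`: use the blue value for every channel."""
    r, g, b = pixel
    return (b, b, b)

paper = (255, 255, 255)        # white paper
yellow_dot = (255, 255, 120)   # assumed pale yellow dot: high R and G, reduced B

for name, px in [("paper", paper), ("dot", yellow_dot)]:
    print(name, blue_only(px), blue_as_grey(px))
```

Yellow is high in red and green but low in blue, so a dot that is nearly invisible in the full-colour image shows up as a clearly darker spot once only the blue channel is kept.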

Before printer dot sanitisation: [images: blue, grey]

After printer dot sanitisation: [images: blue2, grey2]

Frankkkkk commented 7 years ago

I would also apply some mathematical morphology (i.e. erode then dilate) in order to remove lone black pixels that may be used to transmit information.
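A minimal sketch of that idea, in pure Python on a tiny binary image, might look like the following. The 3x3 neighbourhood, the 1 = ink convention, and all function names are assumptions chosen for illustration; a real implementation would more likely use an image library.

```python
# Sketch only: binary opening (erosion followed by dilation) with a 3x3
# square neighbourhood. 1 = black ink, 0 = white paper. Pixels outside the
# image count as paper.

def _get(img, y, x):
    if 0 <= y < len(img) and 0 <= x < len(img[0]):
        return img[y][x]
    return 0

def erode(img):
    """Keep a pixel only if its whole 3x3 neighbourhood is ink."""
    return [[int(all(_get(img, y + dy, x + dx)
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
             for x in range(len(img[0]))]
            for y in range(len(img))]

def dilate(img):
    """Set a pixel if any pixel in its 3x3 neighbourhood is ink."""
    return [[int(any(_get(img, y + dy, x + dx)
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
             for x in range(len(img[0]))]
            for y in range(len(img))]

def opening(img):
    return dilate(erode(img))

page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],   # lone pixel at row 2, column 5
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = opening(page)
for row in cleaned:
    print("".join("#" if v else "." for v in row))
```

With this neighbourhood the 3x3 block survives and the lone pixel disappears. Any stroke thinner than three pixels would be erased as well, so the neighbourhood size has to be chosen with the text in mind.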

coventry commented 7 years ago

Why not just run the document through OCR and publish that?

Frankkkkk commented 7 years ago

Wouldn't you lose graphics?

ghost commented 7 years ago

Erosion and dilation might work, but might also change the appearance of the text, depending upon the dot size. The simple black and white conversion is probably a more reliable method since it doesn't depend upon knowing the dot size.

Frankkkkk commented 7 years ago

Should we limit ourselves to yellow dots only, or also cover arbitrary pixel-encoded messages? In the latter case, erosion followed by dilation would work. Sure, it would change the appearance of the text a bit, but in my experience it's still very legible.

bill-mcgonigle commented 7 years ago

If your purpose is to proactively protect whistleblowers, you cannot assume that tracking dots will always be yellow or not added to low-significance bits of high-entropy areas. This is just a matter of printer firmware revisions, well within the means of wealthy interests. Erosion-dilatation would work but you can also apply higher order statistical models to detect steganographic information hiding. See the steg sections here (one has code): http://www.cs.dartmouth.edu/farid/#jumpTo

jbolger commented 7 years ago

I believe this pull request is confusing the purpose of PDF Redact Tools. The purpose is to sanitize PDF files so journalists can view their contents while minimizing the risk of compromise to their computer. Obfuscating the source of the PDF is outside the scope of PDF Redact Tools.

While I believe the goal this pull request tries to accomplish is very important, I feel like it is out of place in PDF Redact Tools. This commit supposedly counters one of the known ways documents may be visually tagged, but there are an infinite number of other tagging techniques which this commit will not address. PDF Redact Tools was never meant to solve this problem, and therefore this problem should be solved by a dedicated tool that was designed to address this problem. Bloating PDF Redact Tools will lower the quality of the tool and exacerbate source exposure problems like these.

PDF Redact Tools was not meant to be the final step before publishing, it was meant to be the first step before reporting.

This feature, to me, sounds like one of dozens that could potentially belong to a new tool which addresses this problem from the start. PDF Redact Tools does its job well, we shouldn't cloud its mission - and we don't want to give journalists a watered-down version of source obfuscation if a better tool can be made for it.

ajkblue commented 7 years ago

@jbolger

"PDF Redact Tools was not meant to be the final step before publishing, it was meant to be the first step before reporting."

From the Readme:

"PDF Redact Tools helps with securely redacting and stripping metadata from documents before publishing."

It seems to me that this fits into this project. Not only that, but while this may not be 100% effective or guarantee that every tracking dot is removed from the source, at least it's there as an option. If you really think that this shouldn't be added by default, then an easier way to include it as a useful feature could be to add a new command-line flag to enable it, e.g. --remove-dots or something along those lines.

I see your point and where you're coming from. Nothing is perfect, but this is at least a start. Nothing like this seems to exist on GitHub, and so I believe that this is a good feature addition to PDF Redact Tools. Unless a new tool is going to be started that implements this feature, adding it to PDF Redact Tools at least gives journalists the option of using it. Again, maybe not as the default if that might create a false sense of security, but at least it's there.

micahflee commented 7 years ago

This is an interesting pull request, thanks for submitting it!

I agree with @jbolger that there are infinite ways in which printers can hide metadata in files they print (or for people to hide any data within arbitrary images), and that PDF Redact Tools can't hope to -- and shouldn't try to -- prevent all of them.

However, I think it's reasonable to specifically protect against printer dots because they're so ubiquitous, and likely on every piece of paper that has something printed on it. According to EFF's (no-longer-updated) list of printers that include tracking dots, "Some of the documents that we previously received through FOIA suggested that all major manufacturers of color laser printers entered a secret agreement with governments to ensure that the output of those printers is forensically traceable."

I've tested and confirmed that this does seem to work and remove printer dots from a scanned document that had them included. It also resulted in a much smaller filesize, which is nice.

However, unfortunately it reduces the quality of the resulting PDF by a lot (probably because it's using threshold to convert to black and white, rather than just grayscale, which wouldn't remove the printer dots). And, of course, it loses color, which might be important to some docs.
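The difference between the two conversions can be sketched on single pixels in pure Python. The luma weights below are the common Rec. 601 ones; the 50% cut-off and the sample pixel values are assumptions for illustration and are not taken from the pull request.

```python
# Sketch only: compares greyscale conversion with a hard threshold on single
# RGB pixels. Cut-off and sample values are assumed, not from the PR.

def to_grey(pixel):
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def to_black_and_white(pixel, cutoff=128):
    return 255 if to_grey(pixel) >= cutoff else 0

samples = {
    "paper": (255, 255, 255),
    "yellow dot": (255, 255, 120),
    "text edge": (110, 110, 110),   # anti-aliased edge of a glyph
    "text": (20, 20, 20),
}
for name, px in samples.items():
    print(f"{name:10} grey={to_grey(px):3} bw={to_black_and_white(px):3}")
```

In greyscale the dot stays slightly darker than the paper, so it is still recoverable. The threshold pushes it to pure white, but it also forces every soft glyph edge to either black or white, which is where the quality loss comes from.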

So while it's imperfect, I'm into merging this. I'm also open to merging a new PR, if anyone can come up with a way to remove printer dots without reducing so much quality, and ideally even without removing color.

timojuez commented 7 years ago

Hi folks,

From ongoing research I can tell you that all the printers that produce yellow dots use colours between the HSV values (28,10,214) and (48,67,255). A picture converted to black and white may still contain the dots, depending on the algorithm that maps HSV to black and white. I would suggest replacing spots in that HSV range with the paper's average white.
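A rough sketch of that suggestion in pure Python could look like this. The comment does not say which scale the HSV bounds use, so the code assumes all three components are on a 0-255 scale; the definition of the paper's "average white" and every function name are also assumptions made for the example.

```python
# Sketch only: replace pixels whose HSV value falls inside a range with a
# "paper white" colour. ASSUMPTION: the quoted bounds are on a 0-255 scale
# for H, S and V; the original comment does not state the scale.
import colorsys

LOWER = (28, 10, 214)
UPPER = (48, 67, 255)

def hsv255(pixel):
    """Convert an 8-bit (r, g, b) tuple to (h, s, v), each scaled to 0-255."""
    r, g, b = (c / 255.0 for c in pixel)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return (h * 255, s * 255, v * 255)

def in_range(pixel, lower=LOWER, upper=UPPER):
    return all(lo <= c <= hi
               for lo, c, hi in zip(lower, hsv255(pixel), upper))

def average_white(pixels):
    """Mean colour of bright, nearly unsaturated pixels (a hypothetical
    definition of the paper's average white)."""
    paper = [p for p in pixels
             if hsv255(p)[1] < LOWER[1] and hsv255(p)[2] >= LOWER[2]]
    if not paper:
        return (255, 255, 255)
    n = len(paper)
    return tuple(round(sum(p[i] for p in paper) / n) for i in range(3))

def remove_dots(pixels):
    white = average_white(pixels)
    return [white if in_range(p) else p for p in pixels]

# One row of pixels: paper, an assumed pale yellow dot, paper, dark text.
row = [(250, 250, 248), (255, 255, 215), (252, 252, 250), (30, 30, 30)]
print(remove_dots(row))
```

Unlike a global threshold, this only touches pixels inside the given range, so text and colour elsewhere on the page are left as they are. If the bounds turn out to be on a different scale, only `hsv255` would need to change.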