Evaluate Dangerzone's Potential as a Redaction Tool (and add redaction capabilities)

deeplow commented 3 months ago

Dangerzone's goal is protecting the user against malware. However, through the way it works, it also removes metadata. So it can also help with publication security.

The problem

Typical PDF manipulation tools have poorly implemented redaction methods that can be reversed. Because Dangerzone already rasterizes documents, it has nothing to lose. When a black box is applied and the page is then rasterized, the pixels under the box no longer carry any information in the final output.

This is best put in the paper Story Beyond the Eye: Glyph Positions Break PDF Text Redaction (emphasis added):

Rasterization appears to be an effective defense against deredaction. In many cases this defense is infeasible because it removes searchable text data from the document, however, performing OCR on the document post-redaction can act as a stop-gap for this issue. Rasterization algorithms may also modify or ignore certain glyph shifts, requiring the analyst to perform more reverse engineering to identify the specific rasterization tool used.

We're working on turning Dangerzone into a file viewer, and that could be the perfect chance to add redaction tools.

User Story

As a journalist, I'd like to use Dangerzone to help redact documents, ensuring that redactions cannot be reversed.

How could this work?

User journey:

In view mode, the user draws black boxes over the areas to be redacted
After all redactions are done, the user saves the final document

Technical explanation: the host receives all the rasterized images. As the user adds a black box to a page, the host paints that box onto the corresponding image with the help of an image manipulation module (like Pillow). If we want extra rasterization assurances, we can convert the final PDF through Dangerzone one more time to ensure proper rasterization.
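
To make the idea concrete, here is a rough sketch of what the host-side step could look like with Pillow. This is not existing Dangerzone code: the function names `apply_redactions` and `save_redacted_pdf` and the `(left, top, right, bottom)` pixel-box format are made up for illustration, and it assumes the pages are already available as Pillow images.

```python
from PIL import Image, ImageDraw


def apply_redactions(page, boxes):
    """Return a copy of `page` with every box painted solid black.

    `boxes` is a list of (left, top, right, bottom) pixel coordinates
    (a hypothetical format chosen for this sketch).
    """
    # convert() returns a new image, so the original page is left untouched.
    redacted = page.convert("RGB")
    draw = ImageDraw.Draw(redacted)
    for box in boxes:
        # Overwrites the pixel values; nothing of the original remains
        # inside the box.
        draw.rectangle(box, fill=(0, 0, 0))
    return redacted


def save_redacted_pdf(pages, boxes_per_page, out_path):
    """Redact each page and write all pages to a single image-only PDF."""
    if not pages:
        raise ValueError("no pages to save")
    redacted = [apply_redactions(p, b) for p, b in zip(pages, boxes_per_page)]
    redacted[0].save(out_path, "PDF", save_all=True, append_images=redacted[1:])
```

Because the boxes are written into the pixel data itself rather than layered on top as an annotation, there is no hidden text or vector object left underneath to recover. The output of `save_redacted_pdf` could then be fed through a regular Dangerzone conversion for the extra assurance mentioned above.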

Implementation Risks and Unmitigated Risks

We should keep in mind that redaction alone may not be enough to eliminate all unredaction risks. The best advice is never to publish source documents and, if needed, to retype them. I can think of several other ways that redaction could still be bypassed:

invisible watermarks: if the purpose is to identify the leaker, then printer dots, space-width variations, etc. could all be used. No redaction can save this form of identification. Only document retyping can potentially help there.
character width can be used to reverse redactions (related paper)
compression artifacts can leave traces of what was hidden. In pre-compressed artifacts like images we cannot help much, as the whole element has to be redacted. However, dangerzone also compresses documents. We could make sure to only do this in the final rasterization (i.e. the one with the redaction boxes).

deeplow commented 3 months ago

If the previewer ends up using PDFs rather than images, we can apparently use fitz for that (linked issue would not affect us if the doc was already rasterized once).

DeltaEpsilon19498 commented 2 months ago

Could dangerzone convert the text in the pdf to a .txt file which the journalist could redact manually? Things like black boxes still give away the length of the word being redacted. Then, could a tool be used to convert the redacted text into a pdf document with a template that could be standardized across the industry as a "redacted anti-watermark whistleblowing" template? That way, all watermarks could be removed, except if the corporation or government modifies the text itself a little bit depending on which authorized user is reading it.

With corporations already putting invisible watermarks or whatever into their emails, the above idea could help protect sources. One issue is that with the document modified so much, the corporation or government could deny that it is a legitimate document and claim that it is faked. A second issue is, as mentioned, that they could adapt by modifying the words used in the document depending on the authorized viewer. Third, the leaked material might have important images or diagrams that need to be part of the document but which contain undetectable watermarks. And a fourth issue is that readers / viewers of the general public may be too ignorant to understand why these sorts of measures are necessary, causing them to doubt the authenticity of the document or be manipulated by propaganda. So idk. A tool like the one I presented in the first paragraph might still be useful though.

apyrgio commented 2 months ago

To the above reservations, I'd add the fact that some documents may have two columns of text, pictures, or formatting elements like tables. If it's a solution that works for 90% of the documents, then we will add some extra mental load to a journalist that is already pressed (given that they are handling a very sensitive document).

Still, allowing users to get back just the text of the document, and then post-process it in any way they like, could be a nice fit for a Dangerzone plugin system. I think we had an issue for this, but I can't find it right now.

freedomofpress / dangerzone