JakubMelka / PDF4QT

Open source PDF editor.
https://jakubmelka.github.io/
GNU Lesser General Public License v3.0

Redaction plugin removes whole PDF text layer #47

Open ondrejvrabel opened 1 year ago

ondrejvrabel commented 1 year ago

Hi, when testing the redaction plugin, I noticed that the PDF's text layer is not edited/redacted but completely removed. Would it be possible to keep the text layer in the parts of the document where no redactions were made? Thank you.

JakubMelka commented 1 year ago

Hello,

I will explain how the algorithm works. The redaction engine draws each page as vector graphics into a new PDF, and the redacted areas are removed from these vector graphics. This is the safest approach; however, it removes the text layer, and only shapes (vector graphics) remain.
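To make the description above concrete, here is a minimal Python sketch of that kind of pipeline. It is not PDF4QT's actual code (the project is written in C++); `Rect`, `Shape`, and `redact_page` are hypothetical names, and dropping whole shapes by bounding box is a simplifying assumption standing in for cutting the redacted area out of the page graphics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Rect") -> bool:
        # Axis-aligned overlap test; boxes that only touch do not overlap.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class Shape:
    # Hypothetical stand-in for one flattened drawing item (e.g. a glyph outline).
    label: str
    bbox: Rect


def redact_page(shapes, redactions):
    """Return the shapes that survive; anything overlapping a redaction is dropped."""
    return [s for s in shapes
            if not any(s.bbox.intersects(r) for r in redactions)]


# Made-up page content: two glyph outlines and a horizontal rule.
shapes = [
    Shape("glyph 'A'", Rect(0, 0, 10, 10)),
    Shape("glyph 'B'", Rect(12, 0, 22, 10)),
    Shape("rule", Rect(0, 20, 100, 21)),
]
kept = redact_page(shapes, [Rect(11, -1, 23, 11)])
print([s.label for s in kept])
```

In this toy model the surviving items are plain shapes with no character codes attached, which is why a viewer can no longer select or copy them as text.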

If I understand correctly, you would like to keep the text layer where no redactions were made?

Would it be OK to do it at the page level (so pages with no redacted text remain unchanged), or would you like it at the content level (so only the actually redacted page area is affected)? The latter will be time-consuming.
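The two granularities in the question above can be sketched roughly as follows. This is an illustrative Python sketch, not PDF4QT's implementation; `pages_to_flatten` and `runs_to_keep` are hypothetical helper names, text runs are modeled as simple (text, bounding box) pairs, and the sample data is made up.

```python
def overlaps(a, b):
    # Boxes are (x0, y0, x1, y1) tuples; boxes that only touch do not overlap.
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def pages_to_flatten(redactions_by_page, page_count):
    """Page level: only pages carrying at least one redaction are rewritten."""
    return [p for p in range(page_count) if redactions_by_page.get(p)]


def runs_to_keep(text_runs, redactions):
    """Content level: keep every text run that no redaction box overlaps."""
    return [text for text, box in text_runs
            if not any(overlaps(box, r) for r in redactions)]


# One redaction box on page index 1 of a three-page document.
redactions_by_page = {1: [(50, 100, 200, 115)]}
flatten = pages_to_flatten(redactions_by_page, page_count=3)
print(flatten)

page1_runs = [
    ("Contract no. 12/2023", (50, 50, 220, 65)),
    ("Jane Doe, born 1980", (50, 100, 200, 115)),
    ("Price: 1 000 EUR", (50, 150, 180, 165)),
]
kept_runs = runs_to_keep(page1_runs, redactions_by_page[1])
print(kept_runs)
```

With page-level handling, pages 0 and 2 would stay untouched and only page 1 would lose its text layer. With content-level handling, two of the three runs on page 1 would also stay as text. A real implementation would additionally have to split runs that are only partly covered, which is part of what makes the content-level variant more work.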

ondrejvrabel commented 1 year ago

I'd like it at the content level (only the actually redacted page area is affected). I understand that this is more time-consuming and quite complex; however, the main reason is accessibility and machine-readability of the documents.

We'd like to use it when publishing contracts and other legal documents from a small town hall in the Central Register of Contracts (crz.gov.sk), as there have previously been dozens of documents that were redacted incorrectly or not redacted at all. Employees redact documents by using paper overlays and scanning the pages, which is time-consuming, bad for the environment, and basically stupid. Their "solution" is awful, and I want to teach them how to redact documents with software. The current feature set is enough for us to redact documents legally, but it would be nice to keep the text layer for accessibility and for copying contract text.

Thank you for your work!

alcir commented 12 months ago

I second what @ondrejvrabel is saying. It would be nice to redact some text while keeping the non-redacted text as text (I mean, the non-redacted text should still be selectable, that is, it can be copied or highlighted in a PDF viewer).

AlisterH commented 1 month ago

You could even keep redacted text as "text" by replacing it with "Full block" characters (█). I'm not sure whether any tools do this, but there is one that helps people do it manually: https://github.com/devkev/redact-magic-paste
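As a rough illustration of the suggestion above, the following Python sketch masks chosen character ranges of a string with U+2588 FULL BLOCK. It works on plain strings only and does not touch PDF content streams; `mask_ranges` is a hypothetical helper, and the sample line is made up.

```python
FULL_BLOCK = "\u2588"  # U+2588 FULL BLOCK


def mask_ranges(text, ranges):
    """Replace characters in each [start, end) range with FULL BLOCK, keeping whitespace."""
    chars = list(text)
    for start, end in ranges:
        # Clamp the range so out-of-bounds indices are ignored rather than raising.
        for i in range(max(start, 0), min(end, len(chars))):
            if not chars[i].isspace():
                chars[i] = FULL_BLOCK
    return "".join(chars)


line = "Seller: Jane Doe, Main Street 1"
start = line.index("Jane Doe")
masked = mask_ranges(line, [(start, start + len("Jane Doe"))])
print(masked)
```

Keeping the string length and the whitespace preserves the layout, but it also reveals how long each redacted word was. Whether that is acceptable depends on the document; replacing each range with a fixed number of blocks would avoid the leak at the cost of shifting the layout.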