coherentgraphics / cpdf-binaries

PDF Command Line Tools binaries for Linux, Mac, Windows
GNU Affero General Public License v3.0

Crop pdf and text content using CPDF. #55

Closed by amine-aboufirass 2 years ago

amine-aboufirass commented 3 years ago

Hello,

I would like to use cpdf to crop a PDF and make sure that the text content is also cropped.

I have tried multiple combinations of -cropbox, -mediabox and hardbox-related commands, alternating with some PyPDF2 code that opens the PDF stream and extracts the text content. Unfortunately, no matter which type of box I use, the extracted text always corresponds to the contents of the original uncropped file. In contrast, when I open the resulting file in a viewer, it is indeed cropped. I am guessing cpdf crops the page but preserves the original data.

How can I use cpdf to crop the PDF and keep only the text content that falls within the cropped region? Thanks for your consideration.

johnwhitington commented 3 years ago

Cpdf has no facilities for this. Cropping just changes the box. Hardbox just clips to a box.
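To illustrate why text extraction still returns everything, here is a toy model in Python. It is not cpdf's code: the `Page` class and the `crop` and `hard_clip` helpers are hypothetical names, and the assumption that clipping is done by prepending a clip path (`re W n`) to the content stream is only a sketch of the general PDF mechanism.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (llx, lly, urx, ury) in PDF points


@dataclass(frozen=True)
class Page:
    """Toy stand-in for a PDF page: boxes plus a raw content stream."""
    media_box: Box
    content: bytes
    crop_box: Optional[Box] = None


def crop(page: Page, box: Box) -> Page:
    # Cropping only records a new box; the content stream is untouched.
    return replace(page, crop_box=box)


def hard_clip(page: Page, box: Box) -> Page:
    # Assumed mechanism: prepend a rectangular clip path. The text operators
    # that follow stay in the stream; they are just not painted outside the box.
    llx, lly, urx, ury = box
    clip = f"{llx:g} {lly:g} {urx - llx:g} {ury - lly:g} re W n\n".encode()
    return replace(page, content=clip + page.content)


page = Page(
    media_box=(0, 0, 612, 792),
    content=b"BT /F1 12 Tf 72 720 Td (outside the crop) Tj ET\n",
)
cropped = crop(page, (0, 0, 300, 300))
clipped = hard_clip(page, (0, 0, 300, 300))
```

In both results the `(outside the crop)` string is still present in `content`, which is what a text extractor reads, even though a viewer would no longer show it.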

What you are looking for is called "redaction", so searching for "PDF redaction" should find you a product. I don't know of any good free redaction software, but maybe there is some.

amine-aboufirass commented 3 years ago

@johnwhitington Thanks for your response. I see what you're saying: cpdf is not the right tool for what I would like to do.

I've done a bit of research on redaction tools, mainly command-line and Python-based ones. It seems impossibly difficult to actually modify a PDF in this way. Am I right in saying that only proprietary tools such as Adobe's are able to do this sort of thing?

johnwhitington commented 3 years ago

I don't know of a good open source tool for this. Redacting based on a crop box is harder than search-and-replace redacting, because you need to calculate the position of every piece of text on the page to see if it needs to be redacted.
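As a rough sketch of the geometric test involved, the Python below partitions text runs by whether their bounding boxes overlap the crop box. It assumes the hard part is already done: `TextRun` is a hypothetical record whose position and size would have to come from interpreting the content stream (text matrices, font metrics), which this example does not attempt.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (llx, lly, urx, ury) in PDF points


@dataclass
class TextRun:
    """Hypothetical record for one piece of positioned text."""
    text: str
    x: float       # left edge
    y: float       # bottom edge
    width: float
    height: float

    @property
    def bbox(self) -> Box:
        return (self.x, self.y, self.x + self.width, self.y + self.height)


def overlaps(a: Box, b: Box) -> bool:
    # Two axis-aligned rectangles overlap if they overlap on both axes.
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def partition(runs: List[TextRun], crop_box: Box) -> Tuple[List[TextRun], List[TextRun]]:
    # Keep any run that touches the crop box; everything else is to be redacted.
    kept = [r for r in runs if overlaps(r.bbox, crop_box)]
    redacted = [r for r in runs if not overlaps(r.bbox, crop_box)]
    return kept, redacted


crop_box = (0, 0, 300, 300)
runs = [
    TextRun("Footer", 72, 40, 60, 12),       # fully inside the crop box
    TextRun("Title", 72, 720, 100, 14),      # fully outside
    TextRun("Wide line", 250, 100, 200, 12), # straddles the right edge
]
kept, redacted = partition(runs, crop_box)
```

Here a run that straddles the edge is kept whole; a real redaction tool would also have to decide whether to split such a run at the boundary.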