
pdfCropMargins -- a program to crop the margins of PDF files

How to remove headers and footers permanently? #52


Shohreh commented 1 year ago

Hello,

I don't know much about PDFs, and I am confused about the various boxes (MediaBox, CropBox, etc.) and the units used for the boxes and by pdfCropMargins (points vs. percent).

What would be the right way to permanently remove the headers and footers on most pages of a PDF (not just for viewing: the data must no longer be present in the output file), while leaving some pages untouched (e.g. the first page of each chapter)?

Thank you.


abarker commented 1 year ago

I'm hesitant to suggest a way to permanently remove margin content, because people who want to use it for redaction may be surprised if the data turns out to still be recoverable. You mentioned mutool in your other issue, but I'm not certain how secure that removal is or exactly how it is implemented at the PDF level. It may be implemented the same way as pdfCropMargins, just modifying the box data without otherwise changing the underlying PDF.
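
For illustration, here is a minimal PyMuPDF (fitz) sketch of what a box-only crop does: it shrinks the CropBox that viewers display, but the header and footer text objects remain stored in the saved file (the file names and margin value are made up):

```python
# Minimal PyMuPDF sketch: a box-only crop changes what viewers display,
# but the content outside the new box is hidden, not removed.
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")               # hypothetical input file
for page in doc:
    r = page.mediabox                      # full page area, in points
    margin = 50                            # trim 50 pt from each side
    page.set_cropbox(fitz.Rect(r.x0 + margin, r.y0 + margin,
                               r.x1 - margin, r.y1 - margin))
doc.save("cropped.pdf")                    # header/footer text is still stored in this file
```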

abarker commented 1 year ago

Points are the standard unit in PDF files: 1 point = 1/72 inch. The percentage values are taken as a percentage of the existing margins; for example, if the existing margin is 100 points, then 50% would reduce it to 50 points.
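
To make that concrete, here is a hedged example using the `crop()` Python entry point shown in the pdfCropMargins README (the file names are made up); `-p` is the percentage of each existing margin to retain, while absolute options such as `-a` take values in points:

```python
# Hedged example of the percentage units, using the crop() entry point
# from the pdfCropMargins README (file names are hypothetical).
from pdfCropMargins import crop

# Retain 50% of each existing margin, so a 100 pt margin becomes 50 pt.
crop(["-p", "50", "-o", "half_margins.pdf", "input.pdf"])
```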

Shohreh commented 1 year ago

Thanks. I'll keep looking for a way to remove the content I need permanently removed, either by changing the MediaBox or by using redaction annotations.

DestoGit commented 1 year ago

Has a solution been implemented for this feature? It is badly needed. The current workaround I use is saving the PDFs as image-only, then performing OCR and saving them again with ABBYY. Is there a way to do it without re-OCRing, and in batch over multiple PDFs at once?

Could this be used to auto-detect headers and footers and use them as a reference for cropping? pdf header and footer detector

pdfminer, Apache Tika

grobid

Excluding the Header and Footer Contents of a page of a PDF file while extracting text?

[D] Data cleaning techniques for PDF documents with semantically meaningful parts

Perhaps these also give some ideas: How to extract and structure text from PDF files with Python and machine learning

Convert PDFs to Audiobooks with Machine Learning

How to convert PDFs to audiobooks with machine learning

pdf2audiobook

Shohreh commented 1 year ago

The workaround I found is 1) finding the coordinates with SumatraPDF (hit the "m" key to see the coordinates), and 2) running a Python script to add and apply redaction annotations.
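
For anyone wanting to do the same, here is a rough sketch of such a script using PyMuPDF; the rectangles and file names are placeholders for the coordinates read off in SumatraPDF, and the skip set shows how some pages (e.g. chapter title pages) can be left untouched:

```python
# Rough sketch of the redaction workaround with PyMuPDF (fitz).
# The rectangles are placeholders for coordinates (in points) measured
# with SumatraPDF; file names are also placeholders.
import fitz  # PyMuPDF

HEADER = fitz.Rect(0, 0, 612, 60)        # top strip of a US-letter page
FOOTER = fitz.Rect(0, 732, 612, 792)     # bottom strip
skip_pages = {0}                         # page indices to leave untouched

doc = fitz.open("input.pdf")
for i, page in enumerate(doc):
    if i in skip_pages:
        continue
    page.add_redact_annot(HEADER)
    page.add_redact_annot(FOOTER)
    page.apply_redactions()              # actually deletes the covered content
doc.save("no_headers_footers.pdf")
```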

abarker commented 1 year ago

All the current processing of PDF files is done with the PyMuPDF library. If there is a way to do this with that library, then I would consider adding an option.

I'm not entirely clear on your exact use case. You want to remove the actual PDF content that is rendered outside a selected box, without turning the document into a rendered-image or scanned-style document? Does this need to be secure data destruction, such as for legal documents?

DestoGit commented 1 year ago

> All the current processing of PDF files is done with the PyMuPDF library. If there is a way to do this with that library, then I would consider adding an option.
>
> I'm not entirely clear on your exact use case. You want to remove the actual PDF content that is rendered outside a selected box, without turning the document into a rendered-image or scanned-style document? Does this need to be secure data destruction, such as for legal documents?

Thanks for the reply, and sorry for the late response.

The use case is processing many different books, articles, plays, etc., with great variation in layout and in the locations of headers and footers.

Ideally, this would be a batch process like the following: on a folder with, say, 1000 PDFs,

  1. Auto-detect the main body text block vs. the header and footer text blocks.
  2. Auto-crop to the main body text block only.
  3. Save the PDFs with the body only (no header and footer sublayer), with the OCR content intact but trimmed of the header and footer OCR blocks.

The end use would be to then process the text to speech or port it to an audio format. No secure data destruction is needed, just removing the header and footer text blocks so they do not appear in the output of the end-use process (a rough sketch of this follows below).
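
A rough sketch of steps 1-3 with PyMuPDF, under a deliberately simple assumption (made up for illustration) that any text block lying entirely in the top or bottom 8% of the page is a header or footer; the folder names are placeholders and real layouts will need tuning:

```python
# Rough PyMuPDF sketch of the batch idea above.  Heuristic assumption:
# a text block entirely inside the top or bottom 8% of the page is a
# header or footer.  Folder names are placeholders.
import pathlib
import fitz  # PyMuPDF

BAND = 0.08  # fraction of the page height treated as header/footer zone

def strip_headers_footers(src, dst):
    doc = fitz.open(str(src))
    for page in doc:
        height = page.rect.height
        top_limit, bottom_limit = BAND * height, (1 - BAND) * height
        for x0, y0, x1, y1, text, block_no, block_type in page.get_text("blocks"):
            if block_type != 0:          # 0 = text block, 1 = image block
                continue
            if y1 <= top_limit or y0 >= bottom_limit:
                page.add_redact_annot(fitz.Rect(x0, y0, x1, y1))
        page.apply_redactions()          # removes the covered text, keeps the body
    doc.save(str(dst))

out_dir = pathlib.Path("out_folder")
out_dir.mkdir(exist_ok=True)
for pdf in pathlib.Path("in_folder").glob("*.pdf"):
    strip_headers_footers(pdf, out_dir / pdf.name)
```

The output files could then be run through pdfCropMargins to tighten the visible margins around the remaining body text, with the body's existing OCR text layer left intact.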

My problem doing it with ABBYY is:

  1. The cropping needs a manual click and drag to select the body dimensions.
  2. Once cropped, the PDF output needs to be saved as an image, or else the header and footer text sublayer is still there during the end-use processing.
  3. Once saved as an image, the PDF needs to be re-OCRed, which takes time and is less accurate if the PDF was not a scanned one.
  4. Once re-OCRed, the output PDF needs to be saved as a searchable PDF.

Thanks again for your suggestions!