abarker / pdfCropMargins

pdfCropMargins -- a program to crop the margins of PDF files

Using -gs option for the pdf converted from word (only text) #24

Closed sant527 closed 4 years ago

sant527 commented 4 years ago

I want to know whether -gs (using Ghostscript to compute the bounding box) will crop the same as without this option if my document is not scanned, but is a text-only Word document (no images) converted to PDF.

sant527 commented 4 years ago

I found that with -gs some PDFs are not cropped at all, while others are cropped perfectly. Since pdftoppm uses a lot of space in tmp for long documents, the -gs option is really good.

sant527 commented 4 years ago

I am not able to crop this file, ADI_41_43.pdf, using -gs. The resulting output file is ADI_41_43_cropped.pdf.

$ pdf-crop-margins -v -gs -p4 100 0 100 100 ADI_41_43.pdf

Processing the PDF with pdfCropMargins (version 0.2.6)...
System type: Linux

The input document's filename is:
    ADI_41_43.pdf

Using the default-generated output filename.

The output document's filename will be:
    ADI_41_43_cropped.pdf

The absolute pre-crops to be applied to each margin, in units of bp, are:
    [0.0, 0.0, 0.0, 0.0]

The percentages of margins to retain are:
    [100.0, 0.0, 100.0, 100.0]

The absolute offsets to be applied to each margin, in units of bp, are:
    [0.0, 0.0, 0.0, 0.0]

The uniform order statistics to apply to each margin, in units of bp, are:
    []

For the full page size, using values from the PDF box
specified by the intersection of these boxes: ['c']

Found Ghostscript program at: gs

The input document has 3 pages.

The document's metadata, if set:

   The Author attribute set in the input document is:
      None
   The Creator attribute set in the input document is:
      None
   The Producer attribute set in the input document is:
      PyPDF2
   The Subject attribute set in the input document is:
      None
   The Title attribute set in the input document is:
      None

All the pages of the document will be cropped.

Original full page sizes, in PDF format (lbrt):
    1   rot = 0      RectangleObject([0, 0, 432, 1080])
    2   rot = 0      RectangleObject([0, 0, 432, 1080])
    3   rot = 0      RectangleObject([0, 0, 432, 1080])

Copied these items from the document catalog:
   /Type
Skipped copy of these items from the document catalog:
   /Pages

The document was not previously cropped by pdfCropMargins.

Writing out the PDF with the CropBox and MediaBox redefined.

Using Ghostscript to calculate the bounding boxes.

The bounding boxes are:
     1   [0.0, 0.96, 432.000016, 1072.800041]
     2   [0.0, 0.96, 432.000016, 1072.800041]
     3   [0.0, 0.96, 432.000016, 1072.800041]

New full page sizes after cropping, in PDF format (lbrt):
    1    RectangleObject([0, 0.96, 432, 1080])
    2    RectangleObject([0, 0.96, 432, 1080])
    3    RectangleObject([0, 0.96, 432, 1080])

Writing the cropped PDF file.

Finished this run of pdfCropMargins.

By contrast, I have another PDF file, MAD2_28_31.pdf, which is cropped as expected by the same command:

$ pdf-crop-margins -v -gs -p4 100 0 100 100 MAD2_28_31.pdf 

Processing the PDF with pdfCropMargins (version 0.2.6)...
System type: Linux

The input document's filename is:
    MAD2_28_31.pdf

Using the default-generated output filename.

The output document's filename will be:
    MAD2_28_31_cropped.pdf

The absolute pre-crops to be applied to each margin, in units of bp, are:
    [0.0, 0.0, 0.0, 0.0]

The percentages of margins to retain are:
    [100.0, 0.0, 100.0, 100.0]

The absolute offsets to be applied to each margin, in units of bp, are:
    [0.0, 0.0, 0.0, 0.0]

The uniform order statistics to apply to each margin, in units of bp, are:
    []

For the full page size, using values from the PDF box
specified by the intersection of these boxes: ['c']

Found Ghostscript program at: gs

The input document has 4 pages.

The document's metadata, if set:

   The Author attribute set in the input document is:
      None
   The Creator attribute set in the input document is:
      None
   The Producer attribute set in the input document is:
      PyPDF2
   The Subject attribute set in the input document is:
      None
   The Title attribute set in the input document is:
      None

All the pages of the document will be cropped.

Original full page sizes, in PDF format (lbrt):
    1   rot = 0      RectangleObject([0, 0, 432, 1584])
    2   rot = 0      RectangleObject([0, 0, 432, 1584])
    3   rot = 0      RectangleObject([0, 0, 432, 1584])
    4   rot = 0      RectangleObject([0, 0, 432, 1584])

Copied these items from the document catalog:
   /Type
Skipped copy of these items from the document catalog:
   /Pages

The document was not previously cropped by pdfCropMargins.

Writing out the PDF with the CropBox and MediaBox redefined.

Using Ghostscript to calculate the bounding boxes.

The bounding boxes are:
     1   [0.0, 1393.920053, 432.000016, 1576.80006]
     2   [0.0, 1375.200052, 432.000016, 1576.80006]
     3   [0.0, 1375.200052, 432.000016, 1576.80006]
     4   [0.0, 1375.200052, 432.000016, 1576.80006]

New full page sizes after cropping, in PDF format (lbrt):
    1    RectangleObject([0, 1393.92005, 432, 1584])
    2    RectangleObject([0, 1375.20005, 432, 1584])
    3    RectangleObject([0, 1375.20005, 432, 1584])
    4    RectangleObject([0, 1375.20005, 432, 1584])

Writing the cropped PDF file.

Finished this run of pdfCropMargins.

The original file MAD2_28_31.pdf

The cropped file MAD2_28_31_cropped.pdf

Why is it cropping one and not the other? Both files were made the same way, by converting a Word document to PDF.

abarker commented 4 years ago

The default is to use pdftoppm to render the pages to .ppm files and then compute the crops from those images. The -gsr option works just the same way, except that it uses Ghostscript to render the document to .ppm files rather than using pdftoppm. The --gsBbox option is equivalent to the -gs option and does not directly render to .ppm files at all. It calls Ghostscript to compute the bounding boxes directly and return the results (and does not work on scanned documents).

I'm not sure why some files from the same source would work with -gs and some would not. I'll look into it.
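For readers who want to inspect what Ghostscript reports on its own, here is a minimal sketch that runs Ghostscript's `bbox` device and parses the `%%HiResBoundingBox` lines it prints to stderr. The function names are hypothetical and this is not taken from the pdfCropMargins source; it only assumes a `gs` executable is on the PATH, as in the logs above.

```python
import subprocess

def parse_hires_bboxes(gs_output):
    """Return one [left, bottom, right, top] list (in bp) per page.

    Expects lines such as:
    %%HiResBoundingBox: 0.000000 0.960000 432.000016 1072.800041
    """
    boxes = []
    for line in gs_output.splitlines():
        if line.startswith("%%HiResBoundingBox:"):
            values = line.split(":", 1)[1].split()
            boxes.append([float(v) for v in values[:4]])
    return boxes

def ghostscript_bboxes(pdf_path, gs_executable="gs"):
    """Run the Ghostscript bbox device on a PDF and return per-page boxes."""
    result = subprocess.run(
        [gs_executable, "-q", "-dBATCH", "-dNOPAUSE", "-sDEVICE=bbox", pdf_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    # The bbox device writes its report to stderr, not stdout.
    return parse_hires_bboxes(result.stderr)
```

Calling `ghostscript_bboxes("ADI_41_43.pdf")` on a problem file should make it easy to compare the boxes Ghostscript reports with the ones shown in the verbose output.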

sant527 commented 4 years ago

Thank you. Since mine is not a scanned document, I prefer to use -gs, as it requires less space in tmp and also less time. Kindly have a look at the files.

abarker commented 4 years ago

It's difficult to determine exactly what's happening, since with -gs Ghostscript is essentially being used as a black box to compute the bounding boxes. I don't know the internals of its algorithm. Ghostscript is apparently detecting some kind of PDF object near the bottoms of pages in the documents that aren't cropping correctly with -gs. This object isn't affecting the rendered image versions, though. I noticed that when I do a pre-crop of 6bp on the bottom of the document it crops as expected: pdf-crop-margins -v -gs -p4 100 0 100 100 -ap4 0 6 0 0 ADI_41_43.pdf.

You're also using a fairly old version of pdfCropMargins, but that doesn't seem to be causing this issue.
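The numbers in the two verbose logs are consistent with a simple interpolation between the bounding-box edge and the page edge, weighted by the percentage of margin to retain. The sketch below is inferred from that output only, not from the pdfCropMargins source, and the function names are hypothetical. It shows why ADI_41_43.pdf barely changes: Ghostscript reports a bottom edge of 0.96 bp, so retaining 0% of the bottom margin removes only 0.96 bp.

```python
def crop_edge(page_edge, bbox_edge, percent_retain):
    """Interpolate between the bounding-box edge (0%) and the page edge (100%)."""
    f = percent_retain / 100.0
    return page_edge * f + bbox_edge * (1.0 - f)

def cropped_box(page_box, bbox, percent_retain):
    """All arguments are [left, bottom, right, top] lists, matching the logs."""
    return [crop_edge(p, b, r) for p, b, r in zip(page_box, bbox, percent_retain)]

retain = [100.0, 0.0, 100.0, 100.0]  # from "-p4 100 0 100 100"

# Page 1 of ADI_41_43.pdf: Ghostscript sees something 0.96 bp above the bottom.
adi = cropped_box([0, 0, 432, 1080], [0.0, 0.96, 432.000016, 1072.800041], retain)
print(adi)   # [0.0, 0.96, 432.0, 1080.0]

# Page 1 of MAD2_28_31.pdf: the lowest detected content is far up the page.
mad = cropped_box([0, 0, 432, 1584], [0.0, 1393.920053, 432.000016, 1576.80006], retain)
print(mad)   # [0.0, 1393.920053, 432.0, 1584.0]
```

Under this reading, the 6 bp bottom pre-crop works because it excludes whatever Ghostscript detects at 0.96 bp before the bounding box is computed.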

sant527 commented 4 years ago

The -ap4 0 6 0 0 option worked (pre-cropping a bit beforehand). But I hope it will not crop away text if there is any within those 6 bp.