Belval / pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
MIT License

`Page rot` metadata and `size` param interact incorrectly in convert_from_path() #272

Open Crowfunder opened 7 months ago

Crowfunder commented 7 months ago

Describe the bug Converting a PDF with the size param, when the PDF's Page rot metadata rotates the page away from its stored orientation (90, 270, etc.), forces the rendered pages onto a template of the wrong orientation, e.g. a horizontal one even though the page is vertical. Any PDF viewer displays the PDF correctly, as a vertical one. As a result, half of the page is cut off and the remainder is squished.

To Reproduce Steps to reproduce the behavior:

import numpy as np
import cv2
from pdf2image import convert_from_path, pdfinfo_from_path

pdf_path = 'our pdf path'

# Return PDF rotation from its metadata
rotation = pdfinfo_from_path(pdf_path)['Page rot']
print(f'PDF rotation: {rotation}') 

# Get the pdf pages' images
images = convert_from_path(pdf_path, 600, size=(1653, 2338))

# Write all page images to files, numbering them from 1
for i, image in enumerate(images, start=1):
    cv2.imwrite(f'page{i}.jpg', np.array(image))

Expected behavior Rotation metadata and size param get applied correctly.

Screenshots An example page from a pdf with rotation


Notes: I'm well aware that this is probably an issue with Poppler, not with pdf2image, but there may be a workaround, or some info may be gathered here for a Poppler issue.

In theory, the issue would go away if the rotation were applied to the page content permanently instead of being stored as metadata.
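
One possible workaround along these lines is to read `Page rot` first and swap the requested width and height when the page is rotated by a quarter turn. The sketch below is untested against the PDF from this issue; `size_for_rotation` and `convert_rotation_aware` are hypothetical helper names, and it assumes every page shares the rotation that `pdfinfo_from_path` reports.

```python
def size_for_rotation(size, rotation):
    """Return `size` with width and height swapped for 90/270 degree rotations."""
    width, height = size
    if rotation % 180 == 90:
        return (height, width)
    return (width, height)


def convert_rotation_aware(pdf_path, dpi, size):
    """Hypothetical wrapper: adjust `size` for the 'Page rot' metadata, then convert."""
    # Imported lazily so size_for_rotation can be used without poppler installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    info = pdfinfo_from_path(pdf_path)
    # Assumption: 'Page rot' is present and applies to all pages.
    rotation = int(info.get("Page rot", 0))
    return convert_from_path(pdf_path, dpi, size=size_for_rotation(size, rotation))
```

If swapping the size does not help, another option is to render without `size` and resize the returned PIL images afterwards.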

Belval commented 7 months ago

Could you try to manually run poppler on the asset? Something like:

pdftoppm -r 200 your_asset.pdf out

As you pointed out, this might be an issue with poppler, but I'd like to confirm first. You can also try pdftocairo and see if the orientation is correct in that case.

Crowfunder commented 7 months ago

pdftoppm -r 200 your_asset.pdf out

This one worked perfectly.

Belval commented 7 months ago

Ok so the issue is with pdf2image somehow. Can you share the asset?
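
One caveat: the manual command renders by resolution only, while the failing call also passes `size`. As far as I understand, pdf2image turns a size tuple into pdftoppm's `-scale-to-x` and `-scale-to-y` options, but treat that mapping as an assumption rather than a confirmed detail. The sketch below builds a closer manual reproduction; `pdftoppm_command` is a hypothetical helper, not part of pdf2image.

```python
import shlex


def pdftoppm_command(pdf_path, out_prefix, dpi=200, size=None):
    """Build a pdftoppm argument list, optionally with explicit scaling."""
    args = ["pdftoppm", "-r", str(dpi)]
    if size is not None:
        width, height = size
        # Assumed equivalent of pdf2image's size=(width, height).
        args += ["-scale-to-x", str(int(width)), "-scale-to-y", str(int(height))]
    args += [pdf_path, out_prefix]
    return args


# Print a command line that can be pasted into a shell.
print(shlex.join(pdftoppm_command("your_asset.pdf", "out", dpi=600, size=(1653, 2338))))
```

If the output of this scaled command is also cut off, the problem is more likely in how poppler combines scaling with page rotation than in pdf2image itself.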

Crowfunder commented 7 months ago

Forgot to mention that pdftocairo works fine. The PDF in question: https://wormhole.app/kRZQl#5lMmzZ6BtD7RFIGOOaTOsw

tenberg commented 5 months ago

I just ran into a similar issue, also with dpi not being set correctly. Not sure if this helps the debug process, but in my code I decided to do the following:

page[0].save(f"{working_path}{pdf[0:pdf.find('.')]}.tif", "TIFF", dpi=300)

and saw this error in PIL/TiffImagePlugin.py:

ifd[RESOLUTION_UNIT] = 2
ifd[X_RESOLUTION] = dpi[0]
ifd[Y_RESOLUTION] = dpi[1]

which led me to believe dpi should be a 2-element list. So I then tried:

page[0].save(f"{working_path}{pdf[0:pdf.find('.')]}.tif", "TIFF", dpi=[300, 300])

and when I checked the .tif in Preview, the resolution was correct at 300dpi instead of 72.

Just to sum up, I converted an 11 x 8.5 PDF to TIFF using the following lines. I removed dpi=300 from convert_from_path and passed it to save as a 2-element list instead:

page = convert_from_path(f"{working_path}{pdf}", size=(3300, 2550))
page[0].save(f"{working_path}{pdf[0:pdf.find('.')]}.tif", "TIFF", dpi=[300, 300])

Hope this helps.
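
The arithmetic behind this approach can be sketched as follows: the pixel size passed to `size` is the page size in inches times the target dpi, and the same dpi is then written into the TIFF as an (x, y) pair. `pixel_size` and `save_tiff_with_dpi` are hypothetical helper names, and `image` is assumed to be a PIL image such as the ones `convert_from_path` returns.

```python
def pixel_size(width_in, height_in, dpi):
    """Convert a page size in inches to a pixel size at the given dpi."""
    return (round(width_in * dpi), round(height_in * dpi))


def save_tiff_with_dpi(image, out_path, dpi):
    """Save a PIL image as TIFF, storing the resolution as an (x, y) pair."""
    # The TIFF writer indexes dpi[0] and dpi[1], so pass a pair, not an int.
    image.save(out_path, "TIFF", dpi=(dpi, dpi))


# An 11 x 8.5 inch page at 300 dpi.
print(pixel_size(11, 8.5, 300))  # → (3300, 2550)
```

Note that this only records the resolution in the saved file; it does not change how the page is rendered or rotated.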