Belval / pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
MIT License

Lossless conversion #266

Open 2V3EvG4LMJFdRe opened 1 year ago

2V3EvG4LMJFdRe commented 1 year ago

I need a reliable script that converts images to PDF and then another to revert the process.

To Reproduce

  1. My first script uses img2pdf to convert a set of images totalling 34.3 MB into a 34.3 MB PDF file.
  2. My second script uses pdf2image to convert the PDF file "back" into images:
export PATH=/usr/local/bin:$PATH
/usr/local/bin/python3 <<'EOF' - "$@"

from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)

import tempfile
with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path(
        '/Users/user/test.pdf',
        thread_count=2,
        dpi=300,
        fmt='png',
        use_pdftocairo=True,
        jpegopt={"quality": 100, "optimize": True},
        output_folder='/Users/user/testpdf',
    )
EOF

Describe the bug

The result is a series of PPM files that would amount to a 500 MB PDF: lossless quality, but too big. Using fmt='jpeg' outputs files that would amount to a 24.3 MB PDF, with a visible drop in quality when zooming in. Is there a way to create better-quality JPEG files, closer to the originals?
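For a rough sense of why the uncompressed output gets so large, here is a minimal sketch of the arithmetic. It assumes 8-bit RGB pixels with no compression (as in a PPM file, ignoring its small header) and a hypothetical A4 page size; the thread does not state the real page dimensions, and `raw_rgb_bytes` is an illustrative helper, not part of pdf2image.

```python
def raw_rgb_bytes(width_in, height_in, dpi, channels=3):
    """Approximate uncompressed size, in bytes, of one page rasterized at `dpi`.

    Assumes one byte per channel (8-bit RGB) and no compression.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels


# Hypothetical A4 page (8.27 x 11.69 inches) at the dpi=300 used in the script.
page_bytes = raw_rgb_bytes(8.27, 11.69, 300)
print(f"{page_bytes / 1e6:.1f} MB per page")  # prints "26.1 MB per page"
```

Under these assumptions, about twenty such pages would already reach roughly 500 MB, regardless of how well the source images were compressed.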

2V3EvG4LMJFdRe commented 1 year ago

I've been doing some tests, and even though I could set jpegopt={"quality": 100, "optimize": True}, it seems that the PPM export isn't actually lossless to begin with:

Original PDF: [image: pdf]

pdf2image PPM: [image: ppm]

It's noticeably more blurry.
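One possible cause of this kind of blur, which the thread does not confirm, is resampling: the page is rasterized at a DPI that does not match the pixel grid of the embedded image. The sketch below estimates the DPI at which a full-page image would map one-to-one onto rendered pixels. The numbers are hypothetical and `native_dpi` is an illustrative helper, not a pdf2image function.

```python
def native_dpi(image_width_px, page_width_pt):
    """DPI at which a full-width embedded image renders at its own pixel size.

    PDF user-space units default to 1/72 inch, so the page width in inches
    is page_width_pt / 72.
    """
    return image_width_px * 72.0 / page_width_pt


# Hypothetical: a 2550 px wide image filling a US Letter page (612 pt wide).
print(native_dpi(2550, 612))  # prints 300.0
```

If the value computed for the real files differs from the `dpi` passed to `convert_from_path`, the renderer has to scale the image, which can soften it even when the output format itself is lossless.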

2V3EvG4LMJFdRe commented 1 year ago

It's a lot better with use_pdftocairo=True. Both JPEG and PNG output look much better, but the results are still not identical to the original files. I wonder if such a process is possible at all.
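A different approach that may get closer to a true round trip is to extract the embedded image streams instead of re-rendering the pages. img2pdf normally embeds JPEG data without re-encoding it, and poppler's `pdfimages` tool with the `-all` flag writes embedded images in their native format where it can. The sketch below only builds the command; the output prefix `page` is a hypothetical choice, and whether the extracted files match the originals byte for byte depends on how the PDF was produced.

```python
import shlex


def pdfimages_command(pdf_path, output_prefix):
    """Build a poppler `pdfimages` invocation that extracts embedded images.

    `-all` asks pdfimages to keep each image in its native format
    (for example JPEG) instead of converting it to PPM.
    """
    return ["pdfimages", "-all", pdf_path, output_prefix]


# Paths reuse the ones from the script above; "page" is a hypothetical prefix.
cmd = pdfimages_command("/Users/user/test.pdf", "/Users/user/testpdf/page")
print(shlex.join(cmd))

# To actually run it (requires poppler's pdfimages on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

This sidesteps pdf2image entirely, so it only helps when each page is essentially one embedded image, as in an img2pdf-generated file.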

lbr991 commented 1 year ago

Is there a way to do lossless conversion? If not, what is the best way to get as close to lossless as possible, other than specifying PNG output?