Belval / pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
MIT License

Page number duplicated in multi-page PDFs #247

Open kym6464 opened 1 year ago

kym6464 commented 1 year ago

Describe the bug

Given a multi-page PDF, the page number is encoded twice in the output file name: once by pdf2image and again by pdftoppm/pdftocairo.

To Reproduce

Steps to reproduce the behavior:

(1) Download multipage.pdf

(2) Run this code from the same directory as multipage.pdf:

import pathlib
from pdf2image import convert_from_path

pdf_file = pathlib.Path(r"./multipage.pdf")
convert_from_path(pdf_file, output_folder=".", output_file=pdf_file.stem, fmt='jpeg')

(3) The previous step should produce 10 JPG files. Notice that each filename follows the format: {PPM-root}{PPPP}-{number}.jpg

Expected behavior

Filenames should have the page number encoded only once (which the pdfto* tools already handle): {PPM-root}-{number}.jpg
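To make the naming concrete, here is a minimal sketch of how the two parts of the filename appear to combine. It reuses the counter_generator body quoted later in this thread (without the @threadsafe decorator) and only imitates the "-NN" suffix seen in the file tree; simulated_output_names is a hypothetical helper, not part of pdf2image, and the assumption that a single prefix is drawn for the whole conversion holds only for the single-invocation case shown here.

```python
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Copy of the generator quoted later in this thread (decorator omitted)."""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)


def simulated_output_names(stem, page_count, ext="jpg"):
    """Hypothetical helper: imitate the names shown in the file tree."""
    # pdf2image side (assumption): one prefix is drawn for the conversion run.
    ppm_root = next(counter_generator(stem))
    # pdfto* side (imitated from the listing): "-" plus a zero-padded page number.
    width = len(str(page_count))
    return [
        f"{ppm_root}-{str(page).zfill(width)}.{ext}"
        for page in range(1, page_count + 1)
    ]


names = simulated_output_names("multipage", 10)
print(names[0], names[-1])  # multipage0001-01.jpg multipage0001-10.jpg
```

The "0001" comes from the counter and the "-01" from the page suffix, which is why the name looks like it carries two page numbers.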

Screenshots

File tree showing outputs for pdf2image, pdftoppm, and pdftocairo:

│   driver.py
│   multipage.pdf
│
├───output_pdf2image
│       multipage0001-01.jpg
│       multipage0001-02.jpg
│       multipage0001-03.jpg
│       multipage0001-04.jpg
│       multipage0001-05.jpg
│       multipage0001-06.jpg
│       multipage0001-07.jpg
│       multipage0001-08.jpg
│       multipage0001-09.jpg
│       multipage0001-10.jpg
│
├───output_pdftocairo
│       multipage-01.jpg
│       multipage-02.jpg
│       multipage-03.jpg
│       multipage-04.jpg
│       multipage-05.jpg
│       multipage-06.jpg
│       multipage-07.jpg
│       multipage-08.jpg
│       multipage-09.jpg
│       multipage-10.jpg
│
└───output_pdftoppm
        multipage-01.jpg
        multipage-02.jpg
        multipage-03.jpg
        multipage-04.jpg
        multipage-05.jpg
        multipage-06.jpg
        multipage-07.jpg
        multipage-08.jpg
        multipage-09.jpg
        multipage-10.jpg

Workaround

I think the issue is with counter_generator. If we pass a generator for output_file, then counter_generator is never called and we can produce the expected outputs:

import pathlib
from pdf2image import convert_from_path

pdf_file = pathlib.Path(r"./multipage.pdf")

def constant_generator():
    while True:
        yield pdf_file.stem

convert_from_path(pdf_file, output_folder=".", output_file=constant_generator(), fmt='jpeg')
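For files that have already been written with the extra counter, a rename pass is another option. The sketch below is an alternative to the generator workaround, not something pdf2image provides: strip_counter and rename_outputs are hypothetical helpers, and the four-digit counter width is an assumption taken from the names in the file tree.

```python
import re
from pathlib import Path


def strip_counter(filename, stem, padding=4):
    """Drop the fixed-width counter that follows `stem`, if it is present."""
    # Matches e.g. "multipage0001-01.jpg" and captures "-01.jpg".
    pattern = re.compile(rf"^{re.escape(stem)}\d{{{padding}}}(-\d+\.\w+)$")
    match = pattern.match(filename)
    return stem + match.group(1) if match else filename


def rename_outputs(folder, stem, padding=4):
    """Rename every matching file in `folder`; other files are left alone."""
    for path in Path(folder).iterdir():
        new_name = strip_counter(path.name, stem, padding)
        if new_name != path.name:
            path.rename(path.with_name(new_name))


print(strip_counter("multipage0001-01.jpg", "multipage"))  # multipage-01.jpg
```

Names that do not match the pattern, including ones that are already in the expected form, are returned unchanged, so the pass can be run more than once.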

jerryrelmore commented 6 months ago

I saw this behavior on a project yesterday - like you, I wasn't expecting that output in the file names. I checked generators.py to look at the counter_generator function. If you look more closely at the output file names, it's not duplicating page numbers - rather, it's appending the number of the thread that handles the page conversion.

A simple fix is to change this in generators.py:

@threadsafe
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Returns a joined prefix, iteration number, and suffix"""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)

to:

@threadsafe
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Returns a joined prefix and suffix, without the iteration number"""
    # padding_goal is unused here; it is kept so the signature stays the same
    while True:
        yield str(prefix) + str(suffix)

Looks like there's a PR out waiting on merge to do just that and a bit more.
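To see what the change does in isolation, here is a small side-by-side sketch. Both functions are standalone copies written for illustration (the @threadsafe decorator is omitted, and the names counter_generator_current and counter_generator_proposed are hypothetical), so this shows only what the generators yield, not how pdf2image consumes them.

```python
from itertools import islice


def counter_generator_current(prefix="", suffix="", padding_goal=4):
    """Current behavior: prefix + zero-padded counter + suffix."""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)


def counter_generator_proposed(prefix="", suffix="", padding_goal=4):
    """Proposed behavior: prefix + suffix, with no counter."""
    while True:
        yield str(prefix) + str(suffix)


current = list(islice(counter_generator_current("multipage"), 3))
proposed = list(islice(counter_generator_proposed("multipage"), 3))
print(current)   # ['multipage0001', 'multipage0002', 'multipage0003']
print(proposed)  # ['multipage', 'multipage', 'multipage']
```

With the proposed version every draw returns the same prefix, so distinct filenames would then rely entirely on the page suffix added by pdftoppm/pdftocairo. Whether that is safe when several conversions share a prefix is not confirmed in this thread.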