Open kym6464 opened 1 year ago
I saw this behavior on a project yesterday - like you, I wasn't expecting that output in the file names. I checked generators.py to look at the counter_generator function. If you look more closely at the output file names, it's not duplicating page numbers - rather, it's appending the number of the thread that handles the page conversion.
A simple fix is to change this in generators.py:
@threadsafe
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Returns a joined prefix, iteration number, and suffix"""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)
to:
@threadsafe
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Returns a joined prefix, iteration number, and suffix"""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(suffix)
Looks like there's a PR out waiting on merge to do just that and a bit more.
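To make the effect of that change concrete, here is a small stand-alone sketch. It does not call pdf2image or poppler: simulate_names is a hypothetical stand-in that assumes one name is drawn from the generator per worker thread and that the pdfto* tool then appends a two-digit "-NN" page suffix. The @threadsafe decorator is left out to keep the example self-contained, and fixed_counter_generator is just an illustrative name for the proposed version.

```python
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Original behaviour: prefix + zero-padded call counter + suffix."""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)


def fixed_counter_generator(prefix="", suffix="", padding_goal=4):
    """Proposed behaviour: the counter no longer appears in the name."""
    while True:
        yield str(prefix) + str(suffix)


def simulate_names(name_gen, page_ranges):
    """Hypothetical model of the naming: one root per thread, then '-NN.jpg' per page."""
    names = []
    for first, last in page_ranges:
        root = next(name_gen)  # drawn once per worker thread (assumption)
        for page in range(first, last + 1):
            # Stand-in for the page suffix added by pdftoppm/pdftocairo
            names.append(f"{root}-{page:02d}.jpg")
    return names
```

With two simulated threads handling pages 1-2 and 3-4, simulate_names(counter_generator("out"), [(1, 2), (3, 4)]) gives out0001-01.jpg, out0001-02.jpg, out0002-03.jpg, out0002-04.jpg, while the same call with fixed_counter_generator("out") gives out-01.jpg through out-04.jpg.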
Describe the bug
Given a multi-page PDF, the page number is encoded twice in the output file name: once by pdf2image and again by pdftoppm/pdftocairo.
To Reproduce
Steps to reproduce the behavior:
(1) Download multipage.pdf
(2) Run this code from the same directory as multipage.pdf:
(3) The previous step should produce 10 JPG files. Notice that each filename follows the format:
{PPM-root}{PPPP}-{number}.jpg
Expected behavior
Filenames should have the page number encoded only once (which the pdfto* tools already handle):
{PPM-root}-{number}.jpg
Screenshots
File tree showing outputs for pdf2image, pdftoppm, and pdftocairo:
Workaround
I think the issue is with counter_generator. If we pass a generator for output_file, then counter_generator is never called and we can produce the expected outputs:
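The workaround might look roughly like the sketch below. It is not the original snippet from this issue: constant_name_generator, convert_with_plain_names, the "page" root, and the "out" folder are all hypothetical names chosen for illustration. It assumes, as described above, that pdf2image uses a generator passed as output_file directly instead of wrapping the name in counter_generator.

```python
import types


def constant_name_generator(root):
    """Yield the same file-name root on every request (no counter added)."""
    while True:
        yield root


def convert_with_plain_names(pdf_path, output_folder, root="page"):
    """Convert a PDF to JPEGs, letting pdftoppm/pdftocairo add the only page number."""
    # Imported lazily so the generator above can be tried without pdf2image installed
    from pdf2image import convert_from_path

    names = constant_name_generator(root)
    # A generator function call returns a real generator object
    assert isinstance(names, types.GeneratorType)
    return convert_from_path(
        pdf_path,
        output_folder=output_folder,
        fmt="jpeg",
        output_file=names,
    )
```

Calling convert_with_plain_names("multipage.pdf", "out") should then, under that assumption, leave files named in the {PPM-root}-{number}.jpg pattern in the out folder.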