Belval / pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
MIT License

Inconsistent results between servers with the same code and PDF file #260

Open zekriHichem opened 1 year ago

zekriHichem commented 1 year ago

Describe the bug

I am running the same Python 3.8.6 code, with the same PDF file and the same version of pdf2image (1.16.2), on two different servers. On one server the code works and produces the expected output (a list of PIL image objects). On the other server the code returns an empty list.

I have checked the versions of Python and all dependencies (including pdf2image) on both servers to make sure they are the same, but the issue persists. I have also run the code on both servers with a different PDF file to see whether the issue is specific to one PDF file, but this did not help.

I am not seeing any error messages or logs that indicate what might be causing the issue on the server that returns an empty list.

Can you provide any guidance on how to troubleshoot this issue further? Is there anything specific about the pdf2image library or the environment that might be causing inconsistencies in the output between servers?

To Reproduce

Unfortunately, I am unable to reproduce the issue consistently. When running the same code with the same PDF file and the same version of the pdf2image library on two different servers, one server produces the expected output (a list of PIL image objects), but the other server returns an empty list.

If you have any guidance on how to troubleshoot this issue further, or any ideas as to what might be causing inconsistencies in the output between servers, I would greatly appreciate it.

Expected behavior

The code should produce the same output (a list of PIL image objects) on both servers.

Actual Results

On one server, the code works perfectly and produces the expected output. On the other server, the code returns an empty list.

Code

The b64_pdf is not empty.

import base64
import logging
from io import BytesIO

from pdf2image import convert_from_bytes

logger = logging.getLogger(__name__)

# b64_pdf, UTF_8 and DecompressionBombError come from the rest of the application (not shown).
file = BytesIO(base64.b64decode(b64_pdf.encode(UTF_8)))
logger.info(f"------------01-----------{file is None}")
try:
    images = convert_from_bytes(file.read(), fmt="JPEG", dpi=300, thread_count=4)
    logger.info(f"-----------02----------{images is None}")
    logger.info(f"-----------02----------{images == []}")
except Exception as e:
    logger.exception(f"Error on given input : {e}")
    raise DecompressionBombError(
        message="Exceeded the max pixel count for Pillow"
    )


zekriHichem commented 1 year ago

The issue you are experiencing may be related to a difference in the amount of memory available on the two servers. Converting a PDF to images with pdf2image can take a considerable amount of memory, depending on the size of the PDF and the number of pages to be rendered.
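To get a feel for the numbers, here is a rough sketch that estimates how much memory the decoded bitmaps alone would need. It assumes uncompressed 8-bit RGB images (3 bytes per pixel) and page sizes given in PDF points (1/72 inch). The function names are hypothetical and not part of pdf2image, and the result is only a lower bound: it ignores the memory used by the pdftoppm processes and by the encoded data passing between them and Python.

```python
def estimate_page_bytes(width_pt: float, height_pt: float, dpi: int, channels: int = 3) -> int:
    """Approximate size in bytes of one rendered page held as a raw bitmap."""
    # PDF page sizes are expressed in points; 72 points make one inch.
    width_px = round(width_pt / 72 * dpi)
    height_px = round(height_pt / 72 * dpi)
    return width_px * height_px * channels


def estimate_total_bytes(page_count: int, width_pt: float, height_pt: float, dpi: int) -> int:
    """Approximate size of all pages if every bitmap is kept in memory at once."""
    return page_count * estimate_page_bytes(width_pt, height_pt, dpi)


# An A4 page is about 595 x 842 points.
a4_page = estimate_page_bytes(595, 842, dpi=300)
print(f"one A4 page at 300 DPI: about {a4_page / 1e6:.0f} MB")
print(f"100 such pages: about {estimate_total_bytes(100, 595, 842, 300) / 1e9:.1f} GB")
```

At 300 DPI a single A4 page comes to roughly 26 MB, so a long document can plausibly exceed the memory of a small server or container.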

It is important to check that the servers have similar memory specifications and that the same amount of memory is allocated to the Python process when running your code. If one of the servers has less memory or if the amount allocated to the Python process is lower, this could explain why the code fails on that server.
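One way to compare the two servers is to run the same small report on each and diff the output. The sketch below uses only the standard library and assumes a Linux-like host: the cgroup paths are the usual cgroup v2 and v1 locations, which may not exist on every system, and `memory_report` is a hypothetical helper name.

```python
import os
import sys


def memory_report() -> dict:
    """Collect the memory limits visible to the current Python process."""
    report = {
        "platform": sys.platform,
        "physical_bytes": None,
        "rlimit_as_soft": None,
        "cgroup_limit": None,
    }

    # Total physical memory, where the platform exposes it through sysconf.
    try:
        report["physical_bytes"] = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        pass

    # Per-process address space limit (Unix only); "unlimited" if none is set.
    try:
        import resource

        soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
        report["rlimit_as_soft"] = "unlimited" if soft == resource.RLIM_INFINITY else soft
    except (ImportError, ValueError, OSError):
        pass

    # Container memory limit: cgroup v2 first, then cgroup v1.
    for path in ("/sys/fs/cgroup/memory.max", "/sys/fs/cgroup/memory/memory.limit_in_bytes"):
        try:
            with open(path) as fh:
                report["cgroup_limit"] = fh.read().strip()
            break
        except OSError:
            continue

    return report


print(memory_report())
```

A difference in `cgroup_limit` is worth checking first if either server runs the code inside a container.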

That was my problem: I increased the memory and that worked for me.

However, the call does not raise any error or exception; it just returns an empty list. Normally, a memory problem would be expected to raise an 'out of memory' exception or something similar.
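Since the failure is silent, one possible workaround is to treat an empty result as an error in the calling code. The following is a minimal sketch, not part of pdf2image: `convert_or_raise` and `EmptyConversionError` are hypothetical names, and the converter is passed in as a callable so the guard itself has no dependency on the library.

```python
from typing import Any, Callable, List


class EmptyConversionError(RuntimeError):
    """Raised when a converter returns no pages without raising an error itself."""


def convert_or_raise(convert: Callable[..., List[Any]], data: bytes, **kwargs: Any) -> List[Any]:
    """Call convert(data, **kwargs) and fail loudly if the result is empty."""
    if not data:
        raise ValueError("input PDF bytes are empty")
    pages = convert(data, **kwargs)
    if not pages:
        raise EmptyConversionError(
            f"converter returned no pages for {len(data)} bytes of input; "
            "check available memory and the installed poppler version"
        )
    return pages


# Intended use with pdf2image (assumes the library is installed):
#   from pdf2image import convert_from_bytes
#   images = convert_or_raise(convert_from_bytes, pdf_bytes, fmt="JPEG", dpi=300)
```

This does not explain why the list is empty, but it turns the silent failure into an exception that shows up in the logs of the affected server.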

andrew-cybsafe commented 4 months ago

I've also come across this. Looking through the poppler bug list, it could be related to https://gitlab.freedesktop.org/poppler/poppler/-/issues/1403.
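Because pdf2image wraps the pdftoppm utility from poppler, two servers with identical Python packages can still run different poppler builds. A small sketch for comparing them, assuming `pdftoppm` is on the `PATH` and prints its version banner when called with `-v` (poppler's command line tools write it to stderr); `poppler_version` is a hypothetical helper name.

```python
import shutil
import subprocess


def poppler_version(binary: str = "pdftoppm") -> dict:
    """Return the resolved path and version banner of a poppler command line tool."""
    path = shutil.which(binary)
    if path is None:
        # The tool is not installed or not on PATH for this process.
        return {"path": None, "version": None}
    proc = subprocess.run([path, "-v"], capture_output=True, text=True, timeout=10)
    # The banner normally goes to stderr; fall back to stdout just in case.
    banner = (proc.stderr or proc.stdout).strip()
    first_line = banner.splitlines()[0] if banner else None
    return {"path": path, "version": first_line}
```

Running `print(poppler_version())` on both servers and comparing the output shows whether they use the same poppler release and the same binary location.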