Belval / pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
MIT License
1.64k stars 195 forks source link

No cleanup of temporary directory in case of an error for pdftocairo #224

Closed stefan6419846 closed 2 years ago

stefan6419846 commented 2 years ago

Describe the bug

If pdf2image is used with use_pdftocairo=True, a temporary output directory is being created if I am not passing one myself. This directory will not be cleaned up when an error occurs during runtime.

To Reproduce

Steps to reproduce the behavior:

  1. Generate a faulty PDF (example: PDF with a PNG image that includes a color profile, as Pillow is not able to read such files).

  2. pdf2image.convert_from_path(pdf_path, use_pdftocairo=True)

  3. Error is produced:

    Traceback (most recent call last):
     File "/home/user/path/to/test/file.py", line 43, in __main__
       print(pdf2image.convert_from_path(path, use_pdftocairo=True))
     File "/home/user/lib/python3.6/site-packages/pdf2image/pdf2image.py", line 212, in convert_from_path
       output_folder, uid, final_extension, paths_only, in_memory=auto_temp_dir
     File "/home/user/lib/python3.6/site-packages/pdf2image/pdf2image.py", line 504, in _load_from_output_folder
       images.append(Image.open(os.path.join(output_folder, f)))
     File "/home/user/lib/python3.6/site-packages/PIL/Image.py", line 2818, in open
       raise IOError("cannot identify image file %r" % (filename if filename else fp))
    OSError: cannot identify image file '/tmp/tmpvn5xet47/c1b1aabe-b0d6-4127-bc15-e30987d6b948-6.png'
  4. Directory /tmp/tmpvn5xet47/ is not being removed.

Expected behavior

pdf2image should be more transparent about the side-effects of the use_pdftocairo option. Ideally auto-created temporary directories would always be cleaned up in case of an error or at least accessible to the caller in some way to perform manual cleanup; if preferred, there might be a new option which optionally performs this cleanup.

Desktop

Belval commented 2 years ago

So to clarify, if you don't use pdftocairo the directory is cleaned up even if pdf2image raises an exception?

stefan6419846 commented 2 years ago

No:

Belval commented 2 years ago

Ok now I remember (I wrote that part of the code a while ago so my memory can be spotty at times). This behaviour is caused by pdftocairo not having stdout. It will automatically generate files on disk instead of in memory which is the pdftoppm behaviour.

I agree that the temp directory should be discarded on exception, I will add a try/except to handle this edge case.

Belval commented 2 years ago

The PR should fix your issue and make the behaviour coherent, will be merging and releasing this soon.