Belval / pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
MIT License

PIL.UnidentifiedImageError #239

Open camipozas opened 1 year ago

camipozas commented 1 year ago

Describe the bug

The same code behaves differently on my computer and on an AWS EC2 m5.xlarge instance.

Expected behavior

Both environments should behave the same. The code works on my computer, but when I run it on the instance, PIL cannot identify the image files that pdf2image generates.

AWS Log

Process Process-1:
Traceback (most recent call last):
  File "/opt/build/app/read_contracts.py", line 67, in read_contracts
    text_contract = read_pdf(filepath)
  File "/opt/build/app/read_contracts.py", line 27, in read_pdf
    images_from_path = convert_from_path(pdf_path=pdf,
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 218, in convert_from_path
    images += _load_from_output_folder(
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 517, in _load_from_output_folder
    images.append(Image.open(os.path.join(output_folder, f)))
  File "/usr/local/lib/python3.9/site-packages/PIL/Image.py", line 3123, in open
    raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file '/tmp/tmpqo3mn0om/2d473b9f-5b6c-46f0-9220-a4bf51124f6e-03.ppm'


Additional context

Function error

import tempfile

from pdf2image import convert_from_path
from tqdm import tqdm

# image_to_text is not shown in the report; presumably the reporter's own OCR helper.


def read_pdf(pdf):
    """
    Convert a PDF file to images, then extract the text from those images.
    :param pdf: The path to the PDF file you want to convert
    :return: A string with the text of the PDF
    """
    full_text = ''
    with tempfile.TemporaryDirectory() as path:
        images_from_path = convert_from_path(pdf_path=pdf,
                                             dpi=350,
                                             output_folder=path)

        # OCR each page while the temporary folder still exists
        for page in tqdm(images_from_path):
            full_text += image_to_text(page, lang='spa')
    return full_text

I printed the filenames to check whether it was a path issue, but they are displayed correctly. I am also using multiprocessing; again, it works locally but not on the instance.
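Since the paths look right, one further check is whether the intermediate files that pdf2image writes to output_folder are complete. The sketch below is only a diagnostic idea, not something taken from this thread: it lists each file with its size and whether it starts with a binary Netpbm magic number (P4, P5 or P6). The helper name inspect_output_folder is hypothetical.

```python
import os

# Magic numbers of the binary Netpbm formats: bitmap, graymap, pixmap.
NETPBM_MAGIC = (b"P4", b"P5", b"P6")


def inspect_output_folder(folder):
    """Return (filename, size_in_bytes, has_netpbm_magic) for each file in folder."""
    report = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        # Only the first two bytes are needed to recognise the format.
        with open(path, "rb") as fh:
            magic = fh.read(2)
        report.append((name, os.path.getsize(path), magic in NETPBM_MAGIC))
    return report
```

It would have to be called while the temporary directory still exists, for example from an except block around convert_from_path. A file that is empty or lacks the magic number would suggest that the page was not fully written, although that alone does not explain why.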

camipozas commented 1 year ago

@jedwards94

Belval commented 1 year ago

Is this only happening with a single PDF? If you run pdftoppm -r 200 -jpeg your_file.pdf out does it show any warnings?
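For anyone who wants to run that check from Python on the instance, here is a minimal sketch that calls the same command and captures whatever pdftoppm prints to stderr. The function names build_pdftoppm_command and run_pdftoppm are hypothetical, and the sketch assumes pdftoppm is on PATH.

```python
import shutil
import subprocess


def build_pdftoppm_command(pdf_path, out_prefix, dpi=200):
    """Build the argument list for the pdftoppm call suggested above."""
    return ["pdftoppm", "-r", str(dpi), "-jpeg", pdf_path, out_prefix]


def run_pdftoppm(pdf_path, out_prefix, dpi=200):
    """Run pdftoppm and return (exit_code, stderr_text)."""
    if shutil.which("pdftoppm") is None:
        raise FileNotFoundError("pdftoppm was not found on PATH")
    result = subprocess.run(
        build_pdftoppm_command(pdf_path, out_prefix, dpi),
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the captured output to str
    )
    return result.returncode, result.stderr
```

Calling run_pdftoppm("your_file.pdf", "out") on both machines and comparing the returned stderr text may show whether the tool itself reports a problem on the instance.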

asanaa8 commented 1 year ago

same error as @camipozas

camipozas commented 1 year ago

@asanaa8 I fixed it with this