PDFPageCountError: Unable to get page count on Linux

stockpilothq commented 1 year ago

Hi everyone,

I've set up a project that uses pdf2image. I installed Poppler with Homebrew and it works like a charm locally (on macOS).

Production, on the other hand, is driving me crazy. I set up a Dockerfile and added the following command: RUN apt update && apt-get install -y poppler-utils

CLI outputs:

$ find / -name poppler-utils
/usr/share/lintian/overrides/poppler-utils
/usr/share/doc/poppler-utils

$ find / -name poppler
/usr/local/lib/python3.10/dist-packages/poppler
/usr/share/poppler

$ pdfinfo
pdfinfo version 22.02.0
Copyright 2005-2022 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdfinfo [options] <PDF-file>
  -f <int>             : first page to convert
  -l <int>             : last page to convert
  -box                 : print the page bounding boxes
  -meta                : print the document metadata (XML)
  -custom              : print both custom and standard metadata
  -js                  : print all JavaScript in the PDF
  -struct              : print the logical document structure (for tagged files)
  -struct-text         : print text contents along with document structure (for tagged files)
  -isodates            : print the dates in ISO-8601 format
  -rawdates            : print the undecoded date strings directly from the PDF file
  -dests               : print all named destinations in the PDF
  -url                 : print all URLs inside PDF objects (does not scan text content)
  -enc <string>        : output text encoding name
  -listenc             : list available encodings
  -opw <string>        : owner password (for encrypted files)
  -upw <string>        : user password (for encrypted files)
  -v                   : print copyright and version info
  -h                   : print usage information
  -help                : print usage information
  --help               : print usage information
  -?                   : print usage information

Everything seems to be installed correctly. But the moment I call convert_from_path I get the following error:

PDFPageCountError: Unable to get page count. Internal Error: Cannot handle URI 'https://project.blob.core.windows.net/media/invoices/pdf/51200617.pdf'.

Python code:

        try:
            file_path = f'https://project.blob.core.windows.net/media/invoices/pdf/{file_name}'
            images = convert_from_path(file_path, 500)
            n=0
            cleaned_name = str(file_name)[:-4]
            for img in images:
                blob = BytesIO()
                img.save(blob, 'JPEG')
                img_entry = ImageEntry.objects.create(invoice=invoice)
                img_entry.img_file.save(f'{cleaned_name}-{n}.jpg', File(blob), save=True) 
                n+=1

        except PDFInfoNotInstalledError as err:
            print(f"PDFInfoNotInstalledError: {err}")   

        except PDFPageCountError as err:
            print(f"PDFPageCountError: {err}")  

        except PDFSyntaxError as err:
            print(f"PDFSyntaxError: {err}") 

        except Exception as err:
            print(f"Exception: {err}")

Docker-compose:
version: '3.4'

services:
  project:
    image: project.azurecr.io/project:latest
    platform: linux/x86_64
    build:
      context: .
      dockerfile: ./Dockerfile
    ports:
      - 8000:8000

The answers I find when searching for this error are all related to poppler_path and Windows, which does not help. I hope someone can help me with this issue.

Thanks in advance.

Belval commented 1 year ago

Internal Error: Cannot handle URI 'https://project.blob.core.windows.net/media/invoices/pdf/51200617.pdf'.

Seems like you are trying to parse a PDF hosted at a remote URL; this is not supported. You need to download the file locally (to disk or into memory) before trying to parse it.
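
For illustration, here is a minimal sketch of the "download first" approach, keeping the PDF in memory and handing the bytes to pdf2image's convert_from_bytes instead of passing the URL to convert_from_path. The helper names download_pdf and pdf_url_to_images, the timeout value, and the "%PDF-" header check are assumptions made for this example and are not part of pdf2image; it also assumes the blob URL is readable without extra authentication.

```python
import urllib.request


def download_pdf(url: str, timeout: float = 30.0) -> bytes:
    """Fetch the raw bytes of a PDF from a URL (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    # Cheap sanity check: PDFs normally begin with "%PDF-". This is stricter
    # than some readers, which tolerate a few junk bytes before the header.
    if not data.startswith(b"%PDF-"):
        raise ValueError(f"{url} did not return something that looks like a PDF")
    return data


def pdf_url_to_images(url: str, dpi: int = 500, convert=None):
    """Download a PDF, then render its pages from memory (hypothetical helper)."""
    if convert is None:
        # Imported lazily so download_pdf stays usable without Poppler installed.
        from pdf2image import convert_from_bytes as convert
    return convert(download_pdf(url), dpi=dpi)
```

In the code from the question, this would mean replacing convert_from_path(file_path, 500) with something like pdf_url_to_images(file_path, dpi=500); the rest of the loop can stay as it is. Writing the downloaded bytes to a temporary file and calling convert_from_path on that path should work just as well.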

stockpilothq commented 1 year ago

Works, thanks so much!

Belval / pdf2image

PDFPageCountError: Unable to get page count on Linux #244