Belval / pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
MIT License

Wrong page range given: the first page (1) can not be after the last page (0). #234

Open camipozas opened 2 years ago

camipozas commented 2 years ago

Describe the bug: I am running a Docker image that reads a PDF, converts it to images and then to text (the documents are scanned), and I get the following error. Does anyone know why? I can't share the document :(

To Reproduce

  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 479, in pdfinfo_from_path
    raise ValueError
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/build/app/main.py", line 67, in <module>
    main()
  File "/opt/build/app/main.py", line 48, in main
    text_contract = read_pdf(contract)
  File "/opt/build/app/main.py", line 26, in read_pdf
    images_from_path = convert_from_path(
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 98, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 488, in pdfinfo_from_path
    raise PDFPageCountError(
pdf2image.exceptions.PDFPageCountError: Unable to get page count.
Syntax Error: Gen inside xref table too large (bigger than INT_MAX)
Syntax Error: Invalid XRef entry 3
Syntax Error: Top-level pages object is wrong type (null)
Command Line Error: Wrong page range given: the first page (1) can not be after the last page (0).

Additional context: Dockerfile

FROM python:3.9
ENV LANG en_US.UTF-8

WORKDIR /opt/build

RUN apt-get update && apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev pkg-config poppler-utils

ADD requirements.txt requirements.txt
RUN pip install -r requirements.txt
# Copy env variables
ADD .env .env

# trained models
ADD tessdata/ tessdata/
ENV TESSDATA_PREFIX /opt/build/tessdata/

ADD app/ app/
RUN mkdir input

ENTRYPOINT ["python"]
CMD ["app/main.py"]

Belval commented 2 years ago

Thank you for taking the time to fill the issue template, it's much easier to help.

Is this only with one or a few PDFs?

Also, can you run pdftoppm -r 200 -jpeg your_file.pdf out and see if that also gives you an error?
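The same check can be scripted, which makes it easier to run inside the container and capture what poppler prints. This is only a minimal sketch: `build_pdftoppm_command` and `run_pdftoppm` are hypothetical helper names, `your_file.pdf` and `out` are the placeholders from the command above, and it assumes `pdftoppm` is on the `PATH`.

```python
import subprocess


def build_pdftoppm_command(pdf_path, output_prefix="out", dpi=200):
    """Build the command suggested above: pdftoppm -r 200 -jpeg <pdf> <prefix>."""
    return ["pdftoppm", "-r", str(dpi), "-jpeg", str(pdf_path), str(output_prefix)]


def run_pdftoppm(pdf_path, output_prefix="out", dpi=200):
    """Run pdftoppm directly, bypassing pdf2image.

    Returns (exit_code, stderr_text). Raises FileNotFoundError if the
    pdftoppm binary is not installed or not on the PATH.
    """
    cmd = build_pdftoppm_command(pdf_path, output_prefix, dpi)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stderr
```

Calling `run_pdftoppm("your_file.pdf")` and printing both return values shows whether the "Syntax Error" lines come from poppler itself rather than from pdf2image.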

camipozas commented 2 years ago

Hello, I analyzed the PDFs that gave me the error and all of them were signed with DocuSign, although other DocuSign PDFs usually convert correctly. I don't know how to upgrade poppler-utils in Docker. I had read about this before: "Pdf2Image library failing to read pdf signed using docusign".

camipozas commented 2 years ago

Hello, I solved the problem. The solution is to start from an Ubuntu image, then install Python (in my case), and then install everything else. It's the only way for now... When I got inside the original container I saw this version of poppler:

poppler-utils:
  Installed: 20.09.0-3.1
  Candidate: 20.09.0-3.1
  Version table:
 * 20.09.0-3.1 500
        500 http://deb.debian.org/debian bullseye/main amd64 Packages
        100 /var/lib/dpkg/status

I know that I need 21.03.00 or newer, so after applying the solution, the image has:

poppler-utils:
  Installed: 22.02.0-2
  Candidate: 22.02.0-2
  Version table:
 * 22.02.0-2 500
        500 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages
        100 /var/lib/dpkg/status

If anyone has a question please contact me, happy to help.
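For anyone who wants to check the poppler version from Python rather than from the package manager, here is a rough sketch. The helper names `parse_poppler_version` and `installed_poppler_version` are hypothetical, and it assumes the poppler tools report a line such as `pdftoppm version 22.02.0` when called with `-v` (both stderr and stdout are read, to be safe).

```python
import re
import subprocess


def parse_poppler_version(text):
    """Extract (major, minor, patch) from text such as 'pdftoppm version 22.02.0'.

    Returns None when no version string is found.
    """
    match = re.search(r"version\s+(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def installed_poppler_version(binary="pdftoppm"):
    """Run `<binary> -v` and parse whatever it prints.

    Raises FileNotFoundError if the binary is not on the PATH.
    """
    result = subprocess.run([binary, "-v"], capture_output=True, text=True)
    return parse_poppler_version(result.stderr + result.stdout)
```

Versions come back as tuples, so they compare naturally: `(20, 9, 0) < (21, 3, 0)` is true for the Debian bullseye package shown above.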

faltunik commented 2 years ago

What I still don't understand is what causes the miscount.

camipozas commented 2 years ago

@faltunik sorry, I don't know in detail what causes the issue. I only know the apparent cause and the solution.

Belval commented 2 years ago

This is a poppler issue unfortunately so there is not much that can be done on my side. I might add a check that raises a warning so that people are aware.

camipozas commented 2 years ago

If you want, I can add the documentation to your project. I can make a fork and then open a PR.

Belval commented 2 years ago

I appreciate the offer, but I am not sure what's the best way/place to document this yet.

It could be:

For the code warning, it would use the warnings module (https://docs.python.org/3/library/warnings.html#warnings.warn):

warnings.warn(f"Detected poppler version {poppler_version_major}.{poppler_version_minor} is known to fail on some PDFs in rare cases")

Code warning is more intrusive and might be overkill depending on how common this issue is.
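A minimal sketch of what such a check might look like, not something pdf2image is confirmed to ship. The function name `warn_if_old_poppler` and the `MIN_POPPLER` threshold are assumptions: 21.03 is only the minimum version mentioned earlier in this thread, where 20.09.0 failed and 22.02.0 worked.

```python
import warnings

# Assumption: 21.03 is the minimum version reported in this thread;
# it is not an officially documented cutoff.
MIN_POPPLER = (21, 3)


def warn_if_old_poppler(poppler_version_major, poppler_version_minor):
    """Emit a UserWarning if the detected poppler version is below MIN_POPPLER.

    Returns True when a warning was issued, False otherwise.
    """
    if (poppler_version_major, poppler_version_minor) < MIN_POPPLER:
        warnings.warn(
            f"Detected poppler version {poppler_version_major}.{poppler_version_minor} "
            "is known to fail on some PDFs in rare cases"
        )
        return True
    return False
```

Since `warnings.warn` defaults to `UserWarning`, users who find it too intrusive can silence it with the standard `warnings.filterwarnings` mechanism.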

puneetjindal commented 5 months ago

@camipozas How do you check whether a particular pdf is a scanned pdf or not?