PDFPageCountError: Unable to get page count. I/O Error: Couldn't open file 'C:\Users\cdragomir2\Desktop\dataiku\Non Phub Samples\New folder (3)\007-084841-1 to 31 Dec'22': No error.

Belval / pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object

MIT License

1.65k stars, 195 forks

PDFPageCountError: Unable to get page count. I/O Error: Couldn't open file 'C:\Users\cdragomir2\Desktop\dataiku\Non Phub Samples\New folder (3)\007-084841-1 to 31 Dec'22': No error. #251

Open Crispisu opened 1 year ago

Crispisu commented 1 year ago

Hi All, I am trying to use pdf2image, but I am getting this error:

PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'C:\Users\user_name\Desktop\folder_name\folder2_name\folder3_name\007-084841-1 to 31 Dec'22': No error.

It is confusing because it doesn't report an actual error; it just says 'No error'.

My code is:

import os

import pytesseract
from pdf2image import convert_from_path

pdf_path = "C:\\Users\\user_name\\Desktop\\folder_name\\folder2_name\\folder3_name\\007-084841-1 to 31 Dec'22"
doc = convert_from_path(pdf_path)
path, fileName = os.path.split(pdf_path)
fileBaseName, fileExtension = os.path.splitext(fileName)

for page_number, page_data in enumerate(doc):
    # convert_from_path returns PIL images, which pytesseract accepts directly
    txt = pytesseract.image_to_string(page_data)
    print("Page # {} - {}".format(page_number, txt))

Can anyone help me please?

jjbiggins commented 1 year ago

I investigated this a bit. More information would be helpful to nail it down.

What version of pdf2image are you using? And what Python version?

I don't have an easily accessible Windows machine, so I couldn't confirm, but it looks like Popen in the pdfinfo function is throwing an error. I couldn't replicate it, but I know similar issues have occurred in the past because pdfinfo was not in PATH. So, I would check that it's there first.
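As a quick way to run the PATH check suggested above, here is a minimal sketch using only the standard library. The helper name find_pdfinfo is hypothetical; the optional argument mirrors the idea of pointing pdf2image at a specific poppler folder, and passing it restricts the search to that folder.

```python
import shutil


def find_pdfinfo(poppler_path=None):
    """Return the full path of the pdfinfo executable, or None if not found."""
    # shutil.which searches PATH by default, or only `poppler_path` if given.
    return shutil.which("pdfinfo", path=poppler_path)


if __name__ == "__main__":
    location = find_pdfinfo()
    if location is None:
        print("pdfinfo not found on PATH; install poppler or add it to PATH")
    else:
        print("pdfinfo found at", location)
```

If this prints that pdfinfo was not found, the problem is likely the poppler installation and not the PDF itself.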

Aside from that, it appears stderr isn't being handled correctly. I believe if stderr=PIPE was replaced with stderr=STDOUT, which redirects stderr into the same pipe as stdout, it would work.
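To make the difference between the two stderr settings concrete, here is a small sketch that does not involve pdf2image at all. It spawns a throwaway Python child process (an assumption made purely for illustration) that writes one line to each stream, then captures the output both ways.

```python
import subprocess
import sys

# A child process that writes one line to stdout and one to stderr.
CHILD = [sys.executable, "-c",
         "import sys; print('out'); print('err', file=sys.stderr)"]


def run_separate():
    """stderr=PIPE: stdout and stderr are captured as two separate streams."""
    proc = subprocess.Popen(CHILD, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return out.decode(), err.decode()


def run_merged():
    """stderr=STDOUT: stderr is redirected into the stdout pipe."""
    proc = subprocess.Popen(CHILD, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, err = proc.communicate()
    # err is None here because nothing was piped separately.
    return out.decode(), err
```

With the merged variant, any code that parses the error message has to look in the stdout text, and the relative order of the two lines is not guaranteed.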

Also, Windows has the STARTUPINFO class, which affects how stdin, stdout, and stderr are set up for the child process. In the most up-to-date code in the repo, you'll notice that the process instances are created using STARTUPINFO.
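For readers unfamiliar with STARTUPINFO, the following is a rough sketch of how a process can be created with it on Windows. It is an illustration under assumptions, not a copy of the repository's code; the helper name make_startupinfo is hypothetical.

```python
import platform
import subprocess


def make_startupinfo():
    """Return a STARTUPINFO that hides the console window on Windows, else None."""
    if platform.system() != "Windows":
        return None  # subprocess.STARTUPINFO only exists on Windows
    startupinfo = subprocess.STARTUPINFO()
    # With this flag set, wShowWindow (default 0, i.e. hidden) is honoured.
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    return startupinfo
```

The result can be passed as subprocess.Popen(..., startupinfo=make_startupinfo()); Popen accepts None for this argument, so the same call works on other platforms.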

The pdfinfo function has evolved over the various versions of pdf2image, as has subprocess, particularly on Windows, from Python 3.7 until now. So, knowing those versions would help narrow down the issue.

Crispisu commented 1 year ago

@jjbiggins Thank you so much for looking into it. I have just managed to figure it out. It was a stupid mistake on my side, and even though I feel embarrassed to say what it was, I will say it in case someone else makes the same mistake: I forgot to add the file extension! :( Thank you once again for your help!
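For anyone hitting the same mistake, a small pre-check can point at the likely cause. The sketch below is only an illustration with a hypothetical helper name, suggest_pdf_candidates: when the given path is not a file, it lists files in the same folder whose names are the given name plus an extension.

```python
from pathlib import Path


def suggest_pdf_candidates(path_str):
    """Return sibling files named like `path_str` plus an extension."""
    path = Path(path_str)
    if path.is_file() or not path.parent.is_dir():
        return []  # nothing to suggest: file exists, or the folder is missing
    prefix = path.name + "."
    return sorted(p for p in path.parent.iterdir()
                  if p.is_file() and p.name.startswith(prefix))
```

A non-empty result for a path that cannot be opened strongly suggests the extension was left off.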

jjbiggins commented 1 year ago

I see. I was curious about that filename.

To me, it obviously makes sense that it would raise the PDFPageCountError. However, the error message, "No error", seems undesirable.

For example, in your case, where the file doesn't exist because the extension was omitted, I would expect an error such as:

pdf2image.exceptions.PDFPageCountError: Unable to get page count. I/O Error: Couldn't open file 'hello.pdf': No such file or directory.

Now, I generated that on macOS with Python 3.11 and the most recent pdf2image code. I would have to defer to someone more knowledgeable, but, intuitively, it seems like saying "No error" when there clearly is one is an issue.

However, depending on your version of pdf2image, this may have been resolved.

Crispisu commented 1 year ago

Agreed, that "No error" message was very confusing for me as well, as I had no clue how to debug it. It would be great if an error message like the one you described were thrown.

pdf2image version: pdf2image 1.16.2

Thank you so much!

jjbiggins commented 1 year ago

After looking into this, I found that this message comes directly from the pdfinfo binary; thus, it depends on the version of pdfinfo being used.

For example, if you were using the pdfinfo binary from Xpdf-4.04, no message would be displayed. However, if using pdfinfo version 22.09.0 from poppler you get the more detailed output.

In both cases, pdfinfo uses an fopen() call to open the PDF, which fails with errno 2 (ENOENT, "No such file or directory").

Only in the poppler version is the errno's description, "No such file or directory", appended to pdfinfo's error message, and, consequently, available to be captured by stderr in pdf2image.
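The relationship between an errno value and its description can be reproduced from Python, which exposes the same C-level error codes. This is a minimal sketch with a hypothetical helper name, describe_open_failure; it says nothing about how either pdfinfo build formats its message.

```python
import errno
import os


def describe_open_failure(path):
    """Try to open `path`; return (errno, description) or None on success."""
    try:
        with open(path, "rb"):
            return None
    except OSError as exc:
        # exc.errno is the C-level errno; os.strerror maps it to its description.
        return exc.errno, os.strerror(exc.errno)
```

For a missing file, the first element equals errno.ENOENT, and the second is the platform's description of that code, which is the piece of text that only some pdfinfo builds append.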

There's not a great way to handle this.
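One partial workaround on the caller's side, offered as an assumption and not something the library does, is to check that the file exists before pdfinfo is ever invoked. In the sketch below, convert_checked is a hypothetical wrapper, and convert stands for whatever conversion function is in use (for example pdf2image's convert_from_path).

```python
import errno
import os


def convert_checked(pdf_path, convert):
    """Call `convert(pdf_path)` only after confirming the file exists."""
    if not os.path.isfile(pdf_path):
        # Raise the descriptive error that the pdfinfo output may lack.
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), pdf_path)
    return convert(pdf_path)
```

This only covers the missing-file case; other I/O failures would still surface with whatever message the installed pdfinfo provides.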