LeoFCardoso / pdf2pdfocr

A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF. Use only open source tools. Please tip!
Apache License 2.0
266 stars, 33 forks

join_ocred_pdf failing due to "cannot read an empty file" #34

Closed danirui closed 2 years ago

danirui commented 2 years ago

The error I get is "PyPDF2.errors.PdfReadError: Cannot read an empty file". I experimented with the first 2 pages of this pdf; individually the two pages OCR'ed fine (neither page was empty, and the OCR'ed text was not empty either), but when I tried to do the 2 pages together, it gave me

Traceback (most recent call last):
  File "/usr/local/bin/pdf2pdfocr.py", line 1526, in <module>
    pdf2ocr.ocr()
  File "/usr/local/bin/pdf2pdfocr.py", line 717, in ocr
    self.join_ocred_pdf()
  File "/usr/local/bin/pdf2pdfocr.py", line 952, in join_ocred_pdf
    pdf_merger.append(PyPDF2.PdfFileReader(text_pdf_file, strict=False))
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1856, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 277, in __init__
    self.read(stream)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1301, in read
    raise PdfReadError("Cannot read an empty file")
PyPDF2.errors.PdfReadError: Cannot read an empty file

Side note, on the successful runs it gave me the warnings

UserWarning: isString is deprecated and will be removed in PyPDF2 2.0.0. [_utils.py:76]
UserWarning: namedDestinations will be removed in PyPDF2 2.0.0. Use named_destinations instead. [_reader.py:519]
UserWarning: addMetadata is deprecated and will be removed in PyPDF2 2.0.0. Use add_metadata instead. [_writer.py:793]

LeoFCardoso commented 2 years ago

Thank you. What parameters did you use for pdf2pdfocr?

danirui commented 2 years ago

I'm not sure what you mean. I just did everything discussed in "Installation" (on a Mac) and then ran

docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -v -i ./sample_file.pdf

This produced

Unable to find image 'leofcardoso/pdf2pdfocr:latest' locally
latest: Pulling from leofcardoso/pdf2pdfocr
d5fd17ec1767: Pull complete
b108d4e24732: Pull complete
eb7093159f91: Pull complete
6110e8612067: Pull complete
9ccb3d8c19eb: Pull complete
610159715c64: Pull complete
4f4fb700ef54: Pull complete
550ba38ca3cf: Pull complete
6761f02c7165: Pull complete
Digest: sha256:106c81e6bf87599d9e9e10ae8a7d7a5db493110d230eb03cf14bdb3cdbae80b5
Status: Downloaded newer image for leofcardoso/pdf2pdfocr:latest
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested

Once I did this, I saw in the Docker app that "leofcardoso/pdf2pdfocr" with tag "latest" showed up in Images, and the program started running. In all my experiments the OCR seemed to work (the "[LOG] Waiting for OCR to complete." messages went through all the (non-blank?) pages), but the final join_ocred_pdf step failed for certain input PDFs.

LeoFCardoso commented 2 years ago

Please let me know if the latest commit fixes this issue.

danirui commented 2 years ago

There are definitely improvements: the medium-length files (2-4 pages) from my previous experiments now work, but the full PDFs (hundreds of pages) still fail with the same error.

LeoFCardoso commented 2 years ago

Can you please upload one of the failed PDFs? The first one worked well with this image.

danirui commented 2 years ago

My two pdfs are here and here. This is the output for the one I linked above

danielrui@Daniels-MacBook-Air OCRdocs % docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -v -i ./coxPrimes.pdf
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested

File: ./coxPrimes.pdf
[2022-07-02 03:01:08.375058] [DEBUG] Tesseract can 'textonly_pdf': True
[2022-07-02 03:01:08.411708] [DEBUG] Tesseract version: 4
[2022-07-02 03:01:08.603280] [DEBUG] Pdftoppm version: 0.86.1
[2022-07-02 03:01:08.649852] [DEBUG] Qpdf version: 9.1.1
[2022-07-02 03:01:08.650521] [DEBUG] Temp dir is /tmp/pdf2pdfocr_LT3DG/
[2022-07-02 03:01:08.650628] [DEBUG] Prefix is LT3DG
[2022-07-02 03:01:08.650789] [DEBUG] Script dir is /usr/local/bin/
[2022-07-02 03:01:08.651481] [DEBUG] Parallel operations will use 4 CPUs
[2022-07-02 03:01:08.681972] [LOG] Welcome to pdf2pdfocr version 1.11.2 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2022-07-02 03:01:08.922957] [LOG] Input file /home/docker/coxPrimes.pdf: type is application/pdf
[2022-07-02 03:01:10.253327] [DEBUG] User conversion params:
[2022-07-02 03:01:10.290914] [DEBUG] Output file: /home/docker/coxPrimes-OCR.pdf for PDF and /home/docker/coxPrimes-OCR.pdf.txt for TXT
[2022-07-02 03:01:10.306598] [LOG] Converting input file to images...
[2022-07-02 03:20:29.498341] [LOG] Checking blank pages
[2022-07-02 03:21:22.677630] [LOG] Starting OCR with tesseract...
[2022-07-02 03:21:27.274712] [LOG] Waiting for OCR to complete. 0/363 pages completed...

[2022-07-02 04:05:28.903960] [LOG] Waiting for OCR to complete. 357/363 pages completed...
[2022-07-02 04:05:30.629944] [LOG] OCR completed
[2022-07-02 04:05:30.668267] [DEBUG] We have 363 ocr'ed files
Traceback (most recent call last):
  File "/usr/local/bin/pdf2pdfocr.py", line 1526, in <module>
    pdf2ocr.ocr()
  File "/usr/local/bin/pdf2pdfocr.py", line 717, in ocr
    self.join_ocred_pdf()
  File "/usr/local/bin/pdf2pdfocr.py", line 952, in join_ocred_pdf
    pdf_merger.append(PyPDF2.PdfFileReader(text_pdf_file, strict=False))
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 239, in __init__
    self.read(stream)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 913, in read
    raise PdfReadError("Cannot read an empty file")
PyPDF2.errors.PdfReadError: Cannot read an empty file
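
The traceback shows PyPDF2 rejecting a zero-length input, so one way to narrow the problem down is to look for empty per-page outputs before they are merged. The following is a minimal sketch and not code from pdf2pdfocr: the helper name split_empty_files is hypothetical, and the assumption is that the per-page PDFs exist as ordinary files whose paths are known. It uses only the Python standard library.

```python
import os
from typing import Iterable, List, Tuple


def split_empty_files(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition paths into (usable, unusable), preserving input order.

    A zero-byte file is the case PyPDF2 reports as "Cannot read an empty
    file". Missing paths are grouped with the empty ones because they
    cannot be merged either. (Hypothetical helper, for illustration only.)
    """
    usable: List[str] = []
    unusable: List[str] = []
    for path in paths:
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            usable.append(path)
        else:
            unusable.append(path)
    return usable, unusable
```

If the intermediate files are still available, passing their paths to this function would show which pages produced empty output; any zero-byte path in the second list would raise the error above when opened with PyPDF2.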

LeoFCardoso commented 2 years ago

Hello @danirui, please try again with the latest Docker image. Both test files worked in my container. Hope to hear from you.

Thanks

danirui commented 2 years ago

Unfortunately I still get the same error. Maybe it is just that I have set up my machine extraordinarily poorly. EDIT SLIGHTLY LATER: I reset Docker with a higher memory and swap allocation, and when I tested with a ~40-page excerpt of the PDFs, the error went away! So I think my machine/Docker was running out of memory, and that caused the issue.

LeoFCardoso commented 2 years ago

Good news. What was your previous memory / CPU configuration?

danirui commented 2 years ago

The minimum possible, which was 1GB memory, 512MB swap, and 8GB disk image. I doubled each of these, and it worked.
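
To check how much memory a Linux container actually sees, /proc/meminfo can be read from inside it. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the 2 GiB default threshold only mirrors the doubled setting reported above; it is not a documented requirement of pdf2pdfocr.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}.

    Size fields are reported by the kernel in kB; counter fields such as
    HugePages_Total have no unit. Values are returned as given.
    """
    values: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_looks_tight(meminfo: Dict[str, int],
                       min_total_kb: int = 2 * 1024 * 1024) -> bool:
    """Return True when MemTotal is below the illustrative threshold.

    The 2 GiB default is an assumption taken from this thread, not a
    value published by the project.
    """
    return meminfo.get("MemTotal", 0) < min_total_kb
```

Inside the container, the text could come from reading the file /proc/meminfo and passing its contents to parse_meminfo.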

LeoFCardoso commented 2 years ago

Ok. I'll close this issue. Thank you!