Closed danirui closed 2 years ago
Thank you. What parameters did you use for pdf2pdfocr?
I'm not sure what you mean. I just did everything discussed in "Installation" (on a Mac) and then ran "docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -v -i ./sample_file.pdf". This produced
Unable to find image 'leofcardoso/pdf2pdfocr:latest' locally
latest: Pulling from leofcardoso/pdf2pdfocr
d5fd17ec1767: Pull complete
b108d4e24732: Pull complete
eb7093159f91: Pull complete
6110e8612067: Pull complete
9ccb3d8c19eb: Pull complete
610159715c64: Pull complete
4f4fb700ef54: Pull complete
550ba38ca3cf: Pull complete
6761f02c7165: Pull complete
Digest: sha256:106c81e6bf87599d9e9e10ae8a7d7a5db493110d230eb03cf14bdb3cdbae80b5
Status: Downloaded newer image for leofcardoso/pdf2pdfocr:latest
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
Once I did this, I saw in the Docker app that "leofcardoso/pdf2pdfocr" with tag "latest" showed up in Images, and the program started running. In all my experiments the OCR itself seemed to work (the "[LOG] Waiting for OCR to complete." messages went through all the (non-blank?) pages); only the final join_ocred_pdf step seemed to fail, and only for certain input PDFs.
Please let me know if the latest commit fixes this issue.
There are definitely improvements; my previous experiments with medium length files (2-4 pages) worked, but the full pdfs (100s of pages) did not work (same error).
Can you please upload one of the failed PDFs? The first one worked well with this image.
My two pdfs are here and here. This is the output for the one I linked above
danielrui@Daniels-MacBook-Air OCRdocs % docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -v -i ./coxPrimes.pdf
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
File: ./coxPrimes.pdf
[2022-07-02 03:01:08.375058] [DEBUG] Tesseract can 'textonly_pdf': True
[2022-07-02 03:01:08.411708] [DEBUG] Tesseract version: 4
[2022-07-02 03:01:08.603280] [DEBUG] Pdftoppm version: 0.86.1
[2022-07-02 03:01:08.649852] [DEBUG] Qpdf version: 9.1.1
[2022-07-02 03:01:08.650521] [DEBUG] Temp dir is /tmp/pdf2pdfocr_LT3DG/
[2022-07-02 03:01:08.650628] [DEBUG] Prefix is LT3DG
[2022-07-02 03:01:08.650789] [DEBUG] Script dir is /usr/local/bin/
[2022-07-02 03:01:08.651481] [DEBUG] Parallel operations will use 4 CPUs
[2022-07-02 03:01:08.681972] [LOG] Welcome to pdf2pdfocr version 1.11.2 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2022-07-02 03:01:08.922957] [LOG] Input file /home/docker/coxPrimes.pdf: type is application/pdf
[2022-07-02 03:01:10.253327] [DEBUG] User conversion params:
[2022-07-02 03:01:10.290914] [DEBUG] Output file: /home/docker/coxPrimes-OCR.pdf for PDF and /home/docker/coxPrimes-OCR.pdf.txt for TXT
[2022-07-02 03:01:10.306598] [LOG] Converting input file to images...
[2022-07-02 03:20:29.498341] [LOG] Checking blank pages
[2022-07-02 03:21:22.677630] [LOG] Starting OCR with tesseract...
[2022-07-02 03:21:27.274712] [LOG] Waiting for OCR to complete. 0/363 pages completed...
[2022-07-02 04:05:28.903960] [LOG] Waiting for OCR to complete. 357/363 pages completed...
[2022-07-02 04:05:30.629944] [LOG] OCR completed
[2022-07-02 04:05:30.668267] [DEBUG] We have 363 ocr'ed files
Traceback (most recent call last):
  File "/usr/local/bin/pdf2pdfocr.py", line 1526, in <module>
    pdf2ocr.ocr()
  File "/usr/local/bin/pdf2pdfocr.py", line 717, in ocr
    self.join_ocred_pdf()
  File "/usr/local/bin/pdf2pdfocr.py", line 952, in join_ocred_pdf
    pdf_merger.append(PyPDF2.PdfFileReader(text_pdf_file, strict=False))
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 239, in __init__
    self.read(stream)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 913, in read
    raise PdfReadError("Cannot read an empty file")
PyPDF2.errors.PdfReadError: Cannot read an empty file
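The traceback suggests that at least one of the per-page text PDFs handed to the merge step was zero bytes long. A rough way to find out which page is affected would be to check file sizes before merging. The sketch below is only an illustration of that idea using the standard library; `find_empty_files` and the file names are hypothetical and are not part of pdf2pdfocr.

```python
import os
import tempfile


def find_empty_files(paths):
    """Return the paths that are missing or zero bytes long, in input order."""
    return [p for p in paths if not os.path.isfile(p) or os.path.getsize(p) == 0]


if __name__ == "__main__":
    # Throwaway files standing in for per-page OCR output (names are made up).
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "page-001.pdf")
        bad = os.path.join(tmp, "page-002.pdf")
        with open(good, "wb") as fh:
            fh.write(b"%PDF-1.4\n")   # non-empty placeholder content
        open(bad, "wb").close()        # zero-byte file, like a failed OCR page
        print(find_empty_files([good, bad]))
```

Any path reported by such a check would point at a page whose OCR output was never written, which is where the merge would otherwise stop with "Cannot read an empty file".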
Hello @danirui Please try again with the latest docker image. Both test files worked in my container. Hope to hear from you.
Thanks
Unfortunately I still get the same error. Maybe I have just set up my machine extraordinarily poorly. EDIT SLIGHTLY LATER: I reset Docker with a higher memory and swap resource allocation, and when I tested with a ~40-page excerpt of the PDFs, the error went away! So I think my machine/Docker was running out of memory, and that caused the failure.
Good news. What was your previous memory / CPU configuration?
The minimum possible, which was 1GB memory, 512MB swap, and 8GB disk image. I doubled each of these, and it worked.
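For anyone who wants to confirm how much memory and swap a Linux container actually sees, one option is to read `/proc/meminfo` from inside it. The sketch below assumes the usual `Name: value kB` layout of that file; the helper names `parse_meminfo` and `mem_and_swap_mib` are made up for this example, and I am assuming (not claiming) that under Docker Desktop these numbers reflect the resource allocation chosen in its settings.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer_value}.

    Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def mem_and_swap_mib(text):
    """Return (MemTotal, SwapTotal) in MiB, or 0 for a missing field."""
    info = parse_meminfo(text)
    return info.get("MemTotal", 0) // 1024, info.get("SwapTotal", 0) // 1024


if __name__ == "__main__":
    path = "/proc/meminfo"
    if os.path.exists(path):
        with open(path) as fh:
            print("MemTotal/SwapTotal in MiB:", mem_and_swap_mib(fh.read()))
    else:
        print("No /proc/meminfo here (not a Linux system)")
```

With the minimum settings mentioned above (1GB memory, 512MB swap), the values reported would be expected to sit at or a little below 1024 and 512 MiB.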
Ok. I'll close this issue. Thank you!
The error I get is "PyPDF2.errors.PdfReadError: Cannot read an empty file". I experimented with the first 2 pages of this pdf; individually the two pages OCR'ed fine (neither page was empty, and the OCR'ed text was not empty either), but when I tried to do the 2 pages together, it gave me
Side note, on the successful runs it gave me the warnings