LeoFCardoso / pdf2pdfocr

A free tool to OCR a PDF and add a text "layer" to the original file, making a searchable PDF. Uses only open source tools. Please tip!
Apache License 2.0

No PDF files generated after OCR. This is not expected. Aborting. #4

Closed by ericmoret 6 years ago

ericmoret commented 6 years ago

Something seems to be wrong. I am running macOS 10.13.6 with a fresh MacPorts installation.

[2018-07-21 12:18:53.392458] [LOG]      Input file /Users/emoret/Downloads/01-19-2017.pdf: type is application/pdf
PdfReadWarning: Multiple definitions in dictionary at byte 0xa9769 for key /Outlines [generic.py:588]
[2018-07-21 12:18:53.400228] [DEBUG]    Output file: /Users/emoret/Downloads/01-19-2017-OCR.pdf for PDF and /Users/emoret/Downloads/01-19-2017-OCR.pdf.txt for TXT
[2018-07-21 12:18:53.400349] [LOG]      Converting input file to images...
[2018-07-21 12:18:54.256488] [LOG]      Starting OCR...
[2018-07-21 12:18:54.268721] [LOG]      Waiting for OCR to complete. 0/5 pages completed...
[2018-07-21 12:18:59.271505] [LOG]      OCR completed
[2018-07-21 12:18:59.273427] [DEBUG]    We have 0 ocr'ed files
No PDF files generated after OCR. This is not expected. Aborting.

LeoFCardoso commented 6 years ago

Hi there. Which Python version are you using? Can I have access to this PDF file?

ericmoret commented 6 years ago

S.pdf

$ python3 --version
Python 3.4.8

LeoFCardoso commented 6 years ago

Thanks. I’m on vacation now. I will check when I am back home. Please wait about 10 days...

LeoFCardoso commented 6 years ago

Hi Eric. I was able to process your PDF file on my installation. Can you please post your complete command line?

ericmoret commented 6 years ago

$ pdf2pdfocr.py -i S.pdf -l fra -v
[2018-07-31 11:20:23.557974] [DEBUG]    Temp dir is /var/folders/5j/_5d9d6r50mq6cstkm_sq_ptwp58nn5/T/
[2018-07-31 11:20:23.558115] [DEBUG]    Prefix is H82ML
[2018-07-31 11:20:23.558165] [DEBUG]    Script dir is /Users/emoret/bin/
[2018-07-31 11:20:23.558221] [DEBUG]    Parallel operations will use 4 CPUs
[2018-07-31 11:20:23.558324] [LOG]      Welcome to pdf2pdfocr version 1.2.3
[2018-07-31 11:20:23.564893] [LOG]      Input file /Users/emoret/Downloads/S.pdf: type is application/pdf
[2018-07-31 11:20:23.568274] [DEBUG]    Output file: /Users/emoret/Downloads/S-OCR.pdf for PDF and /Users/emoret/Downloads/S-OCR.pdf.txt for TXT
[2018-07-31 11:20:23.568386] [LOG]      Converting input file to images...
[2018-07-31 11:20:25.147699] [LOG]      Starting OCR...
[2018-07-31 11:20:25.161726] [LOG]      Waiting for OCR to complete. 0/1 pages completed...
[2018-07-31 11:20:30.165266] [LOG]      OCR completed
[2018-07-31 11:20:30.166331] [DEBUG]    We have 0 ocr'ed files
No PDF files generated after OCR. This is not expected. Aborting.

LeoFCardoso commented 6 years ago

It looks like tesseract did not run successfully. Is French (-l fra) correctly installed? Can you run with "-k" and post a zip file with the generated temp files whose names contain the "prefix" (H82ML in your last execution)?
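
A minimal sketch of how the kept temp files could be collected into a zip, assuming the temp dir and prefix are the ones printed in the debug log. The function name `zip_temp_files` and the output path are hypothetical and are not part of pdf2pdfocr.

```python
import zipfile
from pathlib import Path


def zip_temp_files(temp_dir, prefix, zip_path):
    """Zip every regular file in temp_dir whose name contains prefix.

    Hypothetical helper for illustration; not part of pdf2pdfocr.
    Returns the sorted list of file names that were added.
    """
    temp_dir = Path(temp_dir)
    # Collect the matches before creating the archive, so the zip
    # never tries to include itself.
    matches = sorted(
        p for p in temp_dir.iterdir() if p.is_file() and prefix in p.name
    )
    with zipfile.ZipFile(str(zip_path), "w", zipfile.ZIP_DEFLATED) as zf:
        for p in matches:
            zf.write(str(p), arcname=p.name)
    return [p.name for p in matches]


# Example call (values taken from a "-v -k" run; adjust to your own log):
# zip_temp_files("/var/folders/.../T/", "H82ML", "H82ML-temp.zip")
```

The temp dir in the example call is deliberately left elided; use the full path shown on the "Temp dir is" line of your own run.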

ericmoret commented 6 years ago

CIIXL-1.zip

$ port installed | grep tesseract-fra
  tesseract-fra @3.04_1 (active)
SJCMAC41CBFVH8:Downloads emoret$ pdf2pdfocr.py -i S.pdf -l fra -v -k
[2018-07-31 11:40:04.273624] [DEBUG]    Temp dir is /var/folders/5j/_5d9d6r50mq6cstkm_sq_ptwp58nn5/T/
[2018-07-31 11:40:04.273757] [DEBUG]    Prefix is CIIXL
[2018-07-31 11:40:04.273806] [DEBUG]    Script dir is /Users/emoret/bin/
[2018-07-31 11:40:04.273873] [DEBUG]    Parallel operations will use 4 CPUs
[2018-07-31 11:40:04.273937] [LOG]      Welcome to pdf2pdfocr version 1.2.3
[2018-07-31 11:40:04.280946] [LOG]      Input file /Users/emoret/Downloads/S.pdf: type is application/pdf
[2018-07-31 11:40:04.283982] [DEBUG]    Output file: /Users/emoret/Downloads/S-OCR.pdf for PDF and /Users/emoret/Downloads/S-OCR.pdf.txt for TXT
[2018-07-31 11:40:04.284095] [LOG]      Converting input file to images...
[2018-07-31 11:40:05.873390] [LOG]      Starting OCR...
[2018-07-31 11:40:05.883770] [LOG]      Waiting for OCR to complete. 0/1 pages completed...
[2018-07-31 11:40:10.888903] [LOG]      OCR completed
[2018-07-31 11:40:10.890272] [DEBUG]    We have 0 ocr'ed files
No PDF files generated after OCR. This is not expected. Aborting.
Temporary files kept in /var/folders/5j/_5d9d6r50mq6cstkm_sq_ptwp58nn5/T/

LeoFCardoso commented 6 years ago

What does tesseract --list-langs show? Are there any "tess_err*" files kept in your temp folder? Can you post their contents?
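
A rough sketch of the language check, assuming `tesseract` is on the PATH and that `--list-langs` prints a "List of available languages" header followed by one language code per line. The helper names `parse_langs` and `installed_langs` are hypothetical.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output.

    Assumes a "List of available languages (N):" header line followed
    by one language code per line; blank lines are ignored.
    """
    langs = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def installed_langs(tesseract="tesseract"):
    """Run tesseract and return the languages it reports.

    stderr is merged into stdout because the stream used for the list
    may differ between tesseract versions.
    """
    output = subprocess.check_output(
        [tesseract, "--list-langs"],
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_langs(output)


# Example: check that French is available before running with "-l fra".
# print("fra" in installed_langs())
```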

ericmoret commented 6 years ago

$ cat tess_err_CIIXL-1.log 
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Can not open file "/opt/local/share/tessdata//pdf.ttf"!
Error during processing.
SJCMAC41CBFVH8:T emoret$ ls /opt/local/share/tessdata/
eng.traineddata  fra.traineddata  osd.traineddata  pol.traineddata  por.traineddata

LeoFCardoso commented 6 years ago

The language is OK. It looks like tesseract was upgraded to 3.05 in MacPorts; my installation still uses 3.04. There is a bug in tesseract 3.05 with MacPorts (https://trac.macports.org/ticket/56226). :( Can you try copying "pdf.ttf" to the correct folder manually?
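
A minimal sketch of the manual workaround, assuming the tessdata folder is the one named in the error message (/opt/local/share/tessdata) and that you already have a copy of "pdf.ttf" somewhere. Where that copy comes from is not stated in this thread, so `source_font` is a placeholder, and `ensure_pdf_ttf` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def ensure_pdf_ttf(tessdata_dir, source_font=None):
    """Make sure pdf.ttf exists in tessdata_dir.

    Returns "present" if the file is already there, "copied" if it was
    copied from source_font, and "missing" if it is absent and no
    usable source_font was given. Hypothetical helper for illustration.
    """
    target = Path(tessdata_dir) / "pdf.ttf"
    if target.is_file():
        return "present"
    if source_font is None or not Path(source_font).is_file():
        return "missing"
    shutil.copy(str(source_font), str(target))
    return "copied"


# Example (the source path is a placeholder, not a known location):
# print(ensure_pdf_ttf("/opt/local/share/tessdata", "/path/to/pdf.ttf"))
```

Writing into /opt/local/share/tessdata will probably require elevated permissions on a MacPorts installation.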

ericmoret commented 6 years ago

Thank you for your help. I reverted to tesseract 3.04 and it now works again. I followed these instructions: https://trac.macports.org/wiki/howto/InstallingOlderPort

LeoFCardoso commented 6 years ago

Great! I will keep this issue open to track the MacPorts bug with tesseract 3.05!

ericmoret commented 6 years ago

Fixed upstream