Closed (ericmoret closed this issue 6 years ago)
Hi there. Which Python version are you using? Can I have access to this PDF file?
Thanks. I'm on vacation now. I will check when I am back home. Please wait about 10 days...
Hi Eric. I could process your PDF file in my installation. Can you please post your complete command line?
$ pdf2pdfocr.py -i S.pdf -l fra -v
[2018-07-31 11:20:23.557974] [DEBUG] Temp dir is /var/folders/5j/_5d9d6r50mq6cstkm_sq_ptwp58nn5/T/
[2018-07-31 11:20:23.558115] [DEBUG] Prefix is H82ML
[2018-07-31 11:20:23.558165] [DEBUG] Script dir is /Users/emoret/bin/
[2018-07-31 11:20:23.558221] [DEBUG] Parallel operations will use 4 CPUs
[2018-07-31 11:20:23.558324] [LOG] Welcome to pdf2pdfocr version 1.2.3
[2018-07-31 11:20:23.564893] [LOG] Input file /Users/emoret/Downloads/S.pdf: type is application/pdf
[2018-07-31 11:20:23.568274] [DEBUG] Output file: /Users/emoret/Downloads/S-OCR.pdf for PDF and /Users/emoret/Downloads/S-OCR.pdf.txt for TXT
[2018-07-31 11:20:23.568386] [LOG] Converting input file to images...
[2018-07-31 11:20:25.147699] [LOG] Starting OCR...
[2018-07-31 11:20:25.161726] [LOG] Waiting for OCR to complete. 0/1 pages completed...
[2018-07-31 11:20:30.165266] [LOG] OCR completed
[2018-07-31 11:20:30.166331] [DEBUG] We have 0 ocr'ed files
No PDF files generated after OCR. This is not expected. Aborting.
It looks like tesseract did not run successfully. Is French (-l fra) correctly installed? Can you run with "-k" and post a zip file of the generated temp files whose names contain the "prefix" of that run (it was H82ML in your last execution)?
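Gathering those temp files can be scripted. The following is only a sketch, assuming the files kept by "-k" sit directly in the temp dir printed in the debug log and carry the run prefix somewhere in their names. The helper name zip_temp_files and the default archive name are made up for illustration and are not part of pdf2pdfocr.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_temp_files(prefix, temp_dir=None, out_zip=None):
    """Zip every regular file in temp_dir whose name contains prefix.

    Hypothetical helper, not part of pdf2pdfocr.
    Returns the sorted list of archived file names.
    """
    # tempfile.gettempdir() honours TMPDIR, which on macOS normally points
    # at the /var/folders/.../T/ directory shown in the debug output.
    temp_dir = Path(temp_dir or tempfile.gettempdir())
    out_zip = Path(out_zip or f"pdf2pdfocr-{prefix}.zip")

    # Collect matches before creating the archive, and never archive the
    # archive itself if an older copy already sits in the same directory.
    matches = sorted(
        p for p in temp_dir.iterdir()
        if p.is_file() and prefix in p.name and p.resolve() != out_zip.resolve()
    )

    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in matches:
            archive.write(path, arcname=path.name)
    return [p.name for p in matches]
```

For example, zip_temp_files("H82ML") would write pdf2pdfocr-H82ML.zip to the current directory. The prefix changes on every run, so use the one printed by the run that used "-k".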
$ port installed | grep tesseract-fra
tesseract-fra @3.04_1 (active)
SJCMAC41CBFVH8:Downloads emoret$ pdf2pdfocr.py -i S.pdf -l fra -v -k
[2018-07-31 11:40:04.273624] [DEBUG] Temp dir is /var/folders/5j/_5d9d6r50mq6cstkm_sq_ptwp58nn5/T/
[2018-07-31 11:40:04.273757] [DEBUG] Prefix is CIIXL
[2018-07-31 11:40:04.273806] [DEBUG] Script dir is /Users/emoret/bin/
[2018-07-31 11:40:04.273873] [DEBUG] Parallel operations will use 4 CPUs
[2018-07-31 11:40:04.273937] [LOG] Welcome to pdf2pdfocr version 1.2.3
[2018-07-31 11:40:04.280946] [LOG] Input file /Users/emoret/Downloads/S.pdf: type is application/pdf
[2018-07-31 11:40:04.283982] [DEBUG] Output file: /Users/emoret/Downloads/S-OCR.pdf for PDF and /Users/emoret/Downloads/S-OCR.pdf.txt for TXT
[2018-07-31 11:40:04.284095] [LOG] Converting input file to images...
[2018-07-31 11:40:05.873390] [LOG] Starting OCR...
[2018-07-31 11:40:05.883770] [LOG] Waiting for OCR to complete. 0/1 pages completed...
[2018-07-31 11:40:10.888903] [LOG] OCR completed
[2018-07-31 11:40:10.890272] [DEBUG] We have 0 ocr'ed files
No PDF files generated after OCR. This is not expected. Aborting.
Temporary files kept in /var/folders/5j/_5d9d6r50mq6cstkm_sq_ptwp58nn5/T/
What does "tesseract --list-langs" print? Are there any "tess_err*" files kept in your temp folder? Can you post their contents?
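The language check can also be done from Python. This is a rough sketch, assuming that "tesseract --list-langs" prints a header line starting with "List of available languages" followed by one language code per line. Stdout and stderr are merged because, as far as I know, some tesseract versions write the list to stderr. The function names are hypothetical.

```python
import subprocess


def parse_list_langs(text):
    """Return the language codes found in `tesseract --list-langs` output."""
    langs = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the assumed header line.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def installed_langs(tesseract_cmd="tesseract"):
    """Run tesseract and return its reported languages.

    Because stderr is merged into stdout, any warning tesseract prints
    would also end up in the result; treat the list as a hint only.
    """
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_list_langs(result.stdout)
```

A check such as "fra" in installed_langs() then tells whether the language passed with -l is visible to tesseract. It does not prove that OCR itself can run.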
$ cat tess_err_CIIXL-1.log
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Can not open file "/opt/local/share/tessdata//pdf.ttf"!
Error during processing.
SJCMAC41CBFVH8:T emoret$ ls /opt/local/share/tessdata/
eng.traineddata fra.traineddata osd.traineddata pol.traineddata por.traineddata
The language is OK. It looks like tesseract was upgraded to 3.05 in MacPorts; my installation still uses 3.04. There is a bug in tesseract 3.05 with MacPorts (https://trac.macports.org/ticket/56226). :( Can you try copying "pdf.ttf" to the correct folder (/opt/local/share/tessdata/, according to your error log) manually?
Thank you for your help. I reverted to tesseract 3.04 and it works again. I followed these instructions: https://trac.macports.org/wiki/howto/InstallingOlderPort
Great! I will keep this issue open to track the MacPorts bug with tesseract 3.05!
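The manual workaround amounts to making sure "pdf.ttf" sits next to the *.traineddata files. Here is a minimal sketch of that check, assuming the tessdata folder is the one named in the error log. The function name ensure_pdf_font is hypothetical, and font_source is a placeholder: this thread does not say where a copy of pdf.ttf can be found on a given system.

```python
import shutil
from pathlib import Path


def ensure_pdf_font(tessdata_dir, font_source=None):
    """Return True if pdf.ttf is present in tessdata_dir after the call.

    If the file is missing and font_source points to an existing file,
    that file is copied into tessdata_dir as pdf.ttf first.
    Hypothetical helper, not part of pdf2pdfocr or tesseract.
    """
    target = Path(tessdata_dir) / "pdf.ttf"
    if target.is_file():
        return True
    if font_source is not None and Path(font_source).is_file():
        # Writing under /opt/local usually needs elevated privileges.
        shutil.copy2(font_source, target)
        return True
    return False
```

Calling ensure_pdf_font("/opt/local/share/tessdata") with no source only reports whether the file is there, which is a quick way to confirm the cause before copying anything.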
Fixed upstream
Something seems to be wrong. I am running macOS 10.13.6 with a fresh MacPorts installation.