Closed by sch82812121 10 years ago
In my case, an update from Ubuntu 10.04 to 12.04 (and manually installing python-reportlab) has partially resolved the issue. Maybe one of my tools was too old...
Now it works for some of my scans; for others it exits with verbose output (see example below):
Could not create PDF file from "./tmp/20140103_1521.filename.beleg0053/0002.hocr". Exiting...
Traceback (most recent call last):
File "./src/hocrTransform.py", line 181, in
...
Could you please share one of the input PDFs that fails and put the link in the problem report? If you could share the content of the tmp folder too, it would be great. Which version of tesseract are you using now? (I will change the script to echo the versions of the tools in debug mode...)
Thanks a lot for your support. After the upgrade, "tesseract -v" reports "3.02".
I have created a debug package at: https://www.wetransfer.com/downloads/0d66de043578679abbcf9de46bf8bb2b20140104132235/eeadfcdaa0067431af5a7178ade7017220140104132235/e42a5c
The package contains an OCR of a file that succeeds (beleg 63) and a file that fails (beleg 66) with all log and tmp files.
Feel free to email me if you need more info.
Hi,
What I did:
Analyzing your tesseract output, I found that the file contains invalid characters (mainly STX characters). This seems to be a bug in tesseract v3.02 (I am running v3.02.02 and it works fine). Is it possible for you to get v3.02.02 for your OS?
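A quick way to confirm this kind of problem on a failing hOCR file is to scan it for control characters. The sketch below is only an illustration, not part of OCRmyPDF: it assumes the hOCR file is later parsed as XML, and the function names are hypothetical. It flags the C0 control characters that XML 1.0 does not allow (everything below 0x20 except tab, LF and CR; STX is 0x02).

```python
import re

# C0 control characters that XML 1.0 forbids (tab, LF and CR are allowed).
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_invalid_chars(text):
    """Return (line, column, codepoint) for each forbidden control character."""
    hits = []
    # Split on "\n" only: str.splitlines() would also split on some of the
    # control characters we are looking for.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in INVALID_XML_CHARS.finditer(line):
            hits.append((line_no, match.start() + 1, ord(match.group())))
    return hits


def strip_invalid_chars(text):
    """Return the text with the forbidden control characters removed."""
    return INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    sample = "<span>Be\x02leg</span>\n<span>ok</span>"
    for line_no, col, code in find_invalid_chars(sample):
        print("line %d, column %d: U+%04X" % (line_no, col, code))
```

Stripping the characters is only a workaround for inspecting the file; the fix discussed in this thread is a newer tesseract.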
I downloaded and compiled Tesseract 3.02.02 (Ubuntu) and it resolved the issue. Thanks a lot! Maybe your dependency checker should issue a warning if Tesseract 3.02 is installed...
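Such a check could look roughly like the sketch below. It is an assumption about how a warning might be implemented, not the project's actual dependency checker; the helper names and the minimum version constant are hypothetical. The only facts it takes from this thread are the two version strings "3.02" (fails) and "3.02.02" (works).

```python
import re

# Hypothetical threshold: the first version reported as working in this thread.
MIN_TESSERACT = (3, 2, 2)  # i.e. 3.02.02


def parse_version(text):
    """Extract a (major, minor, patch) tuple from a string such as '3.02'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    # A missing patch component (as in '3.02') is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def needs_warning(version_text):
    """Return True if the reported version is older than MIN_TESSERACT."""
    return parse_version(version_text) < MIN_TESSERACT


if __name__ == "__main__":
    for reported in ("3.02", "3.02.02"):
        status = "warn" if needs_warning(reported) else "ok"
        print("%s: %s" % (reported, status))
```

The version string is passed in as plain text so that the caller decides how to obtain it, for example from the output of "tesseract -v".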
(btw: If you accept donations, you can email me your SEPA/BIC at ocrmypdf.20.mat77@spamgourmet.org)
Thanks a lot for the instant resolution of Issue #37! I pulled the latest release and was able to proceed one step further.
Unfortunately, I am still stuck OCR-ing the PDFs created by my HP 8500A... I created a simple 1-page scan (tried both "600dpi; compressed image within PDF" and "600dpi; uncompressed image within PDF"). In both cases OCRmyPDF failed to execute.
If you would like to investigate, I could provide the PDF; my anti-spam email is ocrmypdf.10.mat77@spamgourmet.org
---8<---
ocrmypdf -vvvg beleg0062.pdf beleg0062-ocrmypdf.pdf
OCRmyPDF version: v2.x
Checking if all dependencies are installed
Creating temporary folder: "./tmp/20140101_1157.filename.beleg0062"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 842x595 (h_w in pt)
Page 0001: Size 6960x5088 (h_w pixel)
Page 0001: Embedded image resolution is 616 dpi
Page 0001: (x/y) resolution mismatch (615.69075/595.15439). Difference should be less than 1.74162. Taking biggest value
Page 0001: Extracting image as ppm file (616 dpi)
Page 0001: Performing OCR
Could not OCR file "./tmp/20140101_1157.filename.beleg0062/0001.cleaned.ppm". Exiting...
parallel: Starting no more jobs. Waiting for 1 jobs to finish. This job failed:
./src/ocrPage.sh /home/samba-shares/family/scans/beleg0062.pdf 0001\ 595\ 842 0001 ./tmp/20140101_1157.filename.beleg0062 3 eng 1 0 0 0 0 1 ''
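For readers wondering where the two resolution values in the mismatch message come from: they can be reproduced from the page sizes in the log, assuming the usual 72 points per inch. The sketch below only redoes that arithmetic; the function name is hypothetical, and it makes no claim about how the script derives the 1.74162 threshold or how it rounds to 616 dpi.

```python
POINTS_PER_INCH = 72.0


def dpi(pixels, points):
    """Resolution implied by a pixel extent covering a page extent in points."""
    return pixels * POINTS_PER_INCH / points


if __name__ == "__main__":
    # Sizes taken from the log: 842x595 pt and 6960x5088 pixel.
    res_long_side = dpi(6960, 842)
    res_short_side = dpi(5088, 595)
    print("long side:  %.5f dpi" % res_long_side)
    print("short side: %.5f dpi" % res_short_side)
    # "Taking biggest value": the larger of the two is the one the log reports.
    print("larger:     %.5f dpi" % max(res_long_side, res_short_side))
```

The two results agree with the 595.15439 and 615.69075 shown in the log, and the larger one is consistent with the reported 616 dpi.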