fritz-hh / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

OCR fails: Could not OCR file "...". Exiting... #39

Closed sch82812121 closed 10 years ago

sch82812121 commented 10 years ago

Thanks a lot for instant resolution of Issue #37! I pulled the latest release and was able to proceed 1 step further.

Unfortunately, I am still stuck OCR-ing the PDFs created by my HP 8500A. I created a simple 1-page scan and tried both "600dpi; compressed image within PDF" and "600dpi; uncompressed image within PDF". In both cases OCRmyPDF failed to execute.

If you would like to investigate, I could provide the PDF; my anti-spam email is ocrmypdf.10.mat77@spamgourmet.org

---8<---

ocrmypdf -vvvg beleg0062.pdf beleg0062-ocrmypdf.pdf

OCRmyPDF version: v2.x
Checking if all dependencies are installed
Creating temporary folder: "./tmp/20140101_1157.filename.beleg0062"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 842x595 (h_w in pt)
Page 0001: Size 6960x5088 (h_w pixel)
Page 0001: Embedded image resolution is 616 dpi
Page 0001: (x/y) resolution mismatch (615.69075/595.15439). Difference should be less than 1.74162. Taking biggest value
Page 0001: Extracting image as ppm file (616 dpi)
Page 0001: Performing OCR
Could not OCR file "./tmp/20140101_1157.filename.beleg0062/0001.cleaned.ppm". Exiting...
parallel: Starting no more jobs. Waiting for 1 jobs to finish. This job failed:
./src/ocrPage.sh /home/samba-shares/family/scans/beleg0062.pdf 0001\ 595\ 842 0001 ./tmp/20140101_1157.filename.beleg0062 3 eng 1 0 0 0 0 1 ''
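The resolution figures in this log can be reproduced from the reported page sizes, assuming the usual 72 pt per inch. The sketch below is only an illustration of that arithmetic, not the code in ocrPage.sh; the function names are hypothetical, and the tolerance is passed in as-is because the log does not say how 1.74162 is derived.

```python
POINTS_PER_INCH = 72.0  # PDF user-space unit: 1 pt = 1/72 inch


def embedded_dpi(pixels, points):
    """Resolution of an image spanning `points` on the page with `pixels` samples."""
    return pixels * POINTS_PER_INCH / points


def page_resolution(height_px, width_px, height_pt, width_pt, tolerance):
    """Return (x_dpi, y_dpi, chosen_dpi, mismatch).

    Mirrors the behaviour shown in the log: when the two axes disagree
    by at least `tolerance`, the larger value is taken.
    """
    x_dpi = embedded_dpi(width_px, width_pt)
    y_dpi = embedded_dpi(height_px, height_pt)
    mismatch = abs(x_dpi - y_dpi) >= tolerance
    return x_dpi, y_dpi, max(x_dpi, y_dpi), mismatch


# Values taken from the log above: 842x595 pt and 6960x5088 px (height x width).
x_dpi, y_dpi, chosen, mismatch = page_resolution(6960, 5088, 842, 595, 1.74162)
print(round(x_dpi, 5), round(y_dpi, 5), round(chosen), mismatch)
```

With the sizes from the log, the horizontal value comes out near 615.69 dpi and the vertical one near 595.15 dpi, so the larger value rounds to the 616 dpi that the script reports.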

sch82812121 commented 10 years ago

In my case, an update from Ubuntu 10.04 to 12.04 (and manually installing python-reportlab) has partially resolved the issue. Maybe one of my tools was too old...

Now for some of my scans it works, for others it exits with verbose output (see example below)...

OCRmyPDF.sh beleg0053.pdf beleg0053-ocr.pdf

Could not create PDF file from "./tmp/20140103_1521.filename.beleg0053/0002.hocr". Exiting...
Traceback (most recent call last):
  File "./src/hocrTransform.py", line 181, in <module>
    hocr = hocrTransform(args.hocrfile, args.resolution)
  File "./src/hocrTransform.py", line 26, in __init__
    self.hocr.parse(hocrFileName)
  File "lxml.etree.pyx", line 1722, in lxml.etree._ElementTree.parse (src/lxml/lxml.etree.c:44643)
  File "parser.pxi", line 1533, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82287)
  File "parser.pxi", line 1562, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82580)
  File "parser.pxi", line 1462, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81619)
  File "parser.pxi", line 1002, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78528)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
  File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74696)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 16, line 219, column 143
parallel: Starting no more jobs. Waiting for 2 jobs to finish. This job failed:
./src/ocrPage.sh /home/samba-shares/family/scans/beleg0053.pdf 0002\ 595\ 842 0004 ./tmp/20140103_1521.filename.beleg0053 0 eng 0 0 0 0 0 0 ''
Could not create PDF file from "./tmp/20140103_1521.filename.beleg0053/0003.hocr". Exiting...
Traceback (most recent call last):
  File "./src/hocrTransform.py", line 181, in <module>
    hocr = hocrTransform(args.hocrfile, args.resolution)
  File "./src/hocrTransform.py", line 26, in __init__
    self.hocr.parse(hocrFileName)
  File "lxml.etree.pyx", line 1722, in lxml.etree._ElementTree.parse (src/lxml/lxml.etree.c:44643)
  File "parser.pxi", line 1533, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82287)
  File "parser.pxi", line 1562, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82580)
  File "parser.pxi", line 1462, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81619)
  File "parser.pxi", line 1002, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78528)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
  File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74696)
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding ! Bytes: 0xC0 0x2A 0x47 0x03, line 102, column 239
parallel: Starting no more jobs. Waiting for 1 jobs to finish. This job failed:
./src/ocrPage.sh /home/samba-shares/family/scans/beleg0053.pdf 0003\ 595\ 842 0004 ./tmp/20140103_1521.filename.beleg0053 0 eng 0 0 0 0 0 0 ''
#

...

fritz-hh commented 10 years ago

Could you please upload one of the input PDFs that fails and post the link in this problem report? If you could share the content of the tmp folder too, it would be great. Which version of tesseract are you using now? (I will change the script to echo the versions of the tools in debug mode...)

sch82812121 commented 10 years ago

Thanks a lot for your support. After the upgrade, "tesseract -v" reports "3.02".

I have created a debug package at: https://www.wetransfer.com/downloads/0d66de043578679abbcf9de46bf8bb2b20140104132235/eeadfcdaa0067431af5a7178ade7017220140104132235/e42a5c

The package contains an OCR of a file that succeeds (beleg 63) and a file that fails (beleg 66) with all log and tmp files.

Feel free to email me if you need more info.

fritz-hh commented 10 years ago

Hi,

What I did:

Analyzing your tesseract output, I found that the file contains invalid characters (mainly STX characters). This seems to be a bug in tesseract v3.02 (I am running v3.02.02 and it works fine). Is it possible for you to get v3.02.02 for your OS?
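For anyone who cannot upgrade tesseract, a possible workaround is to clean the hOCR before it reaches the XML parser. The following is a minimal sketch under stated assumptions, not what hocrTransform.py does: the helper names are hypothetical, it uses the standard library parser instead of lxml, and it simply drops characters that XML 1.0 forbids (such as STX, 0x02, or the value 16 from the first traceback) and replaces undecodable bytes, so some OCR text is lost on affected lines.

```python
import re
import xml.etree.ElementTree as ET

# XML 1.0 forbids C0 control characters other than tab, LF and CR,
# as well as U+FFFE and U+FFFF.
_INVALID_XML_CHARS = re.compile("[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def sanitize_hocr(raw: bytes) -> str:
    """Decode hOCR bytes leniently and strip characters XML cannot contain."""
    # Invalid UTF-8 sequences become U+FFFD instead of aborting the parse.
    text = raw.decode("utf-8", errors="replace")
    return _INVALID_XML_CHARS.sub("", text)


def parse_hocr(raw: bytes) -> ET.Element:
    """Parse hOCR content after sanitizing it (hypothetical helper)."""
    # Re-encode so an encoding declaration in the file stays consistent.
    return ET.fromstring(sanitize_hocr(raw).encode("utf-8"))


# Illustrative input containing STX, a 0x10 byte and a stray 0xC0 byte.
broken = b"<div><span>Rech\x02nung \x10 \xc0*G\x03</span></div>"
root = parse_hocr(broken)
print(repr(root.find("span").text))
```

Upgrading to a tesseract build that does not emit these characters remains the cleaner fix, since sanitizing only hides the symptom.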

sch82812121 commented 10 years ago

I downloaded and compiled Tesseract 3.02.02 (Ubuntu) and it resolved the issue. Thanks a lot! Maybe your dependency checker should issue a warning if Tesseract 3.02 is installed...
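Such a warning could look roughly like the sketch below. It is an illustration only and not part of OCRmyPDF: the function names and the minimum version constant are hypothetical, and it assumes that "tesseract -v" prints a banner beginning with "tesseract" followed by a dotted version number, on either stdout or stderr.

```python
import re
import subprocess

# Assumed threshold, based on this thread: 3.02 fails, 3.02.02 works.
MIN_TESSERACT = (3, 2, 2)


def parse_version(banner):
    """Extract a version tuple such as (3, 2, 2) from the banner text."""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", banner, re.IGNORECASE)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def is_too_old(version, minimum=MIN_TESSERACT):
    """True if a parsed version is older than the minimum."""
    return version is not None and version < minimum


def check_installed_tesseract():
    """Print a warning if the installed tesseract looks too old."""
    try:
        # Merge stderr into stdout, since the banner may go to either stream.
        result = subprocess.run(
            ["tesseract", "-v"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        print("Warning: tesseract not found on PATH")
        return
    version = parse_version(result.stdout)
    if version is None:
        print("Warning: could not determine the tesseract version")
    elif is_too_old(version):
        print("Warning: tesseract", ".".join(map(str, version)),
              "may write invalid characters into hOCR output")
```

Comparing tuples of integers makes "3.02" (parsed as (3, 2)) sort before "3.02.02" (parsed as (3, 2, 2)), which matches the behaviour reported here.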

(btw: If you accept donations, you can email me your SEPA/BIC at ocrmypdf.20.mat77@spamgourmet.org)