Closed gitmaster2013 closed 10 years ago
I'm getting the same error.
Look at the tmp folder:
lrwxrwxrwx 1 root root 17 Mär 28 12:02 0001.cleaned.ppm -> 0001.deskewed.ppm
lrwxrwxrwx 1 root root 8 Mär 28 12:02 0001.deskewed.ppm -> 0001.ppm
-rw-r--r-- 1 root root 63492 Mär 28 12:03 0001.hocr.hocr
-rw-r--r-- 1 root root 367153 Mär 28 12:02 0001.orig-img-000.jpg
-rw-r--r-- 1 root root 33 Mär 28 12:02 0001.orig-img-info.txt
-rw-r--r-- 1 root root 10996739 Mär 28 12:02 0001.ppm
-rw-r--r-- 1 root root 26 Mär 28 12:02 pages-info.txt
-rw-r--r-- 1 root root 17 Mär 28 12:02 tmp.txt
The file "0001.hocr.hocr" could be the problem.
Maybe this helps:
change mv "$curHocr.html" "$curHocr" to mv "$curHocr.hocr" "$curHocr". Sorry, I don't know the line number.
Now it works, at least for me.
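For anyone who has to support both cases, here is a minimal sketch of a rename that accepts either extension. It is not taken from the OCRmyPDF sources: the function name normalize_hocr is hypothetical, and it assumes, based on the listing above and the original mv line, that the OCR step leaves either "$curHocr.hocr" or "$curHocr.html" behind.

```shell
#!/bin/sh
# Hypothetical helper (not part of OCRmyPDF): move whichever hOCR output
# file the OCR step produced to the path the rest of the script expects.
normalize_hocr() {
    curHocr="$1"    # expected final path, e.g. /tmp/tmp.XXXX/0001.hocr
    if [ -f "$curHocr.hocr" ]; then
        # case reported in this thread: the output is "<base>.hocr"
        mv "$curHocr.hocr" "$curHocr"
    elif [ -f "$curHocr.html" ]; then
        # case the original mv line was written for: "<base>.html"
        mv "$curHocr.html" "$curHocr"
    else
        echo "No hOCR output found for $curHocr" >&2
        return 1
    fi
}

# Small demonstration in a throwaway directory
demo=$(mktemp -d)
: > "$demo/0001.hocr.hocr"    # simulate the file seen in the tmp folder
normalize_hocr "$demo/0001.hocr" && ls "$demo"
rm -rf "$demo"
```

Checking for the file before moving it also gives a clear message when the OCR step produced nothing at all, instead of a bare mv error followed by a Python traceback.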
I'm having the same issue as gitmaster2013 after upgrading from Kubuntu 13.10 to 14.04. However, the tip from miamoebel doesn't work for me. After changing the line to mv "$curHocr.hocr" "$curHocr" I get the following:
malte@Malte-Laptop:~/scanner$ sudo sh /home/malte/OCR/ -g /home/malte/test.pdf /home/malte/test2.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g /home/malte/test.pdf /home/malte/test2.pdf
ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-06 Q16
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP
GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site:
When using GNU Parallel for a publication please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
Poppler-utils version:
pdfimages version 0.24.5
Copyright 2005-2013 The Poppler Developers -
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.5
Copyright 2005-2013 The Poppler Developers -
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.5
Copyright 2005-2013 The Poppler Developers -
unpaper version:
tesseract version: tesseract 3.03 leptonica-1.70 libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
python2 version:
Ghostscript version:
Java version: java version "1.7.0_51" OpenJDK Runtime Environment (IcedTea 2.4.6) (7u51-2.4.6-1ubuntu4)
Created temporary folder: "/tmp/tmp.rgkjc1ANpP"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 842x595 (h*w in pt)
Page 0001: Size 3504x2480 (in pixel)
Page 0001: Extracting image as ppm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Could not create PDF file from "/tmp/tmp.rgkjc1ANpP/0001.hocr". Exiting...
Traceback (most recent call last):
File "./src/", line 283, in
The library python-reportlab 3.0-1 doesn't work; you should force version 2.7-1 (the package in testing).
Daniel Berthereau Infodoc & Knowledge management
Thanks Daniel,
that worked for me (although I'm using reportlab 2.6-1; I couldn't get 2.7-1) in conjunction with the fix from miamoebel. If I don't change the mv... line as miamoebel suggested, I still get errors. If you want, I can post these errors.
Thank you very much!
I can reproduce this issue using Ubuntu 14.04. The problem stems from new versions of tesseract and reportlab.
On my system the following two steps are necessary to make OCRmyPDF work:
Change mv "$curHocr.html" "$curHocr" to mv "$curHocr.hocr" "$curHocr".
changed to asciiBase85Encode. This pull request by @dreuter shows how to fix this issue by editing
@andreas-christ! That solved my problem without (re-)installing or downgrading anything. Line 190 of, and just 2 changes to make in lines 50 and 92.
Hello everyone, I just tried to test the script. Unfortunately, I keep running into an error like this:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
sh OCRmyPDF-2_0/ -g -c -l deu test.pdf test-OCR.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g -c -l deu test.pdf test-OCR.pdf
Checking if all dependencies are installed
ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-03 Q16
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP
GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site:
When using GNU Parallel for a publication please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.
Poppler-utils version:
pdfimages version 0.22.5
Copyright 2005-2013 The Poppler Developers -
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.22.5
Copyright 2005-2013 The Poppler Developers -
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.22.5
Copyright 2005-2013 The Poppler Developers -
Copyright 1996-2011 Glyph & Cog, LLC
unpaper version:
tesseract version: tesseract 3.03 leptonica-1.70 libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
python2 version:
Python 2.7.6
Ghostscript version:
Java version: java version "1.7.0_51" OpenJDK Runtime Environment (IcedTea 2.4.5) (7u51-2.4.5-2)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
Created temporary folder: "/tmp/tmp.reQjvra5iC"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 842x595 (h*w in pt)
Page 0001: Size 3488x2544 (in pixel)
Page 0001: Extracting image as pbm file (303 dpi)
Page 0001: Cleaning image with unpaper
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Could not create PDF file from "/tmp/tmp.reQjvra5iC/0001.hocr". Exiting...
mv: cannot stat '/tmp/tmp.reQjvra5iC/0001.hocr.html': No such file or directory
Traceback (most recent call last):
File "./src/", line 281, in
hocr = hocrTransform(args.hocrfile, args.resolution)
File "./src/", line 110, in init
File "lxml.etree.pyx", line 1795, in lxml.etree._ElementTree.parse (src/lxml/lxml.etree.c:54431)
File "parser.pxi", line 1748, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102066)
File "parser.pxi", line 1774, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:102330)
File "parser.pxi", line 1678, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:101365)
File "parser.pxi", line 1110, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:96817)
File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92461)
File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91722)
IOError: Error reading file '/tmp/tmp.reQjvra5iC/0001.hocr': failed to load external entity "/tmp/tmp.reQjvra5iC/0001.hocr"
parallel: Starting no more jobs. Waiting for 1 jobs to finish. This job failed:
./src/ /root/test.pdf 0001\ 595\ 842 0001 /tmp/tmp.reQjvra5iC 3 deu 1 0 1 0 0 1 '' 0
It would be great if someone could give me a hint as to what the cause might be.
Many thanks!