fritz-hh / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Error creating the PDF from .hocr #72

Closed gitmaster2013 closed 10 years ago

gitmaster2013 commented 10 years ago

Hello everyone,

I just tried to test the script. Unfortunately, I keep running into an error like this:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

sh OCRmyPDF-2_0/OCRmyPDF.sh -g -c -l deu test.pdf test-OCR.pdf

OCRmyPDF version: v2.0-stable
Arguments: -g -c -l deu test.pdf test-OCR.pdf

Checking if all dependencies are installed

ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-03 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP


GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using GNU Parallel for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,

;login: The USENIX Magazine, February 2011:42-47.

Poppler-utils version:
pdfimages version 0.22.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.22.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.22.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

unpaper version:

0.4.2

tesseract version: tesseract 3.03 leptonica-1.70 libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0


python2 version:

Python 2.7.6

Ghostscript version:

9.05

Java version:
java version "1.7.0_51"
OpenJDK Runtime Environment (IcedTea 2.4.5) (7u51-2.4.5-2)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

Created temporary folder: "/tmp/tmp.reQjvra5iC"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 842x595 (h*w in pt)
Page 0001: Size 3488x2544 (in pixel)
Page 0001: Extracting image as pbm file (303 dpi)
Page 0001: Cleaning image with unpaper
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Could not create PDF file from "/tmp/tmp.reQjvra5iC/0001.hocr". Exiting...
mv: cannot stat '/tmp/tmp.reQjvra5iC/0001.hocr.html': No such file or directory
Traceback (most recent call last):
  File "./src/hocrTransform.py", line 281, in <module>
    hocr = hocrTransform(args.hocrfile, args.resolution)
  File "./src/hocrTransform.py", line 110, in __init__
    self.hocr.parse(hocrFileName)
  File "lxml.etree.pyx", line 1795, in lxml.etree._ElementTree.parse (src/lxml/lxml.etree.c:54431)
  File "parser.pxi", line 1748, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102066)
  File "parser.pxi", line 1774, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:102330)
  File "parser.pxi", line 1678, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:101365)
  File "parser.pxi", line 1110, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:96817)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92461)
  File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91722)
IOError: Error reading file '/tmp/tmp.reQjvra5iC/0001.hocr': failed to load external entity "/tmp/tmp.reQjvra5iC/0001.hocr"
parallel: Starting no more jobs. Waiting for 1 jobs to finish. This job failed:
./src/ocrPage.sh /root/test.pdf 0001\ 595\ 842 0001 /tmp/tmp.reQjvra5iC 3 deu 1 0 1 0 0 1 '' 0

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

It would be great if someone could give me a hint as to what the cause might be. Thank you very much!

hierchristian commented 10 years ago

Hello,

I'm getting the same error.

Look at the tmp folder:

lrwxrwxrwx 1 root root       17 Mär 28 12:02 0001.cleaned.ppm -> 0001.deskewed.ppm
lrwxrwxrwx 1 root root        8 Mär 28 12:02 0001.deskewed.ppm -> 0001.ppm
-rw-r--r-- 1 root root    63492 Mär 28 12:03 0001.hocr.hocr
-rw-r--r-- 1 root root   367153 Mär 28 12:02 0001.orig-img-000.jpg
-rw-r--r-- 1 root root       33 Mär 28 12:02 0001.orig-img-info.txt
-rw-r--r-- 1 root root 10996739 Mär 28 12:02 0001.ppm
-rw-r--r-- 1 root root       26 Mär 28 12:02 pages-info.txt
-rw-r--r-- 1 root root       17 Mär 28 12:02 tmp.txt

The file "0001.hocr.hocr" could be the problem: the script expects "0001.hocr.html".

Maybe this helps.

hierchristian commented 10 years ago

Hi,

In ocrPage.sh, change mv "$curHocr.html" "$curHocr" to mv "$curHocr.hocr" "$curHocr". Sorry, I don't know the line number.

Now it works at least for me.

Bye
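
The rename fix above could be made tolerant of both file extensions. The following is only a minimal sketch in Python, not the project's shell code. It assumes tesseract writes either `<base>.hocr` or `<base>.html` depending on its version; the function name `normalize_hocr_output` and its parameters are hypothetical.

```python
import os


def normalize_hocr_output(base_path, extensions=(".hocr", ".html")):
    """Rename tesseract's hOCR output to `base_path`, whichever extension it used.

    Returns the extension that was found, or None if no output file exists.
    (Hypothetical helper for illustration; not part of OCRmyPDF.)
    """
    for ext in extensions:
        candidate = base_path + ext
        if os.path.isfile(candidate):
            # Equivalent to: mv "$curHocr<ext>" "$curHocr"
            os.rename(candidate, base_path)
            return ext
    return None
```

Called with the path from the listing above (`/tmp/.../0001.hocr`), this would pick up `0001.hocr.hocr` first and fall back to `0001.hocr.html` on older tesseract versions.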

MzunguKichaa commented 10 years ago

Hey,

I'm having the same issue as gitmaster2013 after upgrading from Kubuntu 13.10 to 14.04. However, the tip from miamoebel doesn't work for me. After changing the line to mv "$curHocr.hocr" "$curHocr", I get the following:

malte@Malte-Laptop:~/scanner$ sudo sh /home/malte/OCR/OCRmyPDF.sh -g /home/malte/test.pdf /home/malte/test2.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g /home/malte/test.pdf /home/malte/test2.pdf

Checking if all dependencies are installed

ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP


GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using GNU Parallel for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,

;login: The USENIX Magazine, February 2011:42-47.

Poppler-utils version:
pdfimages version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

unpaper version:

0.4.2

tesseract version: tesseract 3.03 leptonica-1.70 libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0


python2 version:

Python 2.7.6

Ghostscript version:

9.10

Java version:
java version "1.7.0_51"
OpenJDK Runtime Environment (IcedTea 2.4.6) (7u51-2.4.6-1ubuntu4)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

Created temporary folder: "/tmp/tmp.rgkjc1ANpP"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 842x595 (h*w in pt)
Page 0001: Size 3504x2480 (in pixel)
Page 0001: Extracting image as ppm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Could not create PDF file from "/tmp/tmp.rgkjc1ANpP/0001.hocr". Exiting...
Traceback (most recent call last):
  File "./src/hocrTransform.py", line 283, in <module>
    hocr.to_pdf(args.outputfile, args.image, args.boundingboxes)
  File "./src/hocrTransform.py", line 266, in to_pdf
    pdf.drawInlineImage(im, 0, 0, width=self.width, height=self.height)
  File "/usr/lib/python2.7/dist-packages/reportlab/pdfgen/canvas.py", line 826, in drawInlineImage
    img_obj = PDFImage(image, x,y, width, height)
  File "/usr/lib/python2.7/dist-packages/reportlab/pdfgen/pdfimages.py", line 40, in __init__
    self.getImageData()
  File "/usr/lib/python2.7/dist-packages/reportlab/pdfgen/pdfimages.py", line 165, in getImageData
    imagedata, imgwidth, imgheight = self.PIL_imagedata()
  File "./src/hocrTransform.py", line 50, in PIL_imagedata
    from reportlab.pdfbase.pdfutils import _AsciiBase85Encode, _chunker
ImportError: cannot import name _AsciiBase85Encode
parallel: Starting no more jobs. Waiting for 1 jobs to finish. This job failed:
./src/ocrPage.sh /home/malte/test.pdf 0001\ 595\ 842 0001 /tmp/tmp.rgkjc1ANpP 3 eng 1 0 0 0 0 1 '' 0

Daniel-KM commented 10 years ago

Hi,

The python-reportlab 3.0-1 package doesn't work with this script; you should force version 2.7-1 (the package in testing).

Sincerely,

Daniel Berthereau Infodoc & Knowledge management

MzunguKichaa commented 10 years ago

Thanks Daniel,

That worked for me (though I'm using reportlab 2.6-1, as I didn't get 2.7-1), in conjunction with the fix from miamoebel. If I don't change the mv line as miamoebel suggested, I still get errors. If you want, I can post them.

Thank you very much!

Malte

achrist42 commented 10 years ago

Hi,

I can reproduce this issue using Ubuntu 14.04. The problem stems from new versions of tesseract and reportlab.

On my system the following two steps are necessary to make OCRmyPDF work:

  1. As @miamoebel already mentioned, in ocrPage.sh on line 190, change mv "$curHocr.html" "$curHocr" to mv "$curHocr.hocr" "$curHocr".
  2. In recent versions of reportlab, _AsciiBase85Encode was renamed to asciiBase85Encode. This pull request by @dreuter shows how to fix the issue by editing hocrTransform.py: https://github.com/fritz-hh/OCRmyPDF/pull/71/files
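
The rename in step 2 could also be handled without pinning a reportlab version, by trying the new name first and falling back to the old one. This is a minimal sketch under stated assumptions, not the change made in the pull request: the helpers `import_first` and `load_base85_encoder` are hypothetical, and the only reportlab names used are the two mentioned in this thread.

```python
import importlib


def import_first(module_name, candidates):
    """Return the first attribute from `candidates` that the module provides.

    Hypothetical helper for illustration; raises ImportError if the module
    is missing or none of the candidate names exist in it.
    """
    module = importlib.import_module(module_name)
    for name in candidates:
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError("none of %r found in %s" % (candidates, module_name))


def load_base85_encoder():
    """Look up reportlab's Base85 encoder under its new or old name.

    Returns None when reportlab is not installed or exposes neither name.
    """
    try:
        return import_first(
            "reportlab.pdfbase.pdfutils",
            ("asciiBase85Encode", "_AsciiBase85Encode"),
        )
    except ImportError:
        return None
```

With a lookup like this, the same hocrTransform.py would run against either reportlab 2.x or 3.x, at the cost of one extra indirection at import time.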

AvanOsch commented 10 years ago

Thanks @andreas-christ! That solved my problem without (re-)installing or downgrading anything: one change on line 190 of ocrPage.sh, and just two changes in hocrTransform.py, on lines 50 and 92.