fritz-hh / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

problem with unpaper #75

Closed: femifrak closed this issue 8 years ago

femifrak commented 10 years ago

When using OCRmyPDF-2.x with the options -d -c -i, black borders remain in the generated PDF. Shouldn't unpaper remove them? The input is a black-and-white PDF.

Here is the output:

root@xu:/home/tho/test# /opt/OCRmyPDF/OCRmyPDF-2.x/OCRmyPDF.sh -g -d -c -i test.pdf testOCR.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g -d -c -i test.pdf testOCR.pdf

Checking if all dependencies are installed

ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP

GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using GNU Parallel for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.

Poppler-utils version:
pdfimages version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

unpaper version:
0.4.2

tesseract version:
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

python2 version:
Python 2.7.6

Ghostscript version:
9.10

Java version:
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

Created temporary folder: "/tmp/tmp.cL2lCvVStC"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 578x342 (h*w in pt)
Page 0001: Size 3424x2208 (in pixel)
Page 0001: Extracting image as pbm file (445 dpi)
Page 0001: Deskewing image
Page 0001: Cleaning image with unpaper
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.cL2lCvVStC/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 31 seconds

femifrak commented 10 years ago

I just saw in ocrPage.sh that unpaper is called with these arguments:

--mask-scan-size 100 --no-deskew --no-grayfilter --no-blackfilter --no-mask-center --no-border-align

Are there any disadvantages to removing the last four arguments, or some of them?

Thanks!

fritz-hh commented 10 years ago

Hi. I had issues with some documents (e.g. unpaper removed images in some cases), so I added these parameters. Feel free to change them if that better fits your use case.
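
For anyone who wants to experiment with this, here is a minimal sketch of how the argument list could be made switchable, so that filters can be re-enabled one at a time and the results compared. It is not code from ocrPage.sh: the function name build_unpaper_args and the KEEP_* environment variables are hypothetical, and only the unpaper options quoted above are used. Re-enabling the black filter first is a guess based on its purpose (clearing large black areas); as noted above, it may also remove wanted images on some documents.

```shell
#!/bin/sh
# Hypothetical helper (not part of OCRmyPDF): prints the unpaper argument
# list, dropping a "--no-*" switch when the matching KEEP_* variable is "1".
build_unpaper_args() {
    # These two arguments are always passed, as in the original call.
    args="--mask-scan-size 100 --no-deskew"
    [ "${KEEP_GRAYFILTER:-0}" = "1" ]   || args="$args --no-grayfilter"
    [ "${KEEP_BLACKFILTER:-0}" = "1" ]  || args="$args --no-blackfilter"
    [ "${KEEP_MASK_CENTER:-0}" = "1" ]  || args="$args --no-mask-center"
    [ "${KEEP_BORDER_ALIGN:-0}" = "1" ] || args="$args --no-border-align"
    echo "$args"
}

# Example (file names are placeholders): re-enable only the black filter.
# unpaper $(KEEP_BLACKFILTER=1 build_unpaper_args) page.pbm page_clean.pbm
```

With no KEEP_* variable set, the function prints exactly the argument list currently used, so the default behaviour stays unchanged.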