femifrak closed 8 years ago
I just saw in ocrPage.sh that unpaper is called with the arguments:
--mask-scan-size 100 --no-deskew --no-grayfilter --no-blackfilter --no-mask-center --no-border-align
Are there any disadvantages to removing the last four arguments, or some of them?
Thanks!
Hi. I had issues with some documents (e.g. images were removed by unpaper in some cases), so I added these parameters. Feel free to change them if that better fits your use case.
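To try out different combinations of these flags without editing the call by hand each time, a small helper along the following lines might be useful. This is only a sketch, not the code in ocrPage.sh: the function name `build_unpaper_args` and its drop-list parameter are hypothetical, and only the flag names are taken from the post above. It prints the argument list rather than running unpaper.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of OCRmyPDF): prints the unpaper option list,
# leaving out any of the four "--no-*" filter flags named in $1.
# $1 is a space-separated list of flags to drop (an assumption of this sketch).
build_unpaper_args() {
    local drop=" $1 "
    # Options that are always kept in this sketch.
    local args=(--mask-scan-size 100 --no-deskew)
    local flag
    for flag in --no-grayfilter --no-blackfilter --no-mask-center --no-border-align; do
        # Keep the flag unless it appears in the drop list.
        if [[ "$drop" != *" $flag "* ]]; then
            args+=("$flag")
        fi
    done
    printf '%s\n' "${args[*]}"
}

# Dry run: show the full list, then the list without two of the flags.
build_unpaper_args ""
build_unpaper_args "--no-blackfilter --no-border-align"
```

The printed options could then be placed in the unpaper call in ocrPage.sh, and the result compared on a few sample pages before settling on a combination.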
When using OCRmyPDF-2.x with -dci, black borders remain in the generated PDF. Shouldn't unpaper remove them? The input is a black-and-white PDF.
Here is the output:
root@xu:/home/tho/test# /opt/OCRmyPDF/OCRmyPDF-2.x/OCRmyPDF.sh -g -d -c -i test.pdf testOCR.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g -d -c -i test.pdf testOCR.pdf
Checking if all dependencies are installed
ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP
GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site: http://www.gnu.org/software/parallel
When using GNU Parallel for a publication please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.
Poppler-utils version:
pdfimages version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
unpaper version:
0.4.2
tesseract version:
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
python2 version:
Python 2.7.6
Ghostscript version:
9.10
Java version:
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
Created temporary folder: "/tmp/tmp.cL2lCvVStC"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 578x342 (h*w in pt)
Page 0001: Size 3424x2208 (in pixel)
Page 0001: Extracting image as pbm file (445 dpi)
Page 0001: Deskewing image
Page 0001: Cleaning image with unpaper
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.cL2lCvVStC/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 31 seconds