fritz-hh / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

original images not kept unaltered #78

Closed femifrak closed 8 years ago

femifrak commented 10 years ago

When using the 2.x version, available as a zip file on the right side of https://github.com/fritz-hh/OCRmyPDF, on Xubuntu 14.04, the original PDF is altered although I did not use -i. The first page of http://www.loaditup.de/files/817245_gcstsh3wuy.pdf shows the original black-and-white PDF; the second page shows the altered PDF, which unfortunately looks frazzled. (I merged both pages for convenience.) Is there a way to avoid this quality loss?

I tested the suggestion from #61, but without success, which makes sense as no "-i" was used. I also tested a PDF whose size is an integer number of pixels, again without success. Maybe it has to do with the problem described here? http://lists.freedesktop.org/archives/poppler-bugs/2013-August/010469.html

Thanks for the help.

Here is the output with -g:

># /opt/OCRmyPDF/OCRmyPDF-2.x/OCRmyPDF.sh -g sw_original.pdf sw_original_OCR.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g sw_original.pdf sw_original_OCR.pdf
Checking if all dependencies are installed
--------------------------------
ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP    

--------------------------------
GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using GNU Parallel for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, 
;login: The USENIX Magazine, February 2011:42-47.
--------------------------------
Poppler-utils version:
pdfimages version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
--------------------------------
unpaper version:
0.4.2
--------------------------------
tesseract version:
tesseract 3.03
 leptonica-1.70
  libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

--------------------------------
python2 version:
Python 2.7.6
--------------------------------
Ghostscript version:
9.10
--------------------------------
Java version:
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
--------------------------------
Created temporary folder: "/tmp/tmp.ZIHGjUFKJS"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 842x594 (h*w in pt)
Page 0001: Size 3508x2477 (in pixel)
Page 0001: Extracting image as pbm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.ZIHGjUFKJS/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 25 seconds
fritz-hh commented 9 years ago

Unfortunately, it is not (easily) possible to extract the original images from a PDF file using open-source Linux software while keeping their orientation as in the PDF file. Therefore I use pdftoppm to GENERATE an image from the PDF file. The image is generated with the same resolution as the original image, but it is not the original image.
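As a rough illustration of how resolution ties the page size in points to the size of the generated image in pixels, here is a minimal sketch using the figures from the log above (a page 842 pt high rendered to 3508 px). The helper names `estimate_dpi` and `pixels_for` are hypothetical and not part of OCRmyPDF.

```python
POINTS_PER_INCH = 72.0  # PDF user space unit: 1 pt = 1/72 inch


def estimate_dpi(pixels, points):
    """Return the resolution (dpi) at which `points` of page length map to `pixels`."""
    return pixels * POINTS_PER_INCH / points


def pixels_for(points, dpi):
    """Return the number of pixels needed to render `points` of page length at `dpi`."""
    return round(points * dpi / POINTS_PER_INCH)


# Figures taken from the log above: page height 842 pt, image height 3508 px
print(round(estimate_dpi(3508, 842)))
print(pixels_for(842, 300))
```

An image rendered this way has the same pixel dimensions as the embedded scan only if the scan really was stored at that resolution; even then the pixels are re-rendered rather than copied.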

If anybody has an idea how to solve this limitation, please let me know!

jbarlow83 commented 9 years ago

The PyPDF2 library can read internal PDF structure and get the page orientation.

$ ipython3
import PyPDF2 as pypdf
pdf = pypdf.PdfFileReader('example.pdf')
# /Rotate is optional, so fall back to 0 when the key is missing
pdf.pages[0].get('/Rotate', 0)

That field records the rotation anyone has applied to fix the orientation of a given page and should work in a lot of cases. It would be possible to get the page and image dimensions as well, I believe, and faster than the various calls to pdftoppm and pdfimages since everything could be done in a single process. That would be a first step and should work as long as each page contains one image that fills the page – probably good enough for most scanned PDFs with no OCR.

For multiple images on a page you would have to interpret the PostScript to determine where images are rendered, because PostScript can apply an arbitrary transformation matrix (translate, rotate to an arbitrary angle, scale, skew) to an image before rendering it. In this case you'd run OCR jobs on the extracted images and then insert the hidden OCR layer into the PostScript stream. Needless to say, this would be much harder.

fritz-hh commented 9 years ago

Even if the page contains only 1 image, I am not sure knowing the page orientation would be enough. Indeed, there would still be 2 possible rotation angles for the image (x and x+180°). Are there tools to easily extract the transformation matrix of the image (especially the rotation angle)?

jbarlow83 commented 9 years ago

It's not page orientation as in landscape/portrait use of paper. /Rotate records the 0/90/180/270° rotation that is usually set by a user. It should be enough to determine the image rotation for simple cases like scanned PDF output. The /MediaBox is also part of the picture - one can specify the virtual paper size with /MediaBox, and then rotate it. So for the simple case (scanner PDF output), with /MediaBox and /Rotate, you should be able to determine the orientation of the image.
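For the simple case, combining /MediaBox and /Rotate could look roughly like the sketch below. It assumes the two values have already been read from the page dictionary (for example with PyPDF2, as shown earlier) and relies on /Rotate being a clockwise rotation in multiples of 90°. The function names are hypothetical.

```python
def displayed_size(mediabox, rotate=0):
    """Return (width, height) in points of the page as it is displayed.

    mediabox: (llx, lly, urx, ury), as stored in /MediaBox
    rotate:   value of /Rotate, a multiple of 90 (degrees clockwise)
    """
    if rotate % 90 != 0:
        raise ValueError("/Rotate must be a multiple of 90")
    llx, lly, urx, ury = mediabox
    width, height = abs(urx - llx), abs(ury - lly)
    # A quarter or three-quarter turn swaps width and height
    if rotate % 180 == 90:
        width, height = height, width
    return width, height


def orientation(mediabox, rotate=0):
    """Classify the displayed page as 'landscape' or 'portrait'."""
    width, height = displayed_size(mediabox, rotate)
    return "landscape" if width > height else "portrait"


# A portrait 594x842 pt media box shown with a 90 degree rotation
print(displayed_size((0, 0, 594, 842), 90))
print(orientation((0, 0, 594, 842), 90))
```

Note that the displayed size alone cannot distinguish 90° from 270° (or 0° from 180°); the /Rotate value itself carries that information.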

If I understand correctly the transformation matrix is sort of like a CPU register in the PostScript language, sensitive to state, so in general you have to interpret all of the preceding PostScript on a page to determine its value at a point of interest. So it is harder (although pdftoppm would have code to do this). But this should only be necessary for more complex PDFs, not the output of scanning software.
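To make the "sensitive to state" point concrete, here is a minimal sketch of tracking the current transformation matrix through a page content stream. It relies on the PDF operators `q` (save state), `Q` (restore state), `cm` (concatenate a matrix) and `Do` (paint an XObject). Everything else is a simplifying assumption: tokens are split on whitespace only (no strings, arrays or inline images), every `Do` is treated as an image, and the rotation is read from the matrix assuming no skew or mirroring. The function names are hypothetical.

```python
import math

IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m, ctm):
    """Return m x ctm for PDF matrices written as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def image_matrices(content):
    """Yield (name, ctm) for every `Do` operator in a simplified content stream."""
    ctm = IDENTITY
    stack = []     # saved graphics states (only the CTM is tracked here)
    operands = []  # tokens seen since the last handled operator
    for token in content.split():
        if token == "q":
            stack.append(ctm)
        elif token == "Q":
            ctm = stack.pop()
        elif token == "cm":
            # The six numbers before `cm` are premultiplied onto the CTM
            ctm = concat(tuple(float(x) for x in operands[-6:]), ctm)
        elif token == "Do":
            yield operands[-1], ctm
        else:
            operands.append(token)
            continue
        operands = []


def rotation_degrees(ctm):
    """Counterclockwise rotation in [0, 360), assuming no skew or mirroring."""
    return math.degrees(math.atan2(ctm[1], ctm[0])) % 360


# Hypothetical stream: rotate by 90 degrees, then scale the unit image square
stream = "q 0 1 -1 0 612 0 cm 792 0 0 612 0 0 cm /Im0 Do Q"
for name, ctm in image_matrices(stream):
    print(name, ctm, rotation_degrees(ctm))
```

Because the angle comes from `atan2` on the full matrix, x and x+180° yield different results, which is the ambiguity raised earlier in the thread. A real implementation would need a proper content stream parser rather than whitespace splitting.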

fritz-hh commented 9 years ago

OK, I understand what you mean now. I would propose to introduce this feature once ocrpage has been rewritten in Python.

kebekus commented 9 years ago

Actually, I believe adding a text layer without changing the original contents of the PDF file is easy. PDFTK can do that.

Here is a little shell script that I use for my personal files. It does a very similar task (it takes the images from my scanner, not from a PDF file).

#!/bin/bash

# Clear directory
rm -f *.pnm *.djvu

# Scan images
scanimage --batch=scan\%03d.pnm --mode=Gray --adf-auto-scan=yes -x 210mm -y 296mm --resolution 600 --adf-mode=Simplex

# Threshold and cut scanned pages
# The glob expands in sorted order and only matches the freshly scanned .pnm files
for page in scan*.pnm
do
    file_name=$(basename "$page")
    file_name_without_ext=${file_name%.*}

    echo "Cut and threshold page $file_name_without_ext"
    pnmcut -left 105 -bottom 6500 <"$page" | pgmtopbm -threshold -value 0.78 >"$file_name_without_ext.pbm"
done

# Generate PDF document
echo "Compress as PDF"
jbig2 -v -p -s scan*.pbm
pdf.py output >fertig.pdf
# Delete jbig2 temporary files
rm output* 

# OCR

# OCR each page, and produce PDF file(s) containing the background (=text) layer
for page in scan*.pbm
do
    file_name=$(basename "$page")
    file_name_without_ext=${file_name%.*}

    echo "Character recognition $page"
    tesseract -l eng "$page" "$file_name_without_ext" hocr >/dev/null
    python2 ~/bin/OCRmyPDF-2.2-stable/src/hocrTransform.py -r 600 "$file_name_without_ext.html" "$file_name_without_ext.pdf"
    # Delete temporary hocr file
    rm "$file_name_without_ext.html"
done

# Join PDF files into one file that contains all OCR backgrounds
pdftk scan*.pdf output ocr.pdf
# Delete temporary scan*.pdf files
rm scan*.pdf

# Merge OCR background PDF into the main PDF document
pdftk fertig.pdf multibackground ocr.pdf output fertig-ocr.pdf
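Since the thread mentions rewriting parts of OCRmyPDF in Python, the last two pdftk steps of the script could be driven from Python roughly as follows. This is only a sketch: the pdftk arguments are copied from the script above, while the function names and the `joined_pdf` default are hypothetical.

```python
import shutil
import subprocess


def join_cmd(page_pdfs, joined_pdf):
    """Build the command joining per-page OCR PDFs, like `pdftk scan*.pdf output ocr.pdf`."""
    return ["pdftk", *page_pdfs, "output", joined_pdf]


def multibackground_cmd(main_pdf, ocr_pdf, out_pdf):
    """Build the command placing each page of ocr_pdf behind the same page of main_pdf."""
    return ["pdftk", main_pdf, "multibackground", ocr_pdf, "output", out_pdf]


def add_text_layer(main_pdf, page_pdfs, out_pdf, joined_pdf="ocr.pdf"):
    """Run both pdftk steps; the page streams of main_pdf are left as they are."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk not found on PATH")
    # sorted() mirrors the ordering of the shell glob scan*.pdf
    subprocess.run(join_cmd(sorted(page_pdfs), joined_pdf), check=True)
    subprocess.run(multibackground_cmd(main_pdf, joined_pdf, out_pdf), check=True)


print(multibackground_cmd("fertig.pdf", "ocr.pdf", "fertig-ocr.pdf"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.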