Tesseract 3.03-rc1 and newer git versions have basic integrated(!) mixed-mode single-page PDF rendering support

Wikinaut commented 10 years ago

[UPDATED: I removed the information about the tessedit_pdf_compression parameter, which has been recently removed in the HEAD branch version of tesseract.]

Thanks

Dear developers of OCRmyPDF, first of all, thank you for your impressive and great work.

News

In the meantime, after your publications in autumn 2013 and later in the German magazine "c't", the Tesseract developers integrated similar PDF output support into their code starting in February 2014. In my personal view, this makes the OCRmyPDF framework obsolete in many standard cases, though not in all.

The new feature can currently only be used if you check out Tesseract from their source in Google git ( https://code.google.com/p/tesseract-ocr/ explains how)

Searchable PDF output is a standard feature as of Tesseract version 3.03

https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_produce_searchable_PDF_output?

tesseract phototest.png phototest pdf
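To check whether an installed tesseract is new enough for the command above, a minimal sketch could look like this. It assumes that `sort -V` (a GNU coreutils extension) is available and that the first line of the version banner has the form "tesseract <version>"; the function names are made up for illustration.

```shell
#!/bin/sh
# Succeeds if version $1 is greater than or equal to version $2.
# Assumption: `sort -V` (version sort, GNU coreutils) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the leading numeric version, e.g. "3.03" for "tesseract 3.03-rc1".
# 2>&1 covers builds that print the banner on stderr instead of stdout.
tesseract_version() {
    tesseract --version 2>&1 | sed -n '1s/^tesseract \([0-9][0-9.]*\).*/\1/p'
}

if command -v tesseract >/dev/null 2>&1; then
    installed=$(tesseract_version)
    if [ -n "$installed" ] && version_ge "$installed" 3.03; then
        echo "tesseract $installed: PDF output should be available"
    else
        echo "tesseract ${installed:-(unknown version)}: PDF output support not confirmed"
    fi
else
    echo "tesseract not found"
fi
```

The check only compares version numbers; whether a particular distribution package was actually built with the PDF renderer still has to be tried out.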

framework for multi-page pdfs

Tesseract cannot process multi-page PDFs as input.

Here is a framework example:

splitting (use pdftk) and converting (use convert from imagemagick) into single-page png images (lossless coded)
per page: tesseract OCR process and create a single-page mixed-mode PDF
merge (use pdftk) the single-page mixed-mode PDFs into the multi-page mixed-mode PDF

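# Assumes $tmpdir, $density, $depth, $tessdatadir and $language are set beforehand
pdftk infile.pdf burst output "$tmpdir/page_%03d.pdf"
imagetype="png"

for file in "$tmpdir"/*.pdf
do
    image="$file.$imagetype"
    # Render the single-page PDF to a lossless image, then drop the PDF
    convert -density "$density" -depth "$depth" "$file" "$image"
    rm "$file"
    # OCR the image; tesseract writes the mixed-mode PDF to "$image.pdf"
    tesseract "$image" "$image" --tessdata-dir "$tessdatadir" -l "$language" pdf
    rm "$image"
done

# Merge the per-page PDFs (zero-padded names keep the glob in page order)
pdftk "$tmpdir"/*.pdf cat output "$tmpdir/tmp.pdf"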

fritz-hh commented 10 years ago

Hi Wikinaut,

Thank you for your message.

Tesseract delivered a 3.03-rc1 (i.e. release candidate 1) version on 4 February 2014 (https://code.google.com/p/tesseract-ocr/wiki/ReleaseNotes). According to the Tesseract team, this pre-release is for developers and testers only (see this thread: https://groups.google.com/d/msg/tesseract-ocr/er3ONslwbEE/IQozlErxz9sJ). For unclear reasons it has been shipped in some Linux distributions (e.g. Ubuntu) and is advertised as tesseract 3.03 (even though it is 3.03-rc1).

Nevertheless, most Linux/Unix distributions do not yet have any tesseract package that supports PDF generation.

Even after the next official version of Tesseract is delivered (and no delivery date has been announced yet), I believe that OCRmyPDF will be of strong interest to many users. Indeed, it provides many functions that (to my knowledge) won't be part of the next Tesseract release:

Generation of multi-page PDF/A-1 files (i.e. meeting the requirements for long-term archiving)
Fast generation, as it makes use of all CPU cores instead of just one
Keeps the exact resolution of the original embedded images
If required, deskews and/or cleans the image before performing OCR
Provides a debug mode to enable easy verification of the OCR results
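To illustrate the multi-core point: page-level parallelism can be approximated on the shell with `xargs -P`. The following is only a sketch under assumptions, not how OCRmyPDF itself is implemented. The function name `run_per_page` and the `page_*.png` naming are hypothetical, and `-print0`, `-0` and `-P` are GNU/BSD extensions rather than POSIX.

```shell
#!/bin/sh
# Run one command per page image, several at a time.
#   $1 = directory containing page_*.png files (hypothetical naming)
#   $2 = number of parallel jobs
#   remaining arguments = command to run; each page path is appended to it
run_per_page() {
    dir=$1
    njobs=$2
    shift 2
    # -print0 / -0 keep file names with spaces intact;
    # -n 1 passes one page per invocation, -P runs njobs invocations at once.
    find "$dir" -name 'page_*.png' -print0 | xargs -0 -n 1 -P "$njobs" "$@"
}

# Example (not executed here): OCR every page with one job per CPU core.
# `nproc` is a GNU coreutils tool; "$tmpdir" is assumed to be set.
# run_per_page "$tmpdir" "$(nproc)" sh -c 'tesseract "$1" "${1%.png}" pdf' sh
```

Because each page is an independent tesseract run, the per-page PDFs can still be merged afterwards in page order.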

For the time being I will keep your ticket open, as it might help some users to know that 3.03-rc1 supports single-page PDF generation from images.

Wikinaut commented 10 years ago

@fritz-hh thanks for your detailed and correct analysis. I also think that the OCRmyPDF framework offers more options, but it also requires more resources (dependencies).

The main purpose of my contribution is to point you and other potential users to the (basic, single-page) PDF support that is now integrated into Tesseract. The original c't article, the follow-up articles, and also the OCRmyPDF framework were and are silent about this fact. This is understandable, because the OCRmyPDF framework was developed earlier than Tesseract's new pdf option.

jbarlow83 commented 9 years ago

In OCRmyPDF v3.0-rc2 we're now taking advantage of Tesseract's improved (single page only) PDF output to improve the overall results of OCRmyPDF. Tesseract doesn't do everything we need.

Wikinaut commented 9 years ago

@jbarlow83 thanks for the info.

Wikinaut commented 9 years ago

Apparently, the Tesseract PDF rendering mode proposed in this issue (Tesseract versions > 3.02 can generate mixed-mode PDFs directly) can be selected by starting OCRmyPDF with the command-line option

--pdf-renderer tesseract

(introduced in OCRmyPDF version 3.0), as in the example

ocrmypdf -l deu --pdf-renderer tesseract infile.pdf outfile.pdf

See https://github.com/fritz-hh/OCRmyPDF/blob/master/RELEASE_NOTES.rst . @jbarlow83 Thanks for implementing this!

fritz-hh / OCRmyPDF