Wikinaut closed this issue 9 years ago
Hi Wikinaut,
Thank you for your message.
Tesseract delivered a 3.03-rc1 (i.e. release candidate 1) version on 4 February 2014 (https://code.google.com/p/tesseract-ocr/wiki/ReleaseNotes). According to the Tesseract team, this pre-release is for developers and testers only (see this thread: https://groups.google.com/d/msg/tesseract-ocr/er3ONslwbEE/IQozlErxz9sJ). For unclear reasons it has been shipped in some Linux distributions (e.g. Ubuntu) and is advertised as Tesseract 3.03 (even though it is 3.03-rc1).
Nevertheless, most Linux/Unix distributions do not yet have a Tesseract package that supports PDF generation.
Even after the next official version of Tesseract is released (and no release date has been announced yet), I believe that OCRmyPDF will be of strong interest to many users. Indeed, it provides many functions that (to my knowledge) won't be part of the next Tesseract release:
In the meantime I will keep your ticket open, as it may help some users to know that 3.03-rc1 supports single-page PDF generation from images.
@fritz-hh thanks for your detailed and correct analysis. I also think that the OCRmyPDF framework offers more options, but it also requires more resources (dependencies).
The main purpose of my contribution is to point you and other potential users to Tesseract's now-integrated (basic, single-page) PDF support. The original c't article, the follow-up articles, and also the OCRmyPDF framework were and are silent about this fact. That is understandable, because the OCRmyPDF framework was developed earlier than Tesseract's new PDF option.
In OCRmyPDF v3.0-rc2 we're now taking advantage of Tesseract's improved (single page only) PDF output to improve the overall results of OCRmyPDF. Tesseract doesn't do everything we need.
@jbarlow83 thanks for the info.
Apparently, the Tesseract PDF rendering mode proposed in the present issue (Tesseract versions > 3.02 can generate mixed-mode PDFs directly) can be achieved by starting OCRmyPDF with the command-line option
--pdf-renderer tesseract
(introduced in OCRmyPDF version 3.0), as in the example
ocrmypdf -l deu --pdf-renderer tesseract infile.pdf outfile.pdf
See https://github.com/fritz-hh/OCRmyPDF/blob/master/RELEASE_NOTES.rst . @jbarlow83 Thanks for implementing this!
[UPDATED: I removed the information about the tessedit_pdf_compression parameter, which has been recently removed in the HEAD branch version of tesseract.]
Thanks
Dear developers of OCRmyPDF, first of all, thank you for your impressive and great work.
News
In the meantime, after your publications in autumn 2013 and later in the German magazine "c't", the Tesseract developers integrated similar PDF output support into their code starting in February 2014, which, in my personal view, makes the OCRmyPDF framework obsolete in many (though not all) standard cases.
The new feature can currently only be used if you check out Tesseract from their source on Google Code (https://code.google.com/p/tesseract-ocr/ explains how).
Searchable PDF output is a standard feature as of Tesseract version 3.03
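For a single image, the call can be sketched roughly as follows. This assumes a Tesseract build of 3.03 or later that ships the `pdf` config file; the function name `image_to_searchable_pdf` and the file names are hypothetical and only serve the illustration.

```shell
# Sketch: turn one scanned image into a single-page searchable PDF.
# Assumes tesseract >= 3.03 (with the "pdf" config file) is on PATH.
# The function name is hypothetical, not part of Tesseract or OCRmyPDF.
image_to_searchable_pdf() {
    img=$1
    lang=${2:-eng}            # OCR language, defaults to English
    base=${img%.*}            # output base: image path without its extension
    # Tesseract appends ".pdf" to the output base when the pdf config is given.
    tesseract "$img" "$base" -l "$lang" pdf
}
```

Calling `image_to_searchable_pdf scan.png deu` should then leave `scan.pdf` next to the image.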
framework for multi-page pdfs
Tesseract cannot process multi-page PDFs as input.
Here is a framework example:
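One possible shape for such a framework is sketched below; it is an illustration under stated assumptions, not a tested tool. It assumes `pdftoppm` and `pdfunite` from poppler-utils are installed alongside Tesseract 3.03 or later, and the function name `ocr_multipage_pdf` is hypothetical. The idea is to split the input PDF into page images, OCR each page into a single-page PDF, and merge the results.

```shell
# Sketch: OCR a multi-page PDF by splitting, OCRing page by page, and merging.
# Assumes pdftoppm and pdfunite (poppler-utils) and tesseract >= 3.03 on PATH.
# The function name is hypothetical.
ocr_multipage_pdf() {
    infile=$1
    outfile=$2
    lang=${3:-eng}
    workdir=$(mktemp -d) || return 1

    # 1. Render every page as a 300 dpi PNG: page-1.png, page-2.png, ...
    #    (pdftoppm zero-pads the page numbers, so the glob below sorts in page order).
    pdftoppm -r 300 -png "$infile" "$workdir/page" || { rm -rf "$workdir"; return 1; }

    # 2. OCR each page image into a single-page searchable PDF.
    for img in "$workdir"/page-*.png; do
        tesseract "$img" "${img%.png}" -l "$lang" pdf || { rm -rf "$workdir"; return 1; }
    done

    # 3. Merge the single-page PDFs into the output file.
    pdfunite "$workdir"/page-*.pdf "$outfile"
    status=$?

    rm -rf "$workdir"
    return "$status"
}
```

A call would look like `ocr_multipage_pdf infile.pdf outfile.pdf deu`. Unlike OCRmyPDF, this sketch rasterizes every page and does no preprocessing such as deskewing or PDF/A conversion.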