This is an R interface to the tesseract OCR (Optical Character Recognition) system.
tesseract is available at https://code.google.com/p/tesseract-ocr/.
More recent versions are available on GitHub at https://github.com/tesseract-ocr/tesseract.
Installing tesseract requires first installing leptonica (http://www.leptonica.com/).
This is currently a basic interface to the essential functionality, with some added R functionality to visualize the results.
The interface to the other methods and classes in the tesseract API/library can be machine-generated.
Often we start with a scanned document that is already a single image. Provided leptonica was installed with support for that image format, we can read the image directly.
In many of our use cases, we start with a PDF document that consists of multiple scanned pages, each of which is a scanned image. Tesseract and leptonica do not read PDF files directly, so we first need to convert the PDF document into an image format. We use ImageMagick, and specifically its very general and powerful convert command, to convert between formats.
If we want to create a separate image for each page in the original PDF, we can use the script pdf2png in this package (inst/scripts/pdf2png). This hides some of the details of convert. (This can convert to JPEG and other formats, in spite of what the name suggests.)
pdf2png SMITHBURN_1952.pdf
This generates PNG files named SMITHBURN_1952_0000.png, ... We can specify the filename format.
We can also specify the density (the resolution, in dots per inch, at which each page is rasterized), the quality/level of compression, and any other command-line arguments that convert supports.
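As a rough sketch of what these options look like, the following calls convert directly rather than going through pdf2png, whose own command-line options are not shown here. The function name pdf_to_pngs, the default values, and the DRY_RUN variable are assumptions made for illustration and are not part of this package. The -density and -quality options and the %04d output pattern are standard ImageMagick features.

```shell
#!/bin/sh
# Sketch: split a PDF into one PNG per page using ImageMagick's convert.
# pdf_to_pngs is a hypothetical helper, not part of this package.
pdf_to_pngs() {
    pdf=$1
    density=${2:-300}   # resolution (dots per inch) used to rasterize each page
    quality=${3:-90}    # for PNG output this controls compression, not fidelity
    base=${pdf%.pdf}    # strip the .pdf extension

    # -density must precede the input file so it applies when the PDF is read.
    # %04d in the output name numbers the pages 0000, 0001, ...
    set -- convert -density "$density" "$pdf" -quality "$quality" "${base}_%04d.png"

    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"       # print the command instead of running it
    else
        "$@"
    fi
}
```

Setting DRY_RUN=1 before calling pdf_to_pngs SMITHBURN_1952.pdf prints the command without running it, which is a convenient way to check the arguments before processing a large document.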
Alternatively, we can convert the PDF document to a multi-page/image TIFF file, i.e. a single TIFF
file that contains multiple images. We then read this into R using the readMultipageTiff()
function and then access each page from the resulting list.
To convert a multipage PDF document to a multipage TIFF file, use, e.g.,
convert SMITHBURN_1952.pdf SMITHBURN_1952.tiff
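When there are many PDF documents, the same conversion can be applied in a loop. The sketch below is one possible way to do this, assuming all the PDFs sit in one directory. The function name pdfs_to_tiffs, the default density of 300, and the DRY_RUN variable are illustrative assumptions, not part of this package.

```shell
#!/bin/sh
# Sketch: convert every PDF in a directory to a multipage TIFF with the same base name.
# pdfs_to_tiffs is a hypothetical helper, not part of this package.
pdfs_to_tiffs() {
    dir=${1:-.}
    density=${2:-300}   # resolution (dots per inch) used to rasterize each page
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue     # the glob matched no files
        tiff=${pdf%.pdf}.tiff
        [ -e "$tiff" ] && continue    # already converted, so skip it
        if [ -n "${DRY_RUN:-}" ]; then
            # print the command instead of running it
            echo convert -density "$density" "$pdf" "$tiff"
        else
            convert -density "$density" "$pdf" "$tiff"
        fi
    done
}
```

Skipping PDFs that already have a TIFF alongside them makes the function safe to rerun after adding new documents to the directory.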
We (Matt Espe and Duncan Temple Lang) started developing this package in April 2015.