fritz-hh / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

MRC #88

Closed v217 closed 8 years ago

v217 commented 10 years ago

Hi, especially for scans, integration with jbig2enc for better compression of the text image layer would make this software perfect.

jbarlow83 commented 10 years ago

pdfbeads (a Ruby project) attempts to do that, although it has issues aligning the hidden OCR text layer with the image, has some crash bugs, and its documentation is mainly in Russian.

I've looked into making the changes for OCRmyPDF. It would be a major overhaul/rewrite and would call for a new PDF generation backend.

v217 commented 10 years ago

jbig2enc itself is quite stable, and it now also recognises the resolution of the images quite well. It also supports basic foreground/background separation. There is a one-page Python script for generating a multilayer PDF, written for an earlier version of jbig2enc; when that script was written, recognition of the PDF's resolution still did not work reliably. In short, JBIG2 is a must for scans, but on Linux this is still not available.

v217 commented 10 years ago

If one is willing to use more than one graphics library (leptonica, written in C, for jbig2enc, and gamera, written in Python, for didjvu's text foreground/background separation), all the ingredients are already there and well tested.