tesseract: add page image preprocessing stage: scripting? configurable?

GerHobbelt / W

tracking bugs, caveats, reminders and ramblings in and of my public clones/forks

BSD 3-Clause "New" or "Revised" License


tesseract (LSTM) expects a clean input image. A noisy input image (due to JPEG compression artifacts or otherwise) was found to be detrimental to OCR character and word recognition, more so than to the thresholding+segmentation stage, so we SHOULD be able to produce a "clean" (best effort) color or greyscale image as the input to the current "vanilla" tesseract process.

Hence we need a (scriptable? configurable? tweakable?) image cleaning preprocessor stage; something along the lines of unpaper, docstrum, PRlib, etc.

==> investigate further; may want to employ QuickJS as a simple scriptable "driver" of these image processing steps as we know these will need to be modified/adjusted/tweaked depending on the actual input, i.e. depending on what sort of (scanned pages carrying) PDF we are feeding the animal.
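To make the "configurable driver" idea concrete, here is a minimal sketch of what such a stage could look like. It is plain Python standing in for a QuickJS script, not tesseract code: the step names (`median3x3`, `stretch_contrast`), the `STEPS` registry, `run_pipeline` and the config shape are all hypothetical, and the two filters are just simple examples of cleanup steps, not what unpaper or the other tools actually do.

```python
from statistics import median
from typing import Callable, Dict, List

# Greyscale image as rows of 0..255 values (assumption: stand-in for a real bitmap type).
Image = List[List[int]]


def median3x3(img: Image) -> Image:
    """Replace each pixel with the median of its 3x3 neighbourhood (edges are clamped)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = int(median(window))
    return out


def stretch_contrast(img: Image, low: int = 0, high: int = 255) -> Image:
    """Linearly rescale pixel values so the darkest maps to `low` and the brightest to `high`."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in img]
    scale = (high - low) / (hi - lo)
    return [[int(round(low + (p - lo) * scale)) for p in row] for row in img]


# Hypothetical registry: the "script" only refers to steps by name.
STEPS: Dict[str, Callable[..., Image]] = {
    "median3x3": median3x3,
    "stretch_contrast": stretch_contrast,
}


def run_pipeline(img: Image, config: List[dict]) -> Image:
    """Apply the configured steps in order; every key except "op" is passed as a parameter."""
    for step in config:
        params = {k: v for k, v in step.items() if k != "op"}
        img = STEPS[step["op"]](img, **params)
    return img


# Example: one speck of "salt" noise on a flat grey patch, removed by the median step.
noisy = [[100, 100, 100], [100, 255, 100], [100, 100, 100]]
cleaned = run_pipeline(noisy, [{"op": "median3x3"}])
```

The point of the sketch is that the step list is data: a different class of input (say, heavily JPEG-compressed scans vs. clean born-digital pages) would get a different `config` without touching the driver.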


tesseract: add page image preprocessing stage: scripting? configurable? #7