GerHobbelt / W

tracking bugs, caveats, reminders and ramblings in and of my public clones/forks
BSD 3-Clause "New" or "Revised" License

tesseract: page add image preprocessing stage: scripting? configurable? #7

Open MrBonkers opened 1 year ago

MrBonkers commented 1 year ago

tesseract (LSTM) expects a clean input image. It was found that a noisy input image (due to JPEG compression artifacts or otherwise) hurts OCR character+word recognition, as opposed to the thresholding+segmentation stage, so we SHOULD be able to produce a "clean" (best effort) color or greyscale image as the input to the current "vanilla" tesseract process.

Hence we need a (scriptable? configurable? tweakable?) image cleaning preprocessor stage; something along the lines of unpaper, docstrum, PRlib, etc.

==> investigate further; may want to employ QuickJS as a simple scriptable "driver" of these image processing steps as we know these will need to be modified/adjusted/tweaked depending on the actual input, i.e. depending on what sort of (scanned pages carrying) PDF we are feeding the animal.

GerHobbelt commented 1 year ago

This is the idea: add a configuration parameter of type string to tesseract and have it contain either a URL pointing to a script file or a literal script itself.
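A minimal sketch of how such a string parameter could be told apart, assuming a URL is a single whitespace-free token starting with `file://`, `http://` or `https://` and anything else is a literal script. The function name `classify_script_param` and the accepted schemes are made up for illustration; nothing like this exists in tesseract today.

```python
import re
from typing import Tuple

# Assumption: only these schemes count as "a URL pointing to a script file".
_URL_PATTERN = re.compile(r"(?:file|https?)://\S+")


def classify_script_param(value: str) -> Tuple[str, str]:
    """Return ("url", location) or ("inline", script_source).

    Hypothetical helper: a value is treated as a URL only when the whole
    (trimmed) string is one whitespace-free token with a known scheme.
    Everything else is taken to be a literal script.
    """
    candidate = value.strip()
    if _URL_PATTERN.fullmatch(candidate):
        return ("url", candidate)
    return ("inline", value)


print(classify_script_param("file:///etc/tesseract/clean.js"))
print(classify_script_param("img = despeckle(img); img = deskew(img);"))
```

Requiring the whole value to match keeps a script that merely mentions a URL somewhere in its body from being mistaken for a script location.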

Sometimes you may want the script to return multiple results. This can be accomplished by using it in a generator style, which means that every run of the script produces a single result and you carry around a global state, which is persisted and passed to every script instantiation, so it can increment state counters or other variables it needs to track to produce the generator-like output sequence you desire.
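The generator style described above could look roughly like this. It is only a sketch under assumptions: the "script" is a plain Python callable standing in for a QuickJS script, the image is a flat list of grey values, and returning `None` is an invented convention for "no more results". The names `run_generator_style` and `threshold_variants` are hypothetical.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

State = Dict[str, Any]
Script = Callable[[List[int], State], Optional[List[int]]]


def run_generator_style(script: Script, image: List[int],
                        max_runs: int = 100) -> Iterator[List[int]]:
    """Invoke the script repeatedly, passing the same persisted state each time.

    Each invocation yields exactly one result; the loop stops when the
    script returns None (assumed end marker) or after max_runs invocations.
    """
    state: State = {}  # the "global state" that survives between runs
    for _ in range(max_runs):
        result = script(image, state)
        if result is None:
            return
        yield result


def threshold_variants(image: List[int], state: State) -> Optional[List[int]]:
    """Example script: one binarized variant per run, driven by a state counter."""
    levels = [96, 128, 160]
    index = state.get("counter", 0)
    if index >= len(levels):
        return None  # sequence exhausted
    state["counter"] = index + 1  # advance for the next invocation
    return [1 if pixel >= levels[index] else 0 for pixel in image]


variants = list(run_generator_style(threshold_variants, [90, 100, 130, 200]))
print(variants)
```

The script itself stays stateless between runs; everything it needs to remember lives in the state object the driver hands back to it.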

(Input dictated via Google Talk; despite various edits, still crappy)