[FEATURE] OCR - Incorrect Recognition

MorganGrundy commented 3 years ago

Is your feature request related to a problem? Please describe.
The OCR sometimes recognises text incorrectly.

Describe the solution you'd like
Give Tesseract a dictionary. During post-processing, the user can correct any remaining incorrect recognition.

Additional context
Tesseract apparently has a default dictionary, but it doesn't seem to force recognised words to match it.

MorganGrundy commented 3 years ago

8ec7a4ca0dc485097c029504819bf48000312123 I cannot seem to get user-words or user-patterns to work. From my research, a lot of people are having similar problems. Both apparently worked in the original Tesseract engine but are no longer supported in LSTM mode. Most discussions about the problem are closed with links to the Tesseract documentation on how to use user-words and user-patterns, but following it still doesn't work.

MorganGrundy commented 3 years ago

Since I cannot get the Tesseract dictionary to work, I will instead use my own dictionary correction algorithm.

Load a dictionary from a file. Compare each OCR result with the dictionary, calculating a confidence score for each dictionary word. If the word with the highest confidence score exceeds a threshold, auto-correct the OCR result to that word. When multiple words share the highest confidence, choose one at random and offer the alternatives to the user in the review stage. If no word exceeds the confidence threshold, notify the user in the review stage to check the result (maybe offering a list of the highest-confidence corrections). The review stage should also allow the user to add words to the dictionary.
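A rough sketch of how that correction step could look, written in Python for brevity rather than in the project's own code. Everything named here (`load_dictionary`, `correct_word`, `Correction`, the 0.8 default threshold) is hypothetical, and the confidence score is assumed to be the similarity ratio from Python's `difflib.SequenceMatcher`; the comment above does not say how the score should be calculated, so any other string-similarity measure could be swapped in.

```python
import random
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Correction:
    """Outcome of checking one OCR word against the dictionary (hypothetical type)."""
    original: str
    corrected: str
    confidence: float
    alternatives: list = field(default_factory=list)
    needs_review: bool = False


def load_dictionary(path):
    """Read one dictionary word per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def score(ocr_word, candidate):
    """Assumed confidence score: case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, ocr_word.lower(), candidate.lower()).ratio()


def correct_word(ocr_word, dictionary, threshold=0.8, rng=random):
    """Pick the best dictionary match, or flag the word for user review."""
    if not dictionary:
        return Correction(ocr_word, ocr_word, 0.0, needs_review=True)

    scored = [(score(ocr_word, word), word) for word in dictionary]
    best = max(value for value, _ in scored)
    tied = [word for value, word in scored if value == best]

    if best <= threshold:
        # Nothing exceeds the threshold: keep the OCR text, ask the user to
        # review it, and offer the highest-scoring words as suggestions.
        return Correction(ocr_word, ocr_word, best, alternatives=tied, needs_review=True)

    # Auto-correct. On a tie, choose one word at random and keep the rest
    # so the review stage can offer them as alternatives.
    chosen = rng.choice(tied)
    others = [word for word in tied if word != chosen]
    return Correction(ocr_word, chosen, best, alternatives=others, needs_review=bool(others))
```

For example, `correct_word("burqer", ["burger", "butter", "salad"])` would auto-correct to `burger`, while a word with no close match comes back unchanged with `needs_review` set. Adding words during review would then just mean appending to the list and writing it back to the dictionary file.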

MorganGrundy / MenuParseroo
