I work with tesseract 3.02.02 on SUSE Linux 13.2.
The text to be OCR'd is real printed text from about 1930.
The printing is a little dirty, i.e. there are small dots and strokes between
the letters.
Although these are far smaller than the real letters, they are interpreted as
normal letters.
The normal letters are recognized fairly well.
As an example:
the attached picture is translated to the text
15 Ellser Exdmsund Mögsgzerg
Is there a way to pass parameters to tesseract so that it
. either ignores letters which do not fit the majority of the other
letters,
. or only uses letters within a given range of size,
. or first makes the boxes,
then lets me correct the boxes, by hand or by program,
and finally translates using the corrected boxes?
I have already tried a config file containing:
textord_min_xheight 26
textord_xheight_mode_fraction 0.9
textord_xheight_error_margin 0.1
textord_descx_ratio_min 0.3
textord_descx_ratio_max 0.6
textord_ascx_ratio_min 1.3
textord_ascx_ratio_max 1.7
load_system_dawg F
load_freq_dawg F
It changes some things, but nothing makes it ignore the dots and strokes.
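One workaround I could imagine is to erase the small specks from the image before OCR, outside of tesseract. Below is a minimal sketch in Python, assuming the page has already been binarized into rows of 0/1 pixels (1 = ink). The function name `remove_small_components` and the `min_area` threshold are made up for illustration and are not tesseract options.

```python
from collections import deque


def remove_small_components(image, min_area):
    """Return a copy of a binary image (list of rows, 1 = ink, 0 = paper)
    in which every 4-connected ink component with fewer than min_area
    pixels has been erased. The input image is not modified."""
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Components below the area threshold are treated as dirt.
            if len(component) < min_area:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result
```

The drawback is that legitimate small marks (the dot on an i, the dots of an umlaut, periods) would be erased as well if they fall below `min_area`, so the threshold would have to be chosen carefully for each scan.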
I also tried to make the boxes, correct them by erasing the false letters,
and then translate with these boxes by using a config file containing:
tessedit_make_boxes_from_boxes T
but this doesn't do what I want.
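To illustrate what I mean by correcting the boxes by program, here is a rough sketch in Python. It assumes the box file has one line per symbol in the form `symbol left bottom right top page`, with the origin at the bottom-left of the image. The function names and the `min_fraction` value of 0.5 are my own choices for illustration, not anything tesseract provides.

```python
from statistics import median


def parse_box_line(line):
    """Split one box-file line into (symbol, left, bottom, right, top, page)."""
    # Split from the right so the symbol field is kept intact.
    parts = line.rsplit(" ", 5)
    symbol = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return symbol, left, bottom, right, top, page


def filter_small_boxes(lines, min_fraction=0.5):
    """Keep only the box lines whose height is at least min_fraction
    of the median box height; blank lines are dropped."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if not cleaned:
        return []
    heights = []
    for line in cleaned:
        _, _, bottom, _, top, _ = parse_box_line(line)
        heights.append(top - bottom)
    limit = min_fraction * median(heights)
    return [line for line, h in zip(cleaned, heights) if h >= limit]
```

For example, with boxes of height 40, 40, 3 and 26 the median is 33, so the limit is 16.5 and only the 3-pixel-high box is dropped. Real punctuation such as periods and commas would be dropped by the same rule.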
Is there a possibility to accomplish this?
A solution with a dictionary is not possible, because the text consists only of
names of persons and locations.
Another thing I wonder about:
when I OCR an image from .tiff to .txt
and run makebox on the same image,
a few letters are recognized differently!
Thanks in advance for any help.
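To pin down exactly which letters differ, the two outputs could be compared with a small script. This is only a sketch using Python's standard `difflib`. It assumes the box file uses the `symbol left bottom right top page` line format and ignores all whitespace in the .txt output. The function name `differing_spans` is hypothetical.

```python
import difflib


def box_symbols(box_lines):
    """Concatenate the symbol field of every non-blank box-file line."""
    return "".join(line.rsplit(" ", 5)[0] for line in box_lines if line.strip())


def text_symbols(text):
    """Drop all whitespace from the plain-text OCR output."""
    return "".join(text.split())


def differing_spans(text, box_lines):
    """Return (text_part, box_part) pairs where the two outputs disagree."""
    a = text_symbols(text)
    b = box_symbols(box_lines)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

An empty result would mean both runs produced the same sequence of symbols.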
Original issue reported on code.google.com by pj...@aon.at on 19 Apr 2015 at 12:54