Open GoogleCodeExporter opened 9 years ago
Here is an example of the OCR output with my test language
after merging the two boxes into one and retraining.
I then ran the OCR on the training image itself,
and it screwed up an easy job, outputting "db" instead of just "d".
Where did it manufacture the extra "b" from? It's a hallucination,
because it sure as heck is not in the original graphic.
Original comment by g...@folkplanet.com
on 25 Apr 2012 at 6:51
Here is the original test graphic that I made
when I was trying to figure out why Tess makes
random capitalization choices on the output letters.
I later removed whatever was not needed to illustrate
the problem. Notice that the graphic contains
several other identical instances of capital "D",
but these are boxed perfectly, without splitting,
as desired.
Original comment by g...@folkplanet.com
on 25 Apr 2012 at 6:56
In case it's not obvious, the "font" that I am trying to
train Tess 3.01 on is one whose upper-case
letters are all identical in shape to the lower-case letters,
differing only in size, plus perhaps some incidental differences
that come from ink/paper/scanning.
Original comment by g...@folkplanet.com
on 25 Apr 2012 at 7:06
Turns out that Tess hates training data that looks like this:
AAAAA
AAAAA aaaaa
But this seems to work well:
Aaaaa aaaaa aaaaa. (only one capital, at the beginning of the line)
Aaaaa Aaaaa aaaaa. (if a word has a capital, it must be the only one, at the beginning of the word)
If multiple capitals appear in a single word in the training data,
it seems to really mess Tess up, and the user will get random/incorrect
capitalization in the output.
This definitely prevents Tess from using real scans as training input generally.
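For anyone who wants to screen training text for the pattern described above, here is a minimal Python sketch. It assumes the training text is available as a plain string, and the function names are hypothetical (they are not part of Tesseract). It only flags words that contain a capital anywhere after the first letter; whether avoiding such words is enough to fix the capitalization problem is not verified here.

```python
import re


def words_with_inner_capitals(line):
    """Return the words in `line` that have an upper-case letter
    anywhere after their first character (hypothetical helper)."""
    flagged = []
    # [^\W\d_]+ matches runs of letters only (no digits, no underscore).
    for word in re.findall(r"[^\W\d_]+", line):
        if any(ch.isupper() for ch in word[1:]):
            flagged.append(word)
    return flagged


def check_training_text(text):
    """Map 1-based line numbers to the offending words on that line.
    Lines without offending words are left out of the result."""
    report = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        flagged = words_with_inner_capitals(line)
        if flagged:
            report[lineno] = flagged
    return report


if __name__ == "__main__":
    sample = "AAAAA\nAAAAA aaaaa\nAaaaa aaaaa aaaaa.\nAaaaa Aaaaa aaaaa."
    for lineno, words in check_training_text(sample).items():
        print(f"line {lineno}: {', '.join(words)}")
```

On the four sample lines from this comment, only the first two are flagged, which matches the two cases that gave trouble.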
Original comment by g...@folkplanet.com
on 18 May 2012 at 5:25
Original issue reported on code.google.com by
g...@folkplanet.com
on 25 Apr 2012 at 6:43