Tess3.01 box around d incorrect - and just one of many similar problems with the boxing.

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. take the given image (bad4.tif) and run makebox
 tesseract bad4.tif bad4 batch.nochop makebox

2. Check the output: it has two boxes for a character that should have one.
I usually use jTessBoxEditor to look at them.

3. It should show just one box for the capital "D"
character, not split it into "T)".

What is the expected output? What do you see instead?

It should make just one character there, with one box,
not two. There is no good reason to split it.
Apparently Tess thinks that this is a "T)".

However, if I paste copies of the same "D" character
in other locations, those copies are interpreted correctly,
even though they consist of exactly the same pixels.

If I merge those two boxes with jTessBoxEditor
and then run a Tess training, it doesn't really
work: when I try to recognize the training
data with my new "language", it still outputs
two characters instead of one.
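
For anyone who wants to do the merge without a GUI, here is a minimal sketch of the same operation in Python. It assumes the usual box file line layout of "symbol left bottom right top page"; the helper names and the coordinates in the usage note are made up for illustration and are not taken from bad4.tif.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Assumed layout: "<symbol> <left> <bottom> <right> <top> <page>"
    parts = line.split()
    left, bottom, right, top, page = (int(p) for p in parts[1:6])
    return Box(parts[0], left, bottom, right, top, page)


def merge_boxes(a: Box, b: Box, symbol: str) -> Box:
    # The merged box is the smallest rectangle covering both inputs.
    return Box(
        symbol,
        min(a.left, b.left),
        min(a.bottom, b.bottom),
        max(a.right, b.right),
        max(a.top, b.top),
        a.page,
    )


def format_box(box: Box) -> str:
    return f"{box.symbol} {box.left} {box.bottom} {box.right} {box.top} {box.page}"
```

For example, merging the hypothetical lines "T 10 20 18 40 0" and ") 19 20 30 40 0" under the symbol "D" gives a single box spanning both. This only fixes the box file; it does not change how Tess segments the image at recognition time, which is the problem described here.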

Apparently what is happening is that Tess merrily
makes the boxes first and only afterwards asks which
language this is. Only at the end, after making two
boxes where there should be one, does it try to
recognize which characters are actually in my set.
Since my simple training set did not contain any ")" or "T",
it output "Db" instead of just "D".

You can get an idea of what's happening like this:

 tesseract bad4.tif bad4 -l mynewlanguage batch.nochop makebox

Look at the boxes: they are split and contain the wrong text,
as mentioned above. These incorrect boxes match the regular
OCR text output described above.

What version of the product are you using? On what operating system?

Tess3.01 Windows 7

Please provide any additional information below.

I have this fantasy that once you fix the bugs,
not only will Tess work, but it will be more accurate 
than ever.

Original issue reported on code.google.com by g...@folkplanet.com on 25 Apr 2012 at 6:43

GoogleCodeExporter commented 9 years ago

Here is an example of the ocr output with my test language
after merging the two boxes into one and training.
Then I run the ocr on the training image itself
and it screws up an easy job, outputting "db" instead of just "d".
Where did it manufacture the extra "b" from?  It's a hallucination
because it sure as heck is not on the original graphic.

Original comment by g...@folkplanet.com on 25 Apr 2012 at 6:51

Attachments:

bad4.ocr.txt

GoogleCodeExporter commented 9 years ago

Here is the original test graphic that I made
when I was trying to figure out why it makes
random capitalization choices on the output letters.
I later removed what was superfluous to illustrate
the problem. Notice that the graphic contains
several other identical instances of the capital "D",
and these are boxed perfectly, without splitting,
as desired.

Original comment by g...@folkplanet.com on 25 Apr 2012 at 6:56

GoogleCodeExporter commented 9 years ago

In case it's not obvious, the "font" that I am trying to
train Tess3.01 on is one whose upper-case
letters are identical in shape to the lower-case letters,
differing only in size and perhaps in some incidental details
that come from ink/paper/scanning.

Original comment by g...@folkplanet.com on 25 Apr 2012 at 7:06

GoogleCodeExporter commented 9 years ago

Turns out that Tess hates anything in its training data that looks like this:
AAAAA 
AAAAA aaaaa

But this seems to work well:
Aaaaa aaaaa aaaaa.   (Only one capital at the beginning of the line)
Aaaaa Aaaaa aaaaa.   (If a word has a capital, it must be only one, at the beginning)

If multiple capitals appear in a single word in the training data,
it seems to really mess Tess up, and the user will get random/incorrect
capitalization in the output. 

This definitely prevents Tess from using real scans as training input generally.
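
As a workaround, the training text can be screened before use. Here is a rough Python sketch that flags any word with a capital letter anywhere other than the first position. The function name is made up, and the rule it encodes is only the pattern observed above, not documented Tess behavior.

```python
def words_breaking_capital_rule(text):
    """Return (line_number, word) for each word with a capital after its first letter."""
    flagged = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in line.split():
            # Ignore punctuation and digits so "aaaaa." is treated as "aaaaa".
            letters = [ch for ch in word if ch.isalpha()]
            if any(ch.isupper() for ch in letters[1:]):
                flagged.append((line_no, word))
    return flagged
```

Run on the samples above, it flags the "AAAAA" words and passes the "Aaaaa Aaaaa aaaaa." line.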

Original comment by g...@folkplanet.com on 18 May 2012 at 5:25
