dlareklami / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Tess3.01 recognizes letters fairly well but the capitalization is completely random. #691

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Train with letters pasted carefully from a good scan.
2. Watch Tess fail to distinguish upper and lower case,
outputting them at random, and with no consistency.

What is the expected output? What do you see instead?

 Using jTessBoxEditor and a font somewhat similar to mine
 I have gotten Tess3.01 to work, so it's very frustrating
 that I can't get my training images to work.

What version of the product are you using? On what operating system?
 Tess 3.01 on Windows.

Please provide any additional information below.

I have training data at 600 dpi, from my scans of the same. It doesn't work.
I have tried 300dpi versions, and it made no difference.
I have tried removing all diacriticals, and it made no difference.
I have tried supplying 3 additional pages of training material, and it 
made no difference.
I have tried various psm options, and it made no difference.
I tried training directly off a scanned page that had been carefully cleaned 
up, in case my hand-pasted training letters were not perfect 
in vertical alignment or horizontal spacing evenness. It did not help.

I have seen dozens of cases of freaky, unexplained box/blob-finding behavior by 
Tess3.01: missing blobs and complaining about it later, when 
they were absolutely solidly there; insisting on splitting characters
incorrectly, but only sometimes. Other instances of the same character,
absolutely identical by cut and paste, would be OK.

Similar to its inability to distinguish upper and lower case,
as if it has lost all sense of scale, it would find tiny dots
(really small, often just 1 pixel) on the scan and 
interpret them as periods,
even though the real period on my 600 dpi monochrome images
occupies an area at least 25x larger. 
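
The scale problem described above suggests one possible workaround: erasing specks from the monochrome image before it reaches the OCR engine. The following is only a minimal sketch under stated assumptions. The image is assumed to be a plain list of rows of 0/1 values (1 = ink), and `remove_specks` and `min_area` are hypothetical names, not part of Tesseract. It drops every 4-connected blob smaller than a chosen pixel area.

```python
from collections import deque


def remove_specks(image, min_area):
    """Return a copy of a binary image with small ink blobs erased.

    `image` is a list of rows of 0/1 values (1 = ink). Any 4-connected
    component with fewer than `min_area` pixels is set to 0.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one connected component, collecting its pixels.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Erase the component if it is too small to be a real glyph.
            if len(component) < min_area:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result
```

Since the real period is reported to cover at least 25 times the area of the stray dots, a `min_area` somewhere between the two sizes would be a natural starting point, though the right value depends on the scan.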

It's also having trouble telling a comma from an apostrophe,
as if it had found the right shape but had lost all
context for where it was placed on the line.

Original issue reported on code.google.com by g...@folkplanet.com on 25 Apr 2012 at 4:08

GoogleCodeExporter commented 9 years ago
You could save a lot of words if you attached your picture(s) and the exact 
commands you used. Developers cannot test your description, only images.

Original comment by zde...@gmail.com on 25 Apr 2012 at 6:53

GoogleCodeExporter commented 9 years ago
I don't actually know why Tess does this, but Tess 2.04 does it too.

I have fixed the problem by using training data of this format,
being careful to try to have only one capital letter per line,
making the lines not too long and not too short,
using natural-looking punctuation.
If more than one capital letter is used in the line,
it must be at the beginning of a word and nowhere else.
The first two kinds of lines below will screw up the training; the last two work:
AAAAAAAA              (all caps causes upper/lower-case failures in OCR output)
AAAAAAAA aaaaaaa      (runs of more than one cap in a word create the same failure)
Aaaaaa aaaaa aaaaa    (works OK - I have had one good success with this approach)
Aaaaaa Aaaaa          (seems to work)
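
The rule described above (a capital letter only at the beginning of a word) can be checked mechanically before training. This is a minimal sketch in Python. The function name is hypothetical and is not part of Tesseract or jTessBoxEditor. It only encodes the rule as stated in this comment, not anything known about Tesseract's internals.

```python
def capitals_only_at_word_start(line: str) -> bool:
    """Return True if every uppercase letter in `line` is the first letter of its word."""
    for word in line.split():
        letters_seen = 0
        for ch in word:
            if not ch.isalpha():
                # Ignore punctuation and digits, e.g. an opening quote.
                continue
            if ch.isupper() and letters_seen > 0:
                # A capital in the middle or at the end of a word breaks the rule.
                return False
            letters_seen += 1
    return True


def find_bad_lines(lines):
    """Return the training lines that break the capitalization rule."""
    return [line for line in lines if not capitals_only_at_word_start(line)]
```

Running `find_bad_lines` over the text used to build the training image would flag lines such as `AAAAAAAA` so they can be rewritten before the image is generated.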

Original comment by g...@folkplanet.com on 18 May 2012 at 5:18

GoogleCodeExporter commented 9 years ago
The issue is that when Tesseract interprets a dot as a period, it seems to have 
a rule that overrides the case of the next input letter it gets to, 
capitalizing it even when the letter is unambiguously lower case in the source 
image.

Tesseract makes the wrong decision.  In input text where the distinction 
between . and , is ambiguous, it stupidly* overrides the case of the subsequent 
letter.

Fix it so that it takes the hint from the unambiguously lower case following 
letter that the punctuation mark is in fact a comma, not a period, although it 
may look like a period.  

*Stupid by the following rationale:
1.  We do not need the OCR engine to take the place of an editor.  The source 
texts are well formed and I do not need any logic in the engine to change the 
case of letters because it thinks it saw a period and the developers think they 
are more clever than professional editors and proofreaders.

2.  The visual information contained in . and , is much less than information 
content of e and E, and the difference between the two elements of the former 
set is proportionally less than that of the latter.  Therefore, when we can 
deduce the identity of one from the other, it is much smarter to take the clue 
from what we see in the latter set.  It is good to use the rules of grammar and 
style to our advantage, but the developers have applied this rule bass-ackwards.

Original comment by silentpl...@gmail.com on 16 Sep 2012 at 10:59