Open GoogleCodeExporter opened 9 years ago
You could save a lot of words if you attached your picture(s) and the exact commands
you used. Developers cannot test a description, only images...
Original comment by zde...@gmail.com
on 25 Apr 2012 at 6:53
I don't actually know why Tess does this, but Tess 2.04 does it too.
I have fixed the problem by using training data of this format,
being careful to try to have only one capital letter per line,
making the lines not too long and not too short,
and using natural-looking punctuation.
If more than one capital letter is used in the line,
it must be at the beginning of a word and nowhere else.
The first two kinds of lines below will screw up the training; the last two are fine:
AAAAAAAA (all caps causes incorrect upper/lower-case failures in OCR output)
AAAAAAAA aaaaaaa (runs of more than one cap in a word create the same failure)
Aaaaaa aaaaa aaaaa (works OK; I have had one good success with this approach)
Aaaaaa Aaaaa (seems to work)
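For anyone who wants to screen training text against the rules above before training, here is a minimal sketch in Python. It is not part of Tesseract; the function name line_problems and the length bounds are assumptions made up for illustration, since the rules only say "not too long and not too short".

```python
import re

# Hypothetical length bounds; the rules above give no exact numbers.
MIN_CHARS = 20
MAX_CHARS = 80

def line_problems(line, min_chars=MIN_CHARS, max_chars=MAX_CHARS):
    """Return a list of reasons a training line may break the rules above."""
    problems = []
    text = line.strip()
    if not (min_chars <= len(text) <= max_chars):
        problems.append("length")
    # Split into runs of letters so punctuation and digits are ignored.
    for word in re.findall(r"[^\W\d_]+", text):
        # A capital anywhere except the first letter of a word is flagged,
        # which also catches all-caps words.
        if any(ch.isupper() for ch in word[1:]):
            problems.append("capital inside word: " + word)
    return problems

if __name__ == "__main__":
    for sample in ["AAAAAAAA aaaaaaa", "Aaaaaa aaaaa aaaaa aaaaa"]:
        print(repr(sample), line_problems(sample))
```

It does not enforce the softer "try to have only one capital letter per line" preference, only the hard rule that capitals appear at the start of a word and nowhere else.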
Original comment by g...@folkplanet.com
on 18 May 2012 at 5:18
The issue is that when Tesseract interprets a dot as a period, it seems to have
a rule that overrides the case of the next input letter it gets to,
capitalizing it even when the letter is unambiguously lower case in the source
image.
Tesseract makes the wrong decision. In input text where the distinction
between . and , is ambiguous, it stupidly* overrides the case of the subsequent
letter.
Fix it so that it takes the hint from the unambiguously lower case following
letter that the punctuation mark is in fact a comma, not a period, although it
may look like a period.
*Stupid by the following rationale:
1. We do not need the OCR engine to take the place of an editor. The source
texts are well formed and I do not need any logic in the engine to change the
case of letters because it thinks it saw a period and the developers think they
are more clever than professional editors and proofreaders.
2. The visual information contained in . and , is much less than the information
content of e and E, and the difference between the two elements of the former
set is proportionally less than that of the latter. Therefore, when we can
deduce the identity of one from the other, it is much smarter to take the clue
from what we see in the latter set. It is good to use the rules of grammar and
style to our advantage, but the developers have applied this rule bass-ackwards.
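To make the argument in point 2 concrete, here is a small sketch in Python of the kind of decision being asked for. This is not how Tesseract works internally; the function resolve, the score dictionaries, and the mismatch_penalty value are all hypothetical. The idea is to score the punctuation mark and the following letter jointly, so that strong evidence for a lower-case letter can outvote weak evidence for a period.

```python
def resolve(punct_scores, letter_scores, mismatch_penalty=0.1):
    """Pick the (punctuation, letter) pair with the highest joint score.

    punct_scores:  hypothetical classifier scores, e.g. {".": 0.55, ",": 0.45}
    letter_scores: hypothetical classifier scores, e.g. {"e": 0.98, "E": 0.02}
    mismatch_penalty: assumed factor applied when a period is followed by a
        lower-case letter, which breaks the usual sentence-case convention.
    """
    best_pair, best_score = None, -1.0
    for punct, p_score in punct_scores.items():
        for letter, l_score in letter_scores.items():
            score = p_score * l_score
            if punct == "." and letter.islower():
                score *= mismatch_penalty
            if score > best_score:
                best_pair, best_score = (punct, letter), score
    return best_pair

if __name__ == "__main__":
    # Ambiguous dot, unambiguous lower-case letter.
    print(resolve({".": 0.55, ",": 0.45}, {"e": 0.98, "E": 0.02}))
```

With these made-up scores the comma wins even though the period scored slightly higher on its own, because the letter evidence is much stronger. The grammar rule is still used, but it changes the reading of the punctuation mark instead of the case of the letter.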
Original comment by silentpl...@gmail.com
on 16 Sep 2012 at 10:59
Original issue reported on code.google.com by
g...@folkplanet.com
on 25 Apr 2012 at 4:08