Tesseract (3.01) simply skips some letters during box making but complains of unlabelled words later.

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Run tesseract on the enclosed tif (tam.TAB_Kamban_Italic.exp00.tif) to create the 
box file (tam.TAB_Kamban_Italic.exp00.box).
2. Train using the box file.

What is the expected output? What do you see instead?
1) when creating the box file, one expects all letters (blobs) to be boxed in; 
2) when training,  if the "originally not created" boxes are not edited, 
tesseract is not expected to complain about unlabelled words.

====================
C:\indicocr\tesseract301>tesseract tam.TAB_Kamban_Italic.exp00.tif tam.TAB_Kamban_Italic.exp00 nobatch box.train
Tesseract Open Source OCR Engine v3.01 with Leptonica
Page 0
APPLY_BOXES:
   Boxes read from boxfile:    3553
   Boxes failed resegmentation:       0
APPLY_BOXES: Unlabelled word at :Bounding box=(239,3113)->(396,3153)
APPLY_BOXES: Unlabelled word at :Bounding box=(937,3116)->(1066,3151)
APPLY_BOXES: Unlabelled word at :Bounding box=(1229,3114)->(1360,3151)
APPLY_BOXES: Unlabelled word at :Bounding box=(1380,3114)->(1518,3151)
APPLY_BOXES: Unlabelled word at :Bounding box=(1545,3116)->(1688,3151)
APPLY_BOXES: Unlabelled word at :Bounding box=(239,2661)->(396,2701)
APPLY_BOXES: Unlabelled word at :Bounding box=(937,2664)->(1066,2699)
APPLY_BOXES: Unlabelled word at :Bounding box=(1229,2662)->(1360,2699)
APPLY_BOXES: Unlabelled word at :Bounding box=(1380,2662)->(1518,2699)
APPLY_BOXES: Unlabelled word at :Bounding box=(1545,2664)->(1688,2699)
APPLY_BOXES: Unlabelled word at :Bounding box=(607,2120)->(702,2156)
APPLY_BOXES: Unlabelled word at :Bounding box=(1443,2120)->(1538,2156)
APPLY_BOXES: Unlabelled word at :Bounding box=(444,864)->(505,890)
APPLY_BOXES: Unlabelled word at :Bounding box=(576,493)->(707,530)
APPLY_BOXES: Unlabelled word at :Bounding box=(2094,493)->(2209,530)
APPLY_BOXES: Unlabelled word at :Bounding box=(611,311)->(678,349)
APPLY_BOXES: Unlabelled word at :Bounding box=(1969,311)->(2035,349)
   Found 3553 good blobs and 0 unlabelled blobs in 0 words.
   17 remaining unlabelled words deleted.
TRAINING ... Font name = TAB_Kamban_Italic
Generated training data for 675 words
================================= 

This means that tesseract, during the training pass, correctly detects that these 
blobs exist but are not labelled; yet during the "makebox" pass it does not read them?

What version of the product are you using? On what operating system?
Tesseract 3.01 (windows)

Please provide any additional information below.
This has been my experience since I started using Tesseract 2.04 and 3.01 for 
Tamil training. I have to use a box editor to create boxes around the skipped 
letters (as mentioned in ISSUE 664: it is not that Tesseract reads the 
blobs wrongly, but that it skips them as if these letters did not exist).
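
A rough sketch (not from the original report) of how the skipped regions could be located without hunting for them by eye: it pulls the "Unlabelled word" rectangles out of the training log and checks which of them contain no entry from the box file. It assumes the log rectangles are printed as (left,bottom)->(right,top) in the same bottom-left-origin pixel coordinates as the box file, and that each box line is "glyph left bottom right top page". The function names are made up for illustration.

```python
import re

# Matches log lines such as:
# APPLY_BOXES: Unlabelled word at :Bounding box=(239,3113)->(396,3153)
UNLABELLED = re.compile(
    r"Unlabelled word at :Bounding box=\((\d+),(\d+)\)->\((\d+),(\d+)\)"
)


def unlabelled_regions(log_text):
    """Return a (left, bottom, right, top) tuple for every unlabelled word."""
    return [tuple(int(v) for v in m.groups()) for m in UNLABELLED.finditer(log_text)]


def parse_box_line(line):
    """Split one box-file line into (glyph, left, bottom, right, top, page)."""
    glyph, *nums = line.strip().rsplit(None, 5)
    return (glyph, *map(int, nums))


def boxes_inside(region, boxes):
    """Return the boxes whose rectangle overlaps the given region."""
    rl, rb, rr, rt = region
    return [b for b in boxes if b[1] < rr and b[3] > rl and b[2] < rt and b[4] > rb]


if __name__ == "__main__":
    sample_log = (
        "APPLY_BOXES: Unlabelled word at :Bounding box=(239,3113)->(396,3153)\n"
        "APPLY_BOXES: Unlabelled word at :Bounding box=(937,3116)->(1066,3151)\n"
    )
    # Hypothetical box entry, only here to make the sketch runnable.
    sample_boxes = [parse_box_line("x 250 3115 270 3150 0")]
    for region in unlabelled_regions(sample_log):
        hits = boxes_inside(region, sample_boxes)
        print(region, "->", len(hits), "box(es) in the box file")
```

Regions that report zero boxes are the places where a box editor would have to add the missing letters by hand.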

================
aside:
At times, when new boxes are created during training, I get the following:

Tesseract Open Source OCR Engine v3.01 with Leptonica
Page 0
APPLY_BOXES: boxfile line 2835/க்ஷ ((2307,962),(2325,976)): FAILURE! Couldn't find a matching blob

This does not happen every time, but it does happen when punctuation marks were 
missed originally, mostly at the ends of lines. However, that is another 
issue, and I have compiled a bundle of edited "Couldn't find a matching blob" 
images so that the code team can use them to find the reason.
===========

So what is the real reason for tesseract missing the letters altogether? The 
image is computer generated (not a scanned hard copy, but made from an odt file 
converted to pdf and then to images; the resolution is set at 300 dpi and it is a 2bit image).

Any guidelines for avoiding this would be useful and time-saving, as I have 
trained Tamil with basic characters for 32 fonts and am now in the process of 
using multipage training to improve recognition.

regards
neelakantan
(as tif files are distrusted, i have zipped it)

Original issue reported on code.google.com by rnkan...@gmail.com on 23 Apr 2012 at 10:34

Attachments:

GoogleCodeExporter commented 9 years ago

I get this problem too.
All the time. 
It's ridiculous.

Original comment by g...@folkplanet.com on 25 Apr 2012 at 3:46

GoogleCodeExporter commented 9 years ago

@galt@folkplanet.com: what is ridiculous is complaining without providing 
examples/tests. 
There could be different reasons why tesseract is complaining.

Original comment by zde...@gmail.com on 25 Apr 2012 at 6:58

GoogleCodeExporter commented 9 years ago

hi
in continuation of the above problem, where Tesseract skips reading some text, 
please find enclosed the following files: tam.TAMKambanWide.exp00.png and 
tam.TAMKambanWide.exp00.box.orig and tam.TAMKambanWide.exp00.box

As the name suggests, the font is a "wide" font. When the box file is created 
(the orig box file: tam.TAMKambanWide.exp00.box.orig), the boxes start from the 
middle (in fact from the + sign) and not from the top left of the file 
(see below: it should start from ா 239 3285 258 3302 0):
+ 613 3237 631 3249 0
அ 721 3223 737 3261 0
க்ஷு 827 3223 843 3262 0
/ 881 3228 913 3263 0
^ 1253 3232 1281 3255 0
* 1266 3285 1277 3309 0
ு 1355 3232 1384 3255 0
ம 1354 3285 1376 3309 0
` 1418 3232 1431 3256 0
@ 1413 3285 1459 3317 0
- 1475 3232 1478 3234 0
0 1465 3237 1487 3256 0
ா 239 3285 258 3302 0
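
One way to check the ordering is to regroup the boxes into text rows and compare that reading order with the order in the file. The sketch below is only illustrative and is not part of Tesseract: it assumes box lines of the form "glyph left bottom right top page" with the origin at the bottom left, and the 20-pixel row tolerance is a guess that would need tuning for other fonts and sizes.

```python
def parse_box_line(line):
    """Split one box-file line into (glyph, left, bottom, right, top, page)."""
    glyph, *nums = line.strip().rsplit(None, 5)
    return (glyph, *map(int, nums))


def reading_order_rows(boxes, tolerance=20):
    """Group boxes into text rows (top row first), each sorted left to right.

    A box joins the current row when its vertical centre lies within
    `tolerance` pixels of the centre of the first box placed in that row.
    """
    def centre_y(box):
        return (box[2] + box[4]) / 2

    rows = []
    # With a bottom-left origin, a larger y means higher up on the page.
    for box in sorted(boxes, key=centre_y, reverse=True):
        if rows and abs(centre_y(rows[-1][0]) - centre_y(box)) <= tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b[1]) for row in rows]


if __name__ == "__main__":
    # The thirteen box lines quoted above, in file order.
    lines = [
        "+ 613 3237 631 3249 0", "அ 721 3223 737 3261 0",
        "க்ஷு 827 3223 843 3262 0", "/ 881 3228 913 3263 0",
        "^ 1253 3232 1281 3255 0", "* 1266 3285 1277 3309 0",
        "ு 1355 3232 1384 3255 0", "ம 1354 3285 1376 3309 0",
        "` 1418 3232 1431 3256 0", "@ 1413 3285 1459 3317 0",
        "- 1475 3232 1478 3234 0", "0 1465 3237 1487 3256 0",
        "ா 239 3285 258 3302 0",
    ]
    boxes = [parse_box_line(l) for l in lines]
    for row in reading_order_rows(boxes):
        print(" ".join(b[0] for b in row))
```

On these lines the grouping yields two rows, and the first box in reading order is the one starting at x=239, which is where the file was expected to begin. It also shows that the file interleaves boxes from two different text rows.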

The box file is edited and the revised file (tam.TAMKambanWide.exp00.box) is 
used for training, but segmentation errors follow, as shown below:
==
C:\indicocr\tesseract301>tesseract test.png test -l tam batch.nochop makebox
Tesseract Open Source OCR Engine v3.01 with Leptonica

C:\indicocr\tesseract301>tesseract tam.TAMKambanWide.exp00.png tam.TAMKambanWide.exp00 nobatch box.train
Tesseract Open Source OCR Engine v3.01 with Leptonica
APPLY_BOXES: boxfile line 16/! ((1314,3284),(1318,3309)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 39/[ ((674,3224),(684,3262)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 41/] ((774,3224),(784,3262)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 44/| ((966,3224),(970,3262)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 46/: ((1063,3232),(1068,3249)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 47/' ((1107,3251),(1113,3263)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 48/" ((1154,3251),(1174,3263)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 51/. ((1316,3232),(1321,3235)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:    1683
   Boxes failed resegmentation:       8
   Found 1675 good blobs and 0 unlabelled blobs in 0 words.
   0 remaining unlabelled words deleted.
TRAINING ... Font name = TAMKambanWide
Generated training data for 804 words
==
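
To see what the failing boxes have in common, the FAILURE lines can be reduced to a small table of glyph, width and height. This is a sketch under the assumption that the coordinates in the message are ((left,bottom),(right,top)); the regular expression was written against the log lines shown here and may need adjusting for other Tesseract versions.

```python
import re

# Matches e.g.:
# APPLY_BOXES: boxfile line 16/! ((1314,3284),(1318,3309)): FAILURE! ...
FAILURE = re.compile(
    r"boxfile line (\d+)/(\S+) \(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE!"
)


def parse_failures(log_text):
    """Return (box_line, glyph, width, height) for each failed box."""
    results = []
    for m in FAILURE.finditer(log_text):
        line_no, glyph = int(m.group(1)), m.group(2)
        left, bottom, right, top = (int(m.group(i)) for i in range(3, 7))
        results.append((line_no, glyph, right - left, top - bottom))
    return results


if __name__ == "__main__":
    sample_log = (
        "APPLY_BOXES: boxfile line 16/! ((1314,3284),(1318,3309)): FAILURE! "
        "Couldn't find a matching blob\n"
        "APPLY_BOXES: boxfile line 51/. ((1316,3232),(1321,3235)): FAILURE! "
        "Couldn't find a matching blob\n"
    )
    for line_no, glyph, width, height in parse_failures(sample_log):
        print(f"box line {line_no}: {glyph!r} is {width}x{height} px")
```

In the log above, all eight failures are punctuation marks in boxes at most 20 pixels wide, which may point at small blobs being filtered out or merged before the boxes are applied.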

any solution or comments?
regards
rnkantan

Original comment by rnkan...@gmail.com on 2 May 2012 at 11:54

Attachments:

GoogleCodeExporter commented 9 years ago

hello,

I have the same problem.

I want to improve the recognition speed for the OCR B font, just for digits and <> 
characters.

I use QT Box v1.08 for the bounding boxes. It seems to me that QT Box recognizes 
the characters (or blobs) but tesseract misses some. In my example I have 
1100 characters on the page and tesseract only finds 900.

I attached my files and a screenshot of the issue; any help would be 
appreciated.

Original comment by kaszin...@gmail.com on 24 Oct 2012 at 10:03

Attachments:

GoogleCodeExporter commented 9 years ago

@kaszinova: Your image does not follow the criteria mentioned on the training wiki. 
Because of that you got error messages. 
Version 3.02.02 recognizes 1080 of 1100 characters ;-)
If you visualize the error messages you can see the problem (red boxes in 
errors_in_ocr.normal.exp0.png).

If I make your character order more realistic (see ocr.normal.exp1.png & 
ocr.normal.exp1.box), tesseract 3.02 produces no errors.

Original comment by zde...@gmail.com on 4 Jan 2013 at 11:18

Attachments:

GoogleCodeExporter commented 9 years ago

@rnkantan: Can you please try 3.02 (or, better, the current svn code)? I tried:
    tesseract tam.TAMKambanWide.exp00.png tam.TAMKambanWide.exp00 nobatch box.train
and it worked:
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
row xheight=24, but median xheight = 17.631
row xheight=24, but median xheight = 17.631
row xheight=26, but median xheight = 17.631
row xheight=26, but median xheight = 17.631
row xheight=26, but median xheight = 17.631
row xheight=26, but median xheight = 17.631
row xheight=26, but median xheight = 17.631
row xheight=26, but median xheight = 17.631
APPLY_BOXES:
   Boxes read from boxfile:    1683
   Found 1683 good blobs.
TRAINING ... Font name = TAMKambanWide
Generated training data for 669 words

I tried:
    tesseract tam.TAB_Kamban_Italic.exp00.tif tam.TAB_Kamban_Italic.exp00 nobatch box.train
and I got:
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
row xheight=24, but median xheight = 17.5323
row xheight=26, but median xheight = 17.5323
row xheight=26, but median xheight = 17.5323
row xheight=24, but median xheight = 17.5323
APPLY_BOXES: boxfile line 1543/ே ((2149,1949),(2178,1980)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 3552/-! ((367,277),(376,301)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:    3553
   Boxes failed resegmentation:       2
APPLY_BOXES: Unlabelled word at :Bounding box=(239,3113)->(396,3153)
APPLY_BOXES: Unlabelled word at :Bounding box=(937,3116)->(1066,3151)
APPLY_BOXES: Unlabelled word at :Bounding box=(1229,3114)->(1360,3151)
APPLY_BOXES: Unlabelled word at :Bounding box=(1380,3114)->(1518,3151)
APPLY_BOXES: Unlabelled word at :Bounding box=(1545,3116)->(1688,3151)
APPLY_BOXES: Unlabelled word at :Bounding box=(239,2661)->(396,2701)
APPLY_BOXES: Unlabelled word at :Bounding box=(937,2664)->(1066,2699)
APPLY_BOXES: Unlabelled word at :Bounding box=(1229,2662)->(1360,2699)
APPLY_BOXES: Unlabelled word at :Bounding box=(1380,2662)->(1518,2699)
APPLY_BOXES: Unlabelled word at :Bounding box=(1545,2664)->(1688,2699)
APPLY_BOXES: Unlabelled word at :Bounding box=(607,2120)->(702,2156)
APPLY_BOXES: Unlabelled word at :Bounding box=(1443,2120)->(1538,2156)
APPLY_BOXES: Unlabelled word at :Bounding box=(444,864)->(505,890)
APPLY_BOXES: Unlabelled word at :Bounding box=(576,493)->(707,530)
APPLY_BOXES: Unlabelled word at :Bounding box=(2094,493)->(2209,530)
APPLY_BOXES: Unlabelled word at :Bounding box=(611,311)->(678,349)
APPLY_BOXES: Unlabelled word at :Bounding box=(1969,311)->(2035,349)
   Found 3551 good blobs.
   Leaving 6 unlabelled blobs in 0 words.
   17 remaining unlabelled words deleted.
TRAINING ... Font name = TAB_Kamban_Italic
Generated training data for 674 words
And I think that these errors are correct (i.e. you need to fix the box file).

Original comment by zde...@gmail.com on 4 Jan 2013 at 11:40

GoogleCodeExporter commented 9 years ago

Original comment by zde...@gmail.com on 4 Feb 2013 at 10:03

Changed state: WorksForMe

RaghavBhardwaj / tesseract-ocr

Tesseract (3.01) simply skips some letters during box making but complains of unlabelled words later. #687