dlareklami / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

assert failed while training tesseract #975

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. tesseract -l chi_sim test.STSONG.exp0.jpg test.STSONG.exp0 nobatch box.train

What is the expected output? What do you see instead?
I got several APPLY_BOXES failures and an assertion failure. The output is below:

Too many unichars in ambiguity on line 0
Too many unichars in ambiguity on line 0
Too many unichars in ambiguity on line 0
Too many unichars in ambiguity on line 0
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
row xheight=20.2963, but median xheight = 42.0385
row xheight=24, but median xheight = 42.0385
row xheight=20, but median xheight = 42.0385
row xheight=20, but median xheight = 42.0385
row xheight=20, but median xheight = 42.0385
row xheight=20, but median xheight = 42.0385
row xheight=20, but median xheight = 42.0385
row xheight=20, but median xheight = 42.0385
row xheight=20, but median xheight = 42.0385
row xheight=20, but median xheight = 42.0385
row xheight=20, but median xheight = 42.0385
row xheight=20, but median xheight = 42.0385
FAIL!
APPLY_BOXES: boxfile line 116/果 ((129,688),(171,729)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 119/本 ((287,687),(329,729)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 138/: ((135,584),(141,606)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 161/果 ((220,523),(262,564)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 273/甲 ((27,193),(59,234)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 284/) ((648,189),(659,230)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 290/丙 ((21,138),(63,178)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:     333
   Boxes failed resegmentation:       7
   Found 326 good blobs.
   Leaving 1 unlabelled blobs in 0 words.
TRAINING ... Font name = STSONG
!isnan(Feature->Params[i]):Error:Assert failed:in file mf.cpp, line 78
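
The APPLY_BOXES failures in the output above can be collected mechanically, which makes it easier to look up the affected entries in the box file. The following is a minimal Python sketch, not part of tesseract: the names BoxFailure and parse_box_failures are invented for this illustration, and the regular expression only assumes the message layout shown in this report. The four numbers are kept as the two corner points exactly as printed.

```python
import re
from typing import List, NamedTuple


class BoxFailure(NamedTuple):
    """One 'Couldn't find a matching blob' entry from the training log."""
    line: int   # line number in the box file, as printed in the log
    char: str   # the character the box was labelled with
    x1: int     # first corner, as printed
    y1: int
    x2: int     # second corner, as printed
    y2: int


# Matches e.g. "APPLY_BOXES: boxfile line 138/: ((135,584),(141,606)): FAILURE!"
_FAILURE_RE = re.compile(
    r"APPLY_BOXES: boxfile line (\d+)/(\S+) "
    r"\(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE!"
)


def parse_box_failures(log_text: str) -> List[BoxFailure]:
    """Return every failed box reported in the given log text."""
    return [
        BoxFailure(
            int(m.group(1)), m.group(2),
            int(m.group(3)), int(m.group(4)),
            int(m.group(5)), int(m.group(6)),
        )
        for m in _FAILURE_RE.finditer(log_text)
    ]
```

Since the run above was not started with box.train.stderr, the messages may arrive on standard output rather than standard error; capturing both streams into one file before parsing avoids having to know which.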

What version of the product are you using? On what operating system?
tesseract 3.02.02
 leptonica-1.69
  libjpeg 6b : libpng 1.2.49 : libtiff 4.0.3 : zlib 1.2.3

The OS is CentOS release 6.3 (64-bit).

Please provide any additional information below.

Original issue reported on code.google.com by chenjie2...@gmail.com on 1 Sep 2013 at 12:49

GoogleCodeExporter commented 9 years ago
I think I have encountered the same issue.

Different font sizes (e.g. 23px or 25px) cause no segfault.
Other fonts at 24px cause no segfault.
Generating a box file with tesseract and then running box.train.stderr on it causes no segfault.

> tesseract -v 
tesseract 3.02.02
 leptonica-1.69
  libgif 4.1.6 : libjpeg 8d : libpng 1.6.3 : libtiff 4.0.3 : zlib 1.2.8

> gdb -ex "run xx.Tengwar_Noldor.24.tif xx.Tengwar_Noldor.24 box.train.stderr" -ex bt tesseract
[...]
Starting program: /usr/bin/tesseract xx.Tengwar_Noldor.24.tif xx.Tengwar_Noldor.24 box.train.stderr
[...]
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
[...]
   Found 959 good blobs.
   Leaving 981 unlabelled blobs in 0 words.
   12 remaining unlabelled words deleted.
TRAINING ... Font name = Tengwar_Noldor
!isnan(Feature->Params[i]):Error:Assert failed:in file /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/mf.cpp, line 78

Program received signal SIGSEGV, Segmentation fault.
ERRCODE::error (this=this@entry=0x7ffff7cd8218 <_ZL13ASSERT_FAILED>, 
    caller=caller@entry=0x7ffff7a6f663 "!isnan(Feature->Params[i])", 
    action=action@entry=ABORT, 
    format=format@entry=0x7ffff7a50ea7 "in file %s, line %d")
    at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/ccutil/errcode.cpp:87
#0  ERRCODE::error (this=this@entry=0x7ffff7cd8218 <_ZL13ASSERT_FAILED>, 
    caller=caller@entry=0x7ffff7a6f663 "!isnan(Feature->Params[i])", 
    action=action@entry=ABORT, 
    format=format@entry=0x7ffff7a50ea7 "in file %s, line %d")
    at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/ccutil/errcode.cpp:87
#1  0x00007ffff79ee318 in ExtractMicros (Blob=<optimized out>, denorm=...)
    at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/mf.cpp:78
#2  0x00007ffff79df152 in ExtractFlexFeatures (FeatureDefs=..., 
    Blob=Blob@entry=0x21e59f0, denorm=...)
    at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/flexfx.cpp:53
#3  0x00007ffff79decfe in ExtractBlobFeatures (FeatureDefs=..., denorm=..., 
    Blob=Blob@entry=0x21e59f0)
    at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/extract.cpp:53
#4  0x00007ffff79d5afb in LearnBlob (FeatureDefs=..., FeatureFile=0x22098a0, 
    Blob=0x21e59f0, denorm=..., BlobText=0x253b418 "k 674 926 691 951 0", 
    FontName=0x220d0e8 "Tengwar_Noldor")
    at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/blobclass.cpp:109
#5  0x00007ffff79d5ca3 in LearnBlob (FeatureDefs=..., filename=..., 
    Blob=Blob@entry=0x21e59f0, denorm=..., 
    BlobText=BlobText@entry=0x253b418 "k 674 926 691 951 0")
    at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/blobclass.cpp:99
#6  0x00007ffff79d28e4 in tesseract::Classify::LearnPieces (this=this@entry=
    0x6c7a00, filename=filename@entry=0x6dc4e8 "xx.Tengwar_Noldor.24", 
    start=start@entry=10, length=1, threshold=threshold@entry=0, 
    segmentation=segmentation@entry=tesseract::CST_WHOLE, 
    correct_text=0x253b418 "k 674 926 691 951 0", word=word@entry=0x22221d0)
    at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/adaptmatch.cpp:438
#7  0x00007ffff79d46a4 in tesseract::Classify::LearnWord (
    this=this@entry=0x6c7a00, filename=0x6dc4e8 "xx.Tengwar_Noldor.24", 
    rejmap=<optimized out>, rejmap@entry=0x0, word=word@entry=0x22221d0)
    at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/adaptmatch.cpp:304
#8  0x00007ffff78c2fdb in tesseract::Tesseract::ApplyBoxTraining (this=
    0x6c7a00, filename=..., page_res=<optimized out>)
    at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/ccmain/applybox.cpp:791
#9  0x00007ffff78bd43d in tesseract::TessBaseAPI::Recognize (
    this=this@entry=0x7fffffffd780, monitor=monitor@entry=0x0)
    at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/api/baseapi.cpp:738
#10 0x00007ffff78beaad in tesseract::TessBaseAPI::ProcessPage (
    this=this@entry=0x7fffffffd780, pix=0x6dca20, 
    page_index=page_index@entry=0, 
    filename=filename@entry=0x7fffffffdc96 "xx.Tengwar_Noldor.24.tif", 
    retry_config=retry_config@entry=0x0, 
    timeout_millisec=timeout_millisec@entry=0, 
    text_out=text_out@entry=0x7fffffffd730)
    at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/api/baseapi.cpp:931
#11 0x00007ffff78beda2 in tesseract::TessBaseAPI::ProcessPages (
    this=this@entry=0x7fffffffd780, 
    filename=filename@entry=0x7fffffffdc96 "xx.Tengwar_Noldor.24.tif", 
    retry_config=retry_config@entry=0x0, 
    timeout_millisec=timeout_millisec@entry=0, 
    text_out=text_out@entry=0x7fffffffd730)
    at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/api/baseapi.cpp:846
#12 0x0000000000401ee0 in main (argc=<optimized out>, argv=0x7fffffffd928)
    at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/api/tesseractmain.cpp:183

Original comment by tho...@kuehne.cn on 2 Sep 2013 at 9:13

GoogleCodeExporter commented 9 years ago
Issue 976 has been merged into this issue.

Original comment by zde...@gmail.com on 2 Sep 2013 at 5:46

GoogleCodeExporter commented 9 years ago

Original comment by zde...@gmail.com on 2 Sep 2013 at 5:55

GoogleCodeExporter commented 9 years ago
Thanks for the reply. However, I don't think my problem is the same as issue 894, because there is no old tesseract installation on my machine. When I ran "find / -name libtesseract.so", only the following items were found:
/usr/local/lib/libtesseract.so
/extdisk2/tools/tesseract-ocr/api/.libs/libtesseract.so

Original comment by chenjie2...@gmail.com on 3 Sep 2013 at 12:53

GoogleCodeExporter commented 9 years ago
I can find only one libtesseract, so issue 894 isn't the cause.

> find / -name libtesseract*

/usr/lib64/libtesseract.so.3.0.2
/usr/lib64/libtesseract.so.3
/usr/lib64/libtesseract.so

> file /usr/lib64/libtesseract.so /usr/lib64/libtesseract.so.3 /usr/lib64/libtesseract.so.3.0.2

/usr/lib64/libtesseract.so:                             symbolic link to `libtesseract.so.3.0.2'
/usr/lib64/libtesseract.so.3:                           symbolic link to `libtesseract.so.3.0.2'
/usr/lib64/libtesseract.so.3.0.2:                       ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped

Original comment by tho...@kuehne.cn on 3 Sep 2013 at 7:38

GoogleCodeExporter commented 9 years ago
Really? Did you read comment #2 [1]? Did you try it?

[1] https://code.google.com/p/tesseract-ocr/issues/detail?id=894#c2

Original comment by zde...@gmail.com on 4 Sep 2013 at 7:12

GoogleCodeExporter commented 9 years ago
Yes, I tried using the absolute paths and it still failed.

Please note that I've used box.train on quite a lot of images, and this is the only one that causes a segfault.

> TESSDATA_PREFIX=/usr/share/ /usr/bin/tesseract xx.Tengwar_Noldor.24.tif xx.Tengwar_Noldor.24 box.train.stderr
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
[...]
   Found 959 good blobs.
   Leaving 981 unlabelled blobs in 0 words.
   12 remaining unlabelled words deleted.
TRAINING ... Font name = Tengwar_Noldor
!isnan(Feature->Params[i]):Error:Assert failed:in file /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/mf.cpp, line 78
Segmentation fault

Original comment by tho...@kuehne.cn on 5 Sep 2013 at 1:47

GoogleCodeExporter commented 9 years ago
Thank you, zde...@gmail.com. After I downloaded the svn version, the assert failure disappeared. However, the APPLY_BOXES failures still occur. I've checked the box file many times but couldn't find anything wrong with it.
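
Part of checking a box file by hand can be automated. The sketch below is only an illustration under stated assumptions: it assumes the usual tesseract 3.x box line layout of "symbol left bottom right top page" (the form seen in the backtrace above, e.g. "k 674 926 691 951 0"), and the function name check_box_file and its width and height parameters are hypothetical. It catches malformed or inverted boxes only; it cannot tell whether a well-formed box actually lines up with a blob in the image.

```python
from typing import Iterable, List, Optional, Tuple


def check_box_file(
    lines: Iterable[str],
    width: Optional[int] = None,
    height: Optional[int] = None,
) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for suspicious box file lines.

    width and height are the image size in pixels; if both are given,
    boxes that extend outside the image are reported as well.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        # Split from the right so the symbol itself is left untouched.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            problems.append((number, "expected 6 fields"))
            continue
        try:
            left, bottom, right, top, _page = (int(p) for p in parts[1:])
        except ValueError:
            problems.append((number, "non-integer coordinate"))
            continue
        if left >= right or bottom >= top:
            problems.append((number, "empty or inverted box"))
        elif width is not None and height is not None and (
            left < 0 or bottom < 0 or right > width or top > height
        ):
            problems.append((number, "box outside image"))
    return problems
```

A typical use would be to pass an open box file, for example check_box_file(open("test.STSONG.exp0.box", encoding="utf-8")), and compare the reported line numbers with those in the APPLY_BOXES messages. The file name here is inferred from the command line in the report, not confirmed.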

Original comment by chenjie2...@gmail.com on 6 Sep 2013 at 12:40

GoogleCodeExporter commented 9 years ago
@thomas@kuehne.cn:
You are not using the svn version as suggested. As you can see in issue 894 and in comment #8 here, the issue is fixed there.

Original comment by zde...@gmail.com on 6 Sep 2013 at 8:07

GoogleCodeExporter commented 9 years ago
@chenjie2001

1. JPEG is not a good format for OCR. Your image looks generated, so you should definitely use a different format (PNG or TIFF, but without JPEG compression).

2. If you do training, try to use binary images rather than multicolored ones.

3. The DPI recorded in the JPEG is 72. That is too low...

I played with your image a little (I binarized it, saved it as PNG, and changed the DPI information to 92) and it works for me:

tesseract test.STSONG.exp1.png test.STSONG.exp1 box.train

Tesseract Open Source OCR Engine v3.02.03 with Leptonica
row xheight=24, but median xheight = 29.125
row xheight=22, but median xheight = 29.125
APPLY_BOXES:
   Boxes read from boxfile:     333
   Found 333 good blobs.
TRAINING ... Font name = STSONG
Generated training data for 79 words

So the conclusion is that image pre-processing seems to be the key to success.
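
Point 3 above turns on the resolution recorded in the image file. As a rough sketch (not part of tesseract or Leptonica), the resolution stored in a converted PNG can be read from its pHYs chunk with nothing but the Python standard library. The function name png_dpi is made up for this illustration, and the code deliberately skips CRC validation, so it is a quick check rather than a PNG validator.

```python
import struct
from typing import Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
METRES_PER_INCH = 0.0254


def png_dpi(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk.

    Returns None if the file has no pHYs chunk or the chunk does not
    use metres as its unit (so no absolute resolution is recorded).
    """
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte length, 4-byte type, data, 4-byte CRC.
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"pHYs" and length == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:  # 1 means "pixels per metre"
                return None
            return (round(x_ppu * METRES_PER_INCH),
                    round(y_ppu * METRES_PER_INCH))
        if chunk_type in (b"IDAT", b"IEND"):
            return None  # pHYs, if present, comes before the image data
        pos += 12 + length
    return None
```

Reading the file with open(path, "rb").read() and passing the bytes to png_dpi shows whether the converted image really carries the intended resolution before it is handed to box.train.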

Original comment by zde...@gmail.com on 6 Sep 2013 at 8:17
