akorentlab / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Chopper index fix #736

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Use any training data that generates more than 200 words when running
 tesseract zzz.ocra.exp0.tif zzz.ocra.exp0 nobatch box.train

What is the expected output?
A good .tr file

What do you see instead?
tesseract: unicharmap.cpp:105: bool UNICHARMAP::contains(const char*) const: 
Assertion `*unichar_repr != '\0'' failed.

What version of the product are you using? On what operating system?
tesseract v3.01 on Linux (but any platform will cause this)

Please provide any additional information below.
I'm not sure why the conversion in 'modify_blob_choice' is necessary, but the 
'contains' function asserts if the first character is '\0' (which, technically, 
many allowable UTF-8 characters beyond ASCII would, but...)

Anyway, here's a workaround for now.
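
To illustrate how an index could end up as '\0', here is a minimal sketch. It is not the Tesseract code: it assumes that the conversion maps indexes 0-9 to the digits '0'-'9' and larger indexes to 'A', 'B', ... by adding an offset, and that the result is stored in a single byte. The names IndexToByte and WrapsToNul are hypothetical. Under those assumptions the byte wraps to zero at index 201, which would fit the "more than 200" threshold above.

```cpp
#include <cassert>

// Hypothetical stand-in for the index-to-character conversion (an assumption,
// not code taken from Tesseract): 0-9 -> '0'..'9', 10 and up -> 'A', 'B', ...
unsigned char IndexToByte(int chop_index) {
  int code = (chop_index <= 9) ? '0' + chop_index : 'A' - 10 + chop_index;
  // Narrowing to unsigned char wraps modulo 256, so a code of 256 becomes 0.
  return static_cast<unsigned char>(code);
}

// True when the converted byte is the string terminator, i.e. the one-character
// string built from it would be empty and would trip an assertion of the form
// *unichar_repr != '\0'.
bool WrapsToNul(int chop_index) {
  return IndexToByte(chop_index) == 0;
}
```

With this mapping, index 200 still gives a non-zero byte (255), while index 201 gives 'A' - 10 + 201 = 256, which wraps to 0.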

Original issue reported on code.google.com by pddf...@gmail.com on 26 Jul 2012 at 10:50

Attachments:

GoogleCodeExporter commented 9 years ago
1. Please can you try current svn (version 3.02)?
2. Can you please also attach example files for testing (zzz.ocra.exp0.tif 
zzz.ocra.exp0.box)?

Original comment by zde...@gmail.com on 27 Jul 2012 at 6:43

GoogleCodeExporter commented 9 years ago
Unfortunately, the same issue exists in v3.02 (same code there, so same effect 
:) ). I would like to send the image, but I cannot due to work restrictions. I 
suspect, however, that if you used a few patch files as samples, you too could 
generate more than 200 training words in a page.

BTW, there's a typo in the patch I sent (at least if you care about backwards 
compatibility to check your vectors): lines 16-18 should read
+    chop_index += '0' - 1;
+  else
+    chop_index += 'A' - 11;

Interestingly, the INVALID_UNICHAR_ID code (and its comment) must date from 
before the assert() was added.

Original comment by pddf...@gmail.com on 27 Jul 2012 at 12:42

GoogleCodeExporter commented 9 years ago
I cannot reproduce it (openSUSE 12.1):
$ tesseract  slk.cambria.exp001.tif  slk.cambria.exp001 nobatch box.train
Tesseract Open Source OCR Engine v3.01 with Leptonica
Page 0
APPLY_BOXES: boxfile line 537/— ((189,2623),(237,2628)): FAILURE! Couldn't 
find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:    2018
   Boxes failed resegmentation:       1
   Found 2017 good blobs and 0 unlabelled blobs in 0 words.
   0 remaining unlabelled words deleted.
TRAINING ... Font name = cambria
Generated training data for 433 words

Original comment by zde...@gmail.com on 27 Jul 2012 at 11:24

Attachments:

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r742.

Original comment by theraysm...@gmail.com on 21 Sep 2012 at 3:19