barum / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Error: Size of unicharset of mftraining is greater than MAX_NUM_CLASSES for "mftraining", fatal #137

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
When I tried to train a Chinese font, I got this:

$ mftraining uming-total.tr 
Reading uming-total.tr ...
Error: Size of unicharset of mftraining is greater than MAX_NUM_CLASSES

The font's unicharset is supposed to contain 24255 entries.

The good thing is that "cntraining" chewed through the .tr file without problems.

So tesseract is still not Unicode-ready.

Original issue reported on code.google.com by hash...@gmail.com on 5 Jul 2008 at 12:29

GoogleCodeExporter commented 9 years ago
Increasing MAX_NUM_CLASSES to 0x10FFFF (the Unicode codespace) actually lets
mftraining finish the job.

Original comment by hash...@gmail.com on 5 Jul 2008 at 1:21

GoogleCodeExporter commented 9 years ago
Of course, tesseract segfaults when actually running OCR on a sample.

The resulting inttemp is 153MB.

Original comment by hash...@gmail.com on 5 Jul 2008 at 1:31

GoogleCodeExporter commented 9 years ago
It segfaults at classify/adaptmatch.cpp:600   Results.BlobLength = MAX_INT32;

Apparently, it cannot hold that many classes in memory. Maybe the members
  CLASS_ID Classes[MAX_NUM_CLASSES];
  FLOAT32 Ratings[MAX_CLASS_ID + 1];
  uinT8 Configs[MAX_CLASS_ID + 1];
of ADAPT_RESULTS should be dynamically allocated according to the current
unicharset.size().
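
A rough sketch of that idea, not the actual tesseract code: the typedefs below are assumed stand-ins for CLASS_ID, FLOAT32 and uinT8, MAX_CLASS_ID + 1 is assumed to equal MAX_NUM_CLASSES, and FixedAdaptResults / DynamicAdaptResults are hypothetical names used only for illustration. It compares the footprint of the fixed-size members with a variant sized from the unicharset at run time.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed stand-ins for tesseract's typedefs; the real definitions may differ.
using CLASS_ID = std::int32_t;
using FLOAT32 = float;
using uinT8 = std::uint8_t;

// Hypothetical mirror of the three fixed-size members quoted above,
// parameterised on the class limit so its footprint can be inspected.
// Assumes MAX_CLASS_ID + 1 == MAX_NUM_CLASSES.
template <std::size_t kMaxNumClasses>
struct FixedAdaptResults {
  CLASS_ID Classes[kMaxNumClasses];
  FLOAT32 Ratings[kMaxNumClasses];
  uinT8 Configs[kMaxNumClasses];
};

// Hypothetical variant: the arrays live on the heap and are sized from
// the unicharset actually loaded, not from a compile-time maximum.
struct DynamicAdaptResults {
  explicit DynamicAdaptResults(std::size_t unicharset_size)
      : Classes(unicharset_size),
        Ratings(unicharset_size),
        Configs(unicharset_size) {}

  std::vector<CLASS_ID> Classes;
  std::vector<FLOAT32> Ratings;
  std::vector<uinT8> Configs;
};

// Bytes occupied by one fixed-size object when the limit is raised to the
// whole Unicode codespace; evaluated at compile time, nothing is allocated.
constexpr std::size_t kFixedBytesAtUnicodeLimit =
    sizeof(FixedAdaptResults<0x10FFFF>);
```

With 4-byte CLASS_ID and FLOAT32, kFixedBytesAtUnicodeLimit comes to roughly 10 MB per object. If such an object is a local variable, that may exceed the default stack size on many systems, which would be one plausible explanation for the crash. DynamicAdaptResults(24255) only allocates what this font needs.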

Anyway, the OCR result is very bad. It seems that:
1. Chinese characters are double width;
2. there may be small gaps inside a single glyph;
3. there is no extra white space between words. There is only character
spacing, no word spacing.

As a result, tesseract's segmentation of Chinese text is very bad.

Original comment by hash...@gmail.com on 5 Jul 2008 at 2:38

GoogleCodeExporter commented 9 years ago
There will be some improvement in this direction in 3.00.

Original comment by theraysm...@gmail.com on 14 Nov 2008 at 5:46

GoogleCodeExporter commented 9 years ago
Works well for Chinese in 3.00.

Original comment by theraysm...@gmail.com on 21 May 2010 at 11:30