Closed GoogleCodeExporter closed 9 years ago
Increasing MAX_NUM_CLASSES to 0x10FFFF (the Unicode codespace) actually lets
mftraining
finish the job.
Original comment by hash...@gmail.com
on 5 Jul 2008 at 1:21
Of course, tesseract segfaults when actually running OCR on a sample.
The resulting inttemp is 153MB.
Original comment by hash...@gmail.com
on 5 Jul 2008 at 1:31
It segfaults at classify/adaptmatch.cpp:600, on the line Results.BlobLength = MAX_INT32;
Apparently, it cannot hold that many classes in memory. Maybe the members
CLASS_ID Classes[MAX_NUM_CLASSES];
FLOAT32 Ratings[MAX_CLASS_ID + 1];
uinT8 Configs[MAX_CLASS_ID + 1];
of ADAPT_RESULTS should be dynamically allocated to match the current
unicharset.size().
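A minimal sketch of that idea, not the actual Tesseract code: the typedef widths, the constant kMaxNumClasses, the assumption that MAX_CLASS_ID + 1 equals MAX_NUM_CLASSES, and the struct names FixedAdaptResults and DynamicAdaptResults are all hypothetical stand-ins. With those assumptions the fixed-size members alone take roughly 10 MB, which would overflow a typical 8 MB stack if the struct is a local variable (the thread does not say where it lives).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the Tesseract typedefs; the widths are assumptions.
using CLASS_ID = std::int32_t;
using FLOAT32 = float;
using uinT8 = std::uint8_t;

// The value tried in this report; assumed here to size all three arrays.
constexpr std::size_t kMaxNumClasses = 0x10FFFF;

// Fixed-size layout mirroring the three members quoted above.
// Only used with sizeof here; never instantiate it on the stack.
struct FixedAdaptResults {
  CLASS_ID Classes[kMaxNumClasses];
  FLOAT32 Ratings[kMaxNumClasses];
  uinT8 Configs[kMaxNumClasses];
};

// Same members, sized at run time and stored on the heap.
struct DynamicAdaptResults {
  explicit DynamicAdaptResults(std::size_t num_classes)
      : Classes(num_classes), Ratings(num_classes), Configs(num_classes) {
    assert(num_classes > 0);  // an empty character set makes no sense here
  }
  std::vector<CLASS_ID> Classes;
  std::vector<FLOAT32> Ratings;
  std::vector<uinT8> Configs;
};
```

A caller would then construct DynamicAdaptResults with unicharset.size() instead of relying on the compile-time maximum, so memory use follows the character set actually loaded.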
Anyway, the OCR result is very bad. It seems that
1. Chinese characters are double width;
2. there might be small gaps inside a single glyph;
3. there is no extra white space between words. There is only character
spacing, no word spacing.
As a result, tesseract's segmentation of Chinese is very bad.
Original comment by hash...@gmail.com
on 5 Jul 2008 at 2:38
There will be some improvement in this direction in 3.00.
Original comment by theraysm...@gmail.com
on 14 Nov 2008 at 5:46
Works well for Chinese in 3.00.
Original comment by theraysm...@gmail.com
on 21 May 2010 at 11:30
Original issue reported on code.google.com by
hash...@gmail.com
on 5 Jul 2008 at 12:29