incorrect display using kan.traineddata (downloaded from svn)

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. O:\recovered files\drive M\rao-files\chilume\test-3.02-r806>tesseract rao1.tif test -l kan
Error: unichar |:|0n2 in normproto file is not in unichar set.
Error: unichar |:|1n2 in normproto file is not in unichar set.
Error: unichar |!|0n2 in normproto file is not in unichar set.
Error: unichar |!|1n2 in normproto file is not in unichar set.
Error: unichar |;|0n2 in normproto file is not in unichar set.
Error: unichar |;|1n2 in normproto file is not in unichar set.
Error: unichar |α▓░α▓é|0n2 in normproto file is not in unichar set.
Error: unichar |α▓░α▓é|1n2 in normproto file is not in unichar set.
Error: unichar |α▓░α▓┐α▓é|0n2 in normproto file is not in unichar set.
Error: unichar |α▓░α▓┐α▓é|1n2 in normproto file is not in unichar set.
Error: unichar |%|0n3 in normproto file is not in unichar set.
Error: unichar |%|1n3 in normproto file is not in unichar set.
Error: unichar |%|2n3 in normproto file is not in unichar set.
Error: unichar |α▓░α│Çα▓é|0n3 in normproto file is not in unichar set.
Error: unichar |α▓░α│Çα▓é|1n3 in normproto file is not in unichar set.
Error: unichar |α▓░α│Çα▓é|2n3 in normproto file is not in unichar set.
Error: unichar |α▓▓α▓é|0n2 in normproto file is not in unichar set.
Error: unichar |α▓▓α▓é|1n2 in normproto file is not in unichar set.
Tesseract Open Source OCR Engine v3.02 with Leptonica
O:\recovered files\drive M\rao-files\chilume\test-3.02-r806>


What is the expected output? What do you see instead?
Instead of the expected output, the command prints the same errors as shown in step 1 above.

I am unable to understand the errors, since they are all displayed in garbled
("encrypted") characters rather than in Kannada script. I would like to know
exactly which error (in Kannada script) is generated. How can I extract all the
files from the kan.traineddata file? Is the unicharambigs file also included in
the traineddata?
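
Note on the unreadable characters: sequences such as `α▓░α▓é` in the log look like the UTF-8 bytes of Kannada text rendered by a console that uses an OEM code page. The Python sketch below assumes code page 437 (an assumption; the report does not state the console's code page) and reverses that mis-decoding. The function name `fix_mojibake` is hypothetical and used only for illustration.

```python
def fix_mojibake(garbled: str, console_codepage: str = "cp437") -> str:
    """Map console output back to its raw bytes, then decode them as UTF-8."""
    # Assumption: the console displayed UTF-8 bytes using `console_codepage`.
    return garbled.encode(console_codepage).decode("utf-8")


if __name__ == "__main__":
    # Unichar strings copied from the garbled error messages in the report.
    for garbled in ["α▓░α▓é", "α▓░α▓┐α▓é", "α▓░α│Çα▓é", "α▓▓α▓é"]:
        print(garbled, "->", fix_mojibake(garbled))
```

Under that assumption, the four strings decode to the Kannada sequences that appear in the log posted in the next comment, so the messages themselves are not encrypted; only the console display is wrong.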

What version of the product are you using? On what operating system?
OS: Windows XP with SP3
tesseract-OCR r-O:\recovered files\drive M\rao-files\chilume\test-3.02-r806>tesseract -v
tesseract 3.02
 leptonica-1.68 (Mar 14 2011, 10:43:03) [MSC v.1500 LIB Release 32 bit]
  libgif 4.1.6 : libjpeg 8c : libpng 1.4.3 : libtiff 3.9.4 : zlib 1.2.5

Please provide any additional information below.

Original issue reported on code.google.com by withbles...@gmail.com on 27 Nov 2012 at 12:21

Attachments:

rao1.tif

GoogleCodeExporter commented 9 years ago

Indeed, there is no need to use a real image. It looks like something is broken in
kan.traineddata:
  tesseract test test -l kan
Error: unichar |:|0n2 in normproto file is not in unichar set.
Error: unichar |:|1n2 in normproto file is not in unichar set.
Error: unichar |!|0n2 in normproto file is not in unichar set.
Error: unichar |!|1n2 in normproto file is not in unichar set.
Error: unichar |;|0n2 in normproto file is not in unichar set.
Error: unichar |;|1n2 in normproto file is not in unichar set.
Error: unichar |ರಂ|0n2 in normproto file is not in unichar set.
Error: unichar |ರಂ|1n2 in normproto file is not in unichar set.
Error: unichar |ರಿಂ|0n2 in normproto file is not in unichar set.
Error: unichar |ರಿಂ|1n2 in normproto file is not in unichar set.
Error: unichar |%|0n3 in normproto file is not in unichar set.
Error: unichar |%|1n3 in normproto file is not in unichar set.
Error: unichar |%|2n3 in normproto file is not in unichar set.
Error: unichar |ರೀಂ|0n3 in normproto file is not in unichar set.
Error: unichar |ರೀಂ|1n3 in normproto file is not in unichar set.
Error: unichar |ರೀಂ|2n3 in normproto file is not in unichar set.
Error: unichar |ಲಂ|0n2 in normproto file is not in unichar set.
Error: unichar |ಲಂ|1n2 in normproto file is not in unichar set.
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Cannot open input file: test

Original comment by zde...@gmail.com on 30 Nov 2012 at 10:19

Changed state: Accepted
Added labels: Component-Persistence, OpSys-All, Type-Defect

GoogleCodeExporter commented 9 years ago

I tried to extract kan.unicharambigs from the kan.traineddata file. The
command-prompt session is reproduced below:
P:\recovered files\drive M\NewFolder>combine_tessdata -u tessdata/kan.traineddata ./kan.
Extracting tessdata components from tessdata/kan.traineddata
Wrote ./kan.config
Wrote ./kan.unicharset
Wrote ./kan.unicharambigs
Wrote ./kan.inttemp
Wrote ./kan.pffmtable
Wrote ./kan.normproto
Wrote ./kan.punc-dawg
Wrote ./kan.word-dawg
Wrote ./kan.number-dawg
Wrote ./kan.freq-dawg
Wrote ./kan.shapetable
P:\recovered files\drive M\NewFolder>
I am shocked to notice that kan.unicharambigs contains only English script
instead of Kannada script. I had expected kan.unicharambigs to contain only
Kannada script. For reference, the kan.unicharset file is also attached. So far
I am not getting 100% accurate output.
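
One way to check which extracted components actually contain Kannada text is to count the characters that fall in the Unicode Kannada block (U+0C80 to U+0CFF). The Python sketch below assumes the extracted files are UTF-8 text and that they sit in the current directory under the `kan.` prefix used above. The helper names `kannada_chars` and `report` are hypothetical.

```python
from pathlib import Path

# Unicode Kannada block, U+0C80..U+0CFF.
KANNADA_BLOCK = range(0x0C80, 0x0D00)


def kannada_chars(text: str) -> set:
    """Return the distinct characters of `text` that lie in the Kannada block."""
    return {ch for ch in text if ord(ch) in KANNADA_BLOCK}


def report(path: Path) -> None:
    # Assumption: the extracted component is UTF-8 encoded text.
    text = path.read_text(encoding="utf-8")
    print(f"{path}: {len(kannada_chars(text))} distinct Kannada characters")


if __name__ == "__main__":
    # Hypothetical file locations; adjust to wherever the components were written.
    for name in ("kan.unicharset", "kan.unicharambigs"):
        candidate = Path(name)
        if candidate.exists():
            report(candidate)
        else:
            print(f"{candidate}: not found, skipped")
```

A count of zero for kan.unicharambigs would confirm the observation that the file holds no Kannada entries, while a non-zero count for kan.unicharset would show that the character set itself does contain Kannada.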

Original comment by withbles...@gmail.com on 8 Dec 2012 at 6:52

Attachments:

GoogleCodeExporter commented 9 years ago

Are there any updates on the above issue? I am facing the same issue.

Original comment by saurabh....@gmail.com on 23 Sep 2014 at 9:32

GoogleCodeExporter commented 9 years ago

Attaching the Kannada DangAmbigs file (from SriRanga ji), which could be used
as a basis for creating a unicharambigs file for Kannada.

Original comment by shreeshrii on 18 Feb 2015 at 8:43

Attachments:

kan.DangAmbigs.txt

ecit241 / tesseract-ocr

incorrect display using kan.traineddata (downloaded from svn) #801