AmitGorvadiya / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

utf-8 string too long at line 699 - Kannada #215

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. See the tesseract.log filed herewith, which is self-explanatory.

What is the expected output? What do you see instead?
I don't know why the error "utf-8 string too long at line 699" was generated.

What version of the product are you using? On what operating system?
Tesseract 2.04 on Windows XP SP3

Please provide any additional information below.
As per the training wiki:
"If you need a description longer than 8 bytes, please file an issue."
Hence this issue is filed here.

Original issue reported on code.google.com by withbles...@gmail.com on 6 Jul 2009 at 6:57

Attachments:

GoogleCodeExporter commented 9 years ago
As per tesseract.log:
"utf-8 string too long at line 699": how do I locate line 699 in the box file or image file?
"APPLY_BOXES: Unlabelled word blk:1 row:12 allrows:12": how do I locate blk:1/row:12/allrows:12 in the box file or image file? (I am using IrfanView and Paintbrush.)
Guidance would be much appreciated.

Original comment by withbles...@gmail.com on 6 Jul 2009 at 7:15

GoogleCodeExporter commented 9 years ago

Original comment by withbles...@gmail.com on 6 Jul 2009 at 7:24

Attachments:

GoogleCodeExporter commented 9 years ago
I have updated the wiki to cover the new limit, which is 24 bytes.
Your box file contains the following (hex Unicode code points) at line 699:
ca4 ccd ca4 ccd caf ca8 ccd caf ccb (9 code points x 3 bytes each = 27 bytes total)
Is this really a single syllable? It doesn't look right to me, as there is no virama between the caf and ca8, so it looks like 2 syllables.
You can go to line 699 by opening the file in VC++ and typing Ctrl+G followed by the line number.

Original comment by theraysm...@gmail.com on 6 Jul 2009 at 5:02
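For anyone without VC++ at hand, the check described above can be sketched in a few lines of Python. This is only an illustration: `find_long_entries` and `MAX_BYTES` are hypothetical names, the box file layout (character text first, then four coordinates, separated by whitespace) is assumed from the entries quoted in this thread, and whether a description of exactly 24 bytes is accepted by Tesseract is also an assumption.

```python
# Sketch only: find_long_entries and MAX_BYTES are hypothetical names,
# not part of Tesseract.
MAX_BYTES = 24  # limit quoted in the comment above


def find_long_entries(lines, max_bytes=MAX_BYTES):
    """Return (line_number, description, utf8_byte_count) for oversized entries."""
    hits = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        description = fields[0]  # assumed: first field is the character text
        size = len(description.encode("utf-8"))
        if size > max_bytes:  # assumed: exactly max_bytes is still accepted
            hits.append((number, description, size))
    return hits


# The two box file entries discussed in this thread, written as escapes.
sample = [
    "\u0cb8\u0ccd\u0caf\u0cbe\u0ca8\u0ccd\u0caf\u0c83 1024 538 1114 576",
    "\u0ca4\u0ccd\u0ca4\u0ccd\u0caf\u0ca8\u0ccd\u0caf\u0ccb 909 536 1009 576",
]

for hit in find_long_entries(sample):
    print(hit)
```

To scan a real box file, pass an open file object instead of the list, for example `find_long_entries(open(path, encoding="utf-8"))`, where `path` is your own file name.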

GoogleCodeExporter commented 9 years ago
I opened the box file in VC++, pressed Ctrl+G and typed 699; it pointed to "ಸ್ಯಾನ್ಯಃ 1024 538 1114 576".
With the help of http://rishida.net/scripts/uniview/conversion.php I looked up the Unicode code points for ಸ್ಯಾನ್ಯಃ and for "ca4 ccd ca4 ccd caf ca8 ccd caf ccb" {ತ್ತ್ಯನ್ಯೋ 909 536 1009 576}. The particulars are noted below.
-----------------
"ಸ್ಯಾನ್ಯಃ 1024 538 1114 576" [ಸ್ಯಾ ನ್ಯಃ]
  0CB8  ಸ  KANNADA LETTER SA
  0CCD  ್  KANNADA SIGN VIRAMA
  0CAF  ಯ  KANNADA LETTER YA
  0CBE  ಾ  KANNADA VOWEL SIGN AA
  0CA8  ನ  KANNADA LETTER NA
  0CCD  ್  KANNADA SIGN VIRAMA
  0CAF  ಯ  KANNADA LETTER YA
  0C83  ಃ  KANNADA SIGN VISARGA
  0020     SPACE

---------------------------------------------------
"ca4 ccd ca4 ccd caf ca8 ccd caf ccb" {ತ್ತ್ಯನ್ಯೋ 909 536 1009 576} [ತ್ತ್ಯ ನ್ಯೋ]
  0CA4  ತ  KANNADA LETTER TA
  0CCD  ್  KANNADA SIGN VIRAMA
  0CA4  ತ  KANNADA LETTER TA
  0CCD  ್  KANNADA SIGN VIRAMA
  0CAF  ಯ  KANNADA LETTER YA
  -------
  0CA8  ನ  KANNADA LETTER NA
  0CCD  ್  KANNADA SIGN VIRAMA
  0CAF  ಯ  KANNADA LETTER YA
  0CCB  ೋ  KANNADA VOWEL SIGN OO

Original comment by withbles...@gmail.com on 6 Jul 2009 at 6:30
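The lookup done with the web converter above, and the syllable split shown in square brackets, can also be reproduced offline. The following is a rough sketch, not a full Kannada segmenter: `describe` and `split_syllables` are hypothetical helper names, and the splitting rule (a letter starts a new syllable unless the previous code point is a virama) is a simplifying assumption based on the reasoning earlier in this thread.

```python
import unicodedata

VIRAMA = "\u0ccd"  # KANNADA SIGN VIRAMA


def describe(text):
    """List each code point as 'HEX NAME', similar to the converter output."""
    return [f"{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


def split_syllables(text):
    """Naive split: a letter (category Lo) not preceded by a virama starts a new cluster."""
    clusters = []
    previous = ""
    for ch in text:
        starts_new = unicodedata.category(ch) == "Lo" and previous != VIRAMA
        if starts_new or not clusters:
            clusters.append(ch)
        else:
            clusters[-1] += ch  # signs and virama-joined letters stay attached
        previous = ch
    return clusters


# ca4 ccd ca4 ccd caf ca8 ccd caf ccb, the 27-byte entry from the box file
word = "\u0ca4\u0ccd\u0ca4\u0ccd\u0caf\u0ca8\u0ccd\u0caf\u0ccb"

for line in describe(word):
    print(line)

for cluster in split_syllables(word):
    print(cluster, len(cluster.encode("utf-8")), "bytes")
```

Splitting the long entry this way gives two clusters that each fit well inside the 24-byte limit, which is consistent with treating it as 2 syllables and giving each one its own box file line.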