Tamil training; revising existing trained data

What steps will reproduce the problem?
1. Train anew with Tesseract 3.01, using the enclosed tif and box files.
2. Run Tesseract with the new tam.traineddata file on the same tif.

What is the expected output? What do you see instead?
A standard output is expected; instead tesseract produces unsatisfactory
text, which is described below.

What version of the product are you using? On what operating system?

Tesseract 3.01, Windows portable version, on Windows XP (32-bit) with SP3.

Please provide any additional information below.

Since the existing tam.traineddata is not satisfactory (it requires
post-processing, which does not exist as of now) and defective (some
characters are missing), I am planning to train Tesseract 3.01 as per the
instructions available on the site (which are sketchy for an uninitiated
person like me).

From the enclosed one-page tif file I produced the enclosed box file, which I
edited manually, and then produced the necessary tr file and unicharset file.
(They are also enclosed. The unicharset file differs from version 2, where I
used to edit it manually to identify characters and numbers; here the file
also looks different from the Tesseract 3 training samples, but I edited it in
the same way to identify the characters, punctuation and non-punctuation
marks.)

Then mftraining and cntraining were run, and combine_tessdata was invoked:
=======================================
C:\indicocr\tesseract301>combine_tessdata tam.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 17376
Offset for type 4 is 2311873
Offset for type 5 is 2315461
Offset for type 6 is -1
Offset for type 7 is 2359668
Offset for type 8 is -1
Offset for type 9 is 2370166
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
==============
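
For reading the listing above: an offset of -1 appears to mean that no file
was found for that component. A minimal Python sketch for summarising such a
log follows. The mapping from type number to file suffix is an assumption
taken from the order of the file list in q2 below; types 10 to 12 are not
named anywhere in this post, so they are reported by number only.

```python
import re

# Assumed mapping of type index to file suffix, taken from the order of the
# list in q2. Types 10-12 are not named in this post and stay as numbers.
SUFFIXES = ["config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
            "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg"]

OFFSET_LINE = re.compile(r"Offset for type\s+(\d+) is (-?\d+)")


def parse_offsets(log_text):
    """Return {type_index: offset} from combine_tessdata output."""
    return {int(t): int(off) for t, off in OFFSET_LINE.findall(log_text)}


def label(type_index):
    """Suffix name if known from the q2 list, otherwise 'type N'."""
    if type_index < len(SUFFIXES):
        return SUFFIXES[type_index]
    return f"type {type_index}"


def missing_components(log_text):
    """Labels of components whose offset is -1 (assumed to mean 'absent')."""
    offsets = parse_offsets(log_text)
    return [label(t) for t in sorted(offsets) if offsets[t] == -1]


sample = "Offset for type 0 is -1\nOffset for type 1 is 108\nOffset for type 2 is -1"
print(missing_components(sample))
```

Applied to the full listing above, this would report the config,
unicharambigs, punc-dawg and number-dawg components plus types 10 to 12 as
absent.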

Now, using tesseract to OCR the tif file:

C:\indicocr\tesseract301>tesseract tam.latha.exp0.tif testtxt -l tam
Tesseract Open Source OCR Engine v3.01 with Leptonica
Page 0
=================================================
However, the txt file differs from the image; the main reasons for the
unsatisfactory result are:
a) ா is mistaken in places for ர் and ஈ (though the image is quite clear);
b) ெ is read on its own as பி or ஙி;
c) ே is read as (;
d) ை is never read correctly; it is always read as ள

It is to be noted that the vowel transition symbols ா, ெ, ே, ை, ௗ are
not trained on their own here; only the combined letters are trained (once
each: கெ, கே, கொ, கோ, கௌ etc.)
==============
Now my questions:

q1: Is the output unsatisfactory because the training was done with only one
set and each character was trained only once?
q2: Why not provide samples of the following English training files as they
are before combine_tessdata is run? (If they are already available, where are
they?)
•tessdata/eng.config 
•tessdata/eng.unicharset 
•tessdata/eng.unicharambigs 
•tessdata/eng.inttemp 
•tessdata/eng.pffmtable 
•tessdata/eng.normproto 
•tessdata/eng.punc-dawg 
•tessdata/eng.word-dawg 
•tessdata/eng.number-dawg 
•tessdata/eng.freq-dawg 
==
q3: Is there a way to uncombine the traineddata file (using sqfs etc.?) to get
the original files back?
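
The offsets printed by combine_tessdata above hint at how the combined file is
laid out: the first component starts at byte 108, which equals 4 + 13 x 8,
i.e. one 32-bit count followed by thirteen 64-bit offsets. Assuming that
layout and little-endian byte order (both inferred from the log, not confirmed
in this post), a Python sketch that splits such a file into its components
could look like this. combine_tessdata itself may offer an unpack option;
its usage message for the installed version is the place to check.

```python
import struct


def split_traineddata(blob):
    """Split a combined traineddata byte string into {type_index: bytes}.

    Assumed layout: int32 entry count, then that many int64 offsets
    (-1 = component absent), then the component data back to back.
    """
    (count,) = struct.unpack_from("<i", blob, 0)
    offsets = struct.unpack_from(f"<{count}q", blob, 4)
    present = sorted((off, idx) for idx, off in enumerate(offsets) if off >= 0)
    parts = {}
    for pos, (start, idx) in enumerate(present):
        # A component runs until the next present component, or to end of file.
        end = present[pos + 1][0] if pos + 1 < len(present) else len(blob)
        parts[idx] = blob[start:end]
    return parts


# Tiny synthetic file with 3 entries: type 0 absent, types 1 and 2 present.
demo = struct.pack("<i3q", 3, -1, 28, 31) + b"abc" + b"defgh"
for type_index, data in split_traineddata(demo).items():
    print(type_index, len(data))
```

For a real file, the bytes would come from open("tam.traineddata", "rb").read(),
and each part could be written back out with open(..., "wb") under whatever
suffix corresponds to its type index.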

q4: While using the existing tam.traineddata (see the enclosed output
page.txt), the text is read correctly but processed wrongly. How can
post-processing be done? I can give the rules; can somebody write the
necessary code?
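
The actual rules are not given in this post, so the following is only a
sketch of what one such rule might look like in Python. It assumes
(hypothetically) that the OCR text comes out in visual order, with the
left-side vowel signs ெ, ே, ை placed before their consonant, and moves them
after it; Unicode NFC normalisation then joins the two-part signs. The
function name and the rule itself are illustrative, not taken from Tesseract.

```python
import re
import unicodedata

# Vowel signs that are drawn to the LEFT of the consonant: e, ee, ai.
PREFIXED_SIGNS = "\u0bc6\u0bc7\u0bc8"
# Tamil consonants lie between U+0B95 (ka) and U+0BB9 (ha).
CONSONANT = "[\u0b95-\u0bb9]"

VISUAL_ORDER = re.compile(f"([{PREFIXED_SIGNS}])({CONSONANT})")


def visual_to_logical(text):
    """Move a left-side vowel sign after its consonant, then compose.

    Only valid for text that is entirely in visual order; text that is
    already in logical (Unicode) order would be corrupted by this rule.
    """
    reordered = VISUAL_ORDER.sub(r"\2\1", text)
    # NFC joins U+0BC6 + U+0BBE into U+0BCA, U+0BC7 + U+0BBE into U+0BCB,
    # and U+0BC6 + U+0BD7 into U+0BCC.
    return unicodedata.normalize("NFC", reordered)


print(visual_to_logical("\u0bc6\u0b95\u0bbe") == "\u0b95\u0bca")
```

Real rules for Tamil would need to be supplied and checked against sample
output such as page.txt before relying on anything like this.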

q5: Can somebody elaborate on the real meaning of the font_properties file?
==quote:
Each line of the font_properties file is formatted as follows:
<fontname> <italic> <bold> <fixed> <serif> <fraktur>
where <fontname> is a string naming the font (no spaces allowed!), and
<italic>, <bold>, <fixed>, <serif> and <fraktur> are all simple 0 or 1 flags
indicating whether the font has the named property.
===quote ends
isitalic, isbold: does this mean that the characters in the training tif are
italic or bold, or that the font can also be italic and bold? (What I mean is
that most TTF fonts have regular, italic, bold etc. as features.) Should I
train the same font for italic and bold separately, on separate pages of the
tif? Or should I simply mark 1 for both italic and bold to say that the font
can also have these additional characteristics?
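
To make the quoted line format concrete, here is a small Python sketch that
reads font_properties text into named flags. The font names in the sample are
hypothetical, and giving bold its own entry in the sample is only one
possible reading; whether that is required is exactly what q5 asks.

```python
from dataclasses import dataclass


@dataclass
class FontProperties:
    name: str
    italic: bool
    bold: bool
    fixed: bool
    serif: bool
    fraktur: bool


def parse_font_properties(text):
    """Parse '<fontname> <italic> <bold> <fixed> <serif> <fraktur>' lines."""
    fonts = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 6 or any(f not in ("0", "1") for f in fields[1:]):
            raise ValueError(
                f"line {line_no}: expected a font name and five 0/1 flags")
        name, *flags = fields
        fonts[name] = FontProperties(name, *(f == "1" for f in flags))
    return fonts


# Hypothetical entries, for illustration only.
sample = "latha 0 0 0 0 0\nlathab 0 1 0 0 0\n"
for font in parse_font_properties(sample).values():
    print(font)
```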

q6: multipage tif: 
==quote:
Clarification for large amounts of training data The 32 images limit is for the 
number of FONTS. Each font should be put in a single multi-page tiff (only if 
you are using libtiff!) and the box file can be modified to specify the page 
number for each character after the coordinates. Thus an arbitrarily large 
amount of training data may be created for any given font, allowing training 
for large character-set languages. An alternative to multi-page tiffs is to 
create many single-page tiffs for a single font, and then you must cat together 
the tr files for each font into several single-font tr files. In any case, the 
input tr files to mftraining must each contain a single font.
===quote ends
I am planning a multipage tif with the same Latha font. Can I use Latha bold
on one page and Latha italic on another? Also, can I have different sizes
(10, 12, 16) of the same font, one page per size?
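
As an illustration of the "page number after the coordinates" mentioned in the
quote, the Python sketch below groups box-file lines by page. It assumes each
line reads "<symbol> <left> <bottom> <right> <top> <page>" and treats a
missing page column as page 0; the coordinate order is my understanding of the
box format, and the sample coordinates are made up.

```python
from collections import defaultdict


def parse_box_line(line):
    """Return (symbol, (left, bottom, right, top), page) for one box line."""
    fields = line.split()
    if len(fields) == 5:
        fields.append("0")  # no page column: assume a single-page file
    if len(fields) != 6:
        raise ValueError(f"unexpected box line: {line!r}")
    symbol = fields[0]
    left, bottom, right, top, page = map(int, fields[1:])
    return symbol, (left, bottom, right, top), page


def boxes_by_page(text):
    """Group the boxes of a box file by their page number."""
    pages = defaultdict(list)
    for line in text.splitlines():
        if line.strip():
            symbol, box, page = parse_box_line(line)
            pages[page].append((symbol, box))
    return dict(pages)


# Made-up sample: one character on page 0 and one on page 1.
sample = "க 10 20 30 40 0\nகெ 50 20 80 40 1\n"
for page, boxes in boxes_by_page(sample).items():
    print(page, boxes)
```

A check like this can confirm that every page of a multipage tif has at least
some boxes before the tr file is generated.
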

q7: What exactly is meant by exp(num) from the user's viewpoint? Does it refer
to the multiple pages for a given font? If tam.latha.exp0 is already
multipage, how are the pages referred to?
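
For reference, the file name used above, tam.latha.exp0.tif, can be split into
its parts as in the Python sketch below. The pattern is inferred from that one
name only and says nothing about what the number is meant to stand for, which
is the open question here.

```python
import re

# Pattern inferred from the single name "tam.latha.exp0.tif" in this post.
NAME = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.(?P<ext>tif|box|tr)$")


def parse_training_name(filename):
    """Return (lang, font, exp_number, extension) for a training file name."""
    match = NAME.match(filename)
    if match is None:
        raise ValueError(f"not a <lang>.<font>.exp<num>.<ext> name: {filename!r}")
    return match["lang"], match["font"], int(match["num"]), match["ext"]


print(parse_training_name("tam.latha.exp0.tif"))
```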

regards
rnkantan
Original issue reported on code.google.com by rnkan...@gmail.com on 29 Mar 2012 at 7:52