AiPacino / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

shapeclustering/mftraining error #1110

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
Using the provided files I run either

shapeclustering -F font_properties -U unicharset datamouse.jbcgulliver.exp0.tr

or

mftraining -F font_properties -U unicharset -O datamouse.unicharset datamouse.jbcgulliver.exp0.tr

and the following error is given:

Warning: No shape table file present: shapetable
Reading datamouse.jbcgulliver.exp0.tr ...
Font id = -1/0, class id = 1/76 on sample 0
font_id >= 0 && font_id < font_id_map_.SparseSize():Error:Assert failed:in file trainingsampleset.cpp, line 622
Abort trap: 6

Similar issues have been posted, and I have followed all the information 
provided in them and repeated the training process several times with clean 
files, but without success.

What is the expected output? What do you see instead?
I expect the proper output files to be generated but instead an error is thrown.

What version of the product are you using? On what operating system?
Tesseract 3.02 (obtained using MacPorts)
Mac OSX 10.8.4

Please provide any additional information below.

Thanks for any help.

Original issue reported on code.google.com by jstak...@gmail.com on 13 Feb 2014 at 2:58

GoogleCodeExporter commented 9 years ago
The message is clear: "No shape table file present: shapetable".
So shapeclustering did not create the file that is needed.

You uploaded the file newlang.jbcgulliver.exp0.tr, but you wrote that you ran the command:
shapeclustering -F font_properties -U unicharset datamouse.jbcgulliver.exp0.tr

"Font id = -1/0" indicates that your font_properties file is not correct. Check 
Requirements_for_text_input_files [1] once again.

[1] https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Requirements_for_text_input_files

Original comment by zde...@gmail.com on 13 Feb 2014 at 9:18

GoogleCodeExporter commented 9 years ago
I repeated the process again, double-checking the file requirements and my 
command inputs, but I am still receiving the same error.

1) tesseract datamouse.jbcgulliver.exp0.png datamouse.jbcgulliver.exp0 batch.nochop makebox
2) tesseract datamouse.jbcgulliver.exp0.png datamouse.jbcgulliver.exp0 box.train
output: 
APPLY_BOXES:
   Boxes read from boxfile:    2694
   Found 2694 good blobs.
TRAINING ... Font name = jbcgulliver
Generated training data for 639 words
3) unicharset_extractor datamouse.jbcgulliver.exp0.box
output:
Wrote unicharset file ./unicharset.
4) shapeclustering -F font_properties -U unicharset datamouse.jbcgulliver.exp0.tr
output:
Reading datamouse.jbcgulliver.exp0.tr ...
Font id = -1/0, class id = 1/76 on sample 0
font_id >= 0 && font_id < font_id_map_.SparseSize():Error:Assert failed:in file trainingsampleset.cpp, line 622
Abort trap: 6
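
For anyone repeating these steps, the four commands can be generated from a single base name so that the .png, .box and .tr names stay consistent. This is only a sketch in Python: the helper name training_commands is hypothetical, it only builds and prints the command lines (it does not run them), and it uses exactly the arguments shown in steps 1 to 4 above.

```python
import shlex


def training_commands(base, font_properties="font_properties",
                      unicharset="unicharset"):
    """Build the argument lists for steps 1-4 from one base name.

    `base` is expected to look like 'datamouse.jbcgulliver.exp0'.
    This helper is illustrative only and is not part of Tesseract.
    """
    return [
        ["tesseract", base + ".png", base, "batch.nochop", "makebox"],
        ["tesseract", base + ".png", base, "box.train"],
        ["unicharset_extractor", base + ".box"],
        ["shapeclustering", "-F", font_properties, "-U", unicharset,
         base + ".tr"],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for argv in training_commands("datamouse.jbcgulliver.exp0"):
        print(shlex.join(argv))
```

Keeping one base name avoids the kind of mismatch between newlang.* and datamouse.* file names mentioned earlier in this thread.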

Please see attached files

Original comment by jstak...@gmail.com on 13 Feb 2014 at 5:25


GoogleCodeExporter commented 9 years ago
I just discovered the issue. There appears to be an error in the documentation.

To get the files working, the names MUST be of the format

[lang].[fontname].exp[num].tr

and the name in the font_properties file MUST be JUST the font name.

I was confused because the documentation states:

"The name of the .tr file may be either fontname.tr or 
[lang].[fontname].exp[num].tr"

but fontname.tr did not appear to work for me. 

When I switched the format back to [lang].[fontname].exp[num].tr, I followed 
the line in the documentation that says "each .tr filename must match an entry 
in the font_properties file", so I put [lang].[fontname].exp[num] in 
font_properties, but that did not work either.

In summary, the only formatting that worked for me was:
[lang].[fontname].exp[num].tr (not [fontname].tr !)
fontname (not filename!) in font_properties
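
A quick way to check this before running shapeclustering or mftraining is sketched below in Python. It assumes the .tr name follows [lang].[fontname].exp[num].tr and that each font_properties line starts with the font name followed by its flags; the function names and the sample font_properties line (including its flag values) are made up for illustration and are not part of Tesseract.

```python
import re

# Assumed naming pattern for training files: [lang].[fontname].exp[num].tr
TR_NAME = re.compile(r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.tr$")


def font_from_tr(filename):
    """Return the font name embedded in a .tr file name, or None."""
    match = TR_NAME.match(filename)
    return match.group("font") if match else None


def fonts_in_properties(text):
    """Collect the first field of every non-empty font_properties line."""
    fonts = set()
    for line in text.splitlines():
        fields = line.split()
        if fields:
            fonts.add(fields[0])
    return fonts


def tr_matches_properties(tr_filename, properties_text):
    """True if the font name in the .tr file name is listed in font_properties."""
    font = font_from_tr(tr_filename)
    return font is not None and font in fonts_in_properties(properties_text)


if __name__ == "__main__":
    # Hypothetical font_properties content; the flag values are placeholders.
    good = "jbcgulliver 0 0 0 0 0\n"
    bad = "datamouse.jbcgulliver.exp0 0 0 0 0 0\n"
    print(tr_matches_properties("datamouse.jbcgulliver.exp0.tr", good))
    print(tr_matches_properties("datamouse.jbcgulliver.exp0.tr", bad))
```

With these assumptions, the first call reports a match and the second does not, which mirrors the two cases described above.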

Original comment by jstak...@gmail.com on 13 Feb 2014 at 5:38