JohnWang0512 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Need Help: OCRA font for English for simple numeric glyphs #627

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I have a blob of numbers, in OCRA font, that I want to recognize.  Other fonts 
such as Arial, Times New Roman, Courier, & Palatino work fine for recognizing 
the numeric glyphs.  Ironically, OCRA, which was designed to assure accuracy in 
optical character recognition, is failing with a standard Tesseract install.

My problem is that OCRA font was used on output that now needs to be optically 
recognized.  So, I figured I needed to train Tesseract to handle OCRA font 
numeric glyphs.  I'm unable to successfully train Tesseract and need help or a 
pointer on where I ran afoul.

Attached (OCRA_numbers_variety.png) is a sample PNG file showing a variety of 
fonts, but most importantly a sample of OCRA font for the number set 0..9.

Tesseract's attempt to recognize the string "0123456789" 
results in: ULE3H5E?Bq

Here is what I did in an attempt to train tesseract.

I created two samples in OpenOffice.  The first was a single line with 
"0123456789" on it. The second was each number on its own line, so there is a 
column of 0-9.  I printed to Adobe Acrobat at 400 dpi.  In Acrobat, I exported 
the pages at 400 dpi as PNG images.  

On a Gentoo Linux box, I did the following, using sample set 2:

#
# create the box file
#
tesseract eng.ocra.exp2.png eng.ocra.exp2 batch.nochop makebox

jlpoole@hermes ~/work/tess/samples $ tesseract eng.ocra.exp2.png eng.ocra.exp2 
batch.nochop makebox
Tesseract Open Source OCR Engine v3.02 with Leptonica
jlpoole@hermes ~/work/tess/samples $ cat eng.ocra.exp2.box
U 322 4027 348 4068 0
L 322 3957 348 3998 0
E 322 3888 348 3929 0
3 322 3818 348 3859 0
H 323 3749 347 3790 0
5 322 3679 348 3720 0
E 322 3610 348 3651 0
? 322 3541 348 3582 0
B 322 3471 348 3512 0
q 322 3402 348 3443 0
jlpoole@hermes ~/work/tess/samples $
#
# edit the box file to correct the character
# after the edits:
#
jlpoole@hermes ~/work/tess/samples $ nano eng.ocra.exp2.box
jlpoole@hermes ~/work/tess/samples $ cat eng.ocra.exp2.box
0 322 4027 348 4068 0
1 322 3957 348 3998 0
2 322 3888 348 3929 0
3 322 3818 348 3859 0
4 323 3749 347 3790 0
5 322 3679 348 3720 0
6 322 3610 348 3651 0
7 322 3541 348 3582 0
8 322 3471 348 3512 0
9 322 3402 348 3443 0
jlpoole@hermes ~/work/tess/samples $
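When the ground truth is known in advance, as it is here ("0123456789", one glyph per box, in reading order), the hand edit in nano can be replaced by a small filter. This is only a sketch: the function name `relabel_boxes` is made up for illustration, and it assumes single-byte labels and that makebox found exactly one box per glyph.

```shell
#!/bin/sh
# relabel_boxes TRUTH BOXFILE   (hypothetical helper, not part of Tesseract)
# Print BOXFILE with the label (first field) of line N replaced by the
# Nth character of TRUTH. Fails if the number of boxes differs from the
# number of characters in TRUTH.
relabel_boxes() {
    truth=$1
    boxfile=$2
    awk -v truth="$truth" '
        NR > length(truth) { bad = 1; exit }
        { $1 = substr(truth, NR, 1); print }
        END {
            if (bad || NR != length(truth)) {
                print "box count does not match truth length" > "/dev/stderr"
                exit 1
            }
        }
    ' "$boxfile"
}

# Usage: write to a temporary file and replace the original only on success.
# relabel_boxes 0123456789 eng.ocra.exp2.box > fixed.box &&
#     mv fixed.box eng.ocra.exp2.box
```

The coordinates are passed through untouched; only the label column changes.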

#
# run in the training mode, "Run Tesseract for Training"
#
tesseract eng.ocra.exp2.png eng.ocra.exp2 nobatch box.train

jlpoole@hermes ~/work/tess/samples $ tesseract eng.ocra.exp2.png eng.ocra.exp2 
nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
   Boxes read from boxfile:      10
   Found 10 good blobs.
TRAINING ... Font name = ocra
Generated training data for 1 words
jlpoole@hermes ~/work/tess/samples $

#
# "Compute the Character Set"
#
unicharset_extractor eng.ocra.exp2.box 

jlpoole@hermes ~/work/tess/samples $ unicharset_extractor eng.ocra.exp2.box
Extracting unicharset from eng.ocra.exp2.box
Wrote unicharset file ./unicharset.
jlpoole@hermes ~/work/tess/samples $ cat unicharset
11
NULL 0 NULL 0
0 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 1 0 0      # 0 [30 ]0
1 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 2 0 0      # 1 [31 ]0
2 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 3 0 0      # 2 [32 ]0
3 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 4 0 0      # 3 [33 ]0
4 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 5 0 0      # 4 [34 ]0
5 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 6 0 0      # 5 [35 ]0
6 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 7 0 0      # 6 [36 ]0
7 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 8 0 0      # 7 [37 ]0
8 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 9 0 0      # 8 [38 ]0
9 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 10 0 0     # 9 [39 ]0
jlpoole@hermes ~/work/tess/samples $
#
# Create file "font_properties"
#
jlpoole@hermes ~/work/tess/samples $ cat font_properties
ocra 0 0 1 0 0

jlpoole@hermes ~/work/tess/samples $
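For reference, each line of font_properties has the form `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`, and the font name has to match the middle component of the training file name (`ocra` in `eng.ocra.exp2`, which is also what the "Font name = ocra" line above reports). A minimal sketch for generating the file; both function names are hypothetical helpers, not Tesseract tools.

```shell
#!/bin/sh
# write_font_properties FONT ITALIC BOLD FIXED SERIF FRAKTUR [FILE]
# Write a single font_properties line; FILE defaults to ./font_properties.
write_font_properties() {
    printf '%s %d %d %d %d %d\n' "$1" "$2" "$3" "$4" "$5" "$6" \
        > "${7:-font_properties}"
}

# font_from_tr FILE.tr
# Print the font name embedded in a lang.font.expN.tr file name.
font_from_tr() {
    basename "$1" .tr | cut -d. -f2
}

# Usage: OCR-A is monospaced, so only the "fixed" flag is set.
# write_font_properties "$(font_from_tr eng.ocra.exp2.tr)" 0 0 1 0 0
```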

#
# Run MF Training, "Clustering" Step 1: mftraining
#
mftraining -F font_properties -U unicharset -O eng.unicharset eng.ocra.exp2.tr 

jlpoole@hermes ~/work/tess/samples $ mftraining -F font_properties -U 
unicharset -O eng.unicharset eng.ocra.exp2.tr
Read shape table shapetable of 0 shapes
Reading eng.ocra.exp2.tr ...
Warning: no protos/configs for 0 in CreateIntTemplates()
Warning: no protos/configs for 1 in CreateIntTemplates()
Warning: no protos/configs for 2 in CreateIntTemplates()
Warning: no protos/configs for 3 in CreateIntTemplates()
Warning: no protos/configs for 4 in CreateIntTemplates()
Warning: no protos/configs for 5 in CreateIntTemplates()
Warning: no protos/configs for 6 in CreateIntTemplates()
Warning: no protos/configs for 7 in CreateIntTemplates()
Warning: no protos/configs for 8 in CreateIntTemplates()
Warning: no protos/configs for 9 in CreateIntTemplates()
Done!
jlpoole@hermes ~/work/tess/samples $
#
# cntraining, "Clustering" Step 2: cntraining
#
cntraining eng.ocra.exp2.tr 

jlpoole@hermes ~/work/tess/samples $ cntraining eng.ocra.exp2.tr
Reading eng.ocra.exp2.tr ...
Clustering ...

Writing normproto ...
jlpoole@hermes ~/work/tess/samples $
#
# Was a file "unicharambigs" created?
# conclusion: no
#

jlpoole@hermes ~/work/tess/samples $ ls uni*
unicharset
jlpoole@hermes ~/work/tess/samples $

#
#  "Putting It All Together"
#
combine_tessdata eng.

jlpoole@hermes ~/work/tess/samples $ combine_tessdata eng.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is -1
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is -1
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
jlpoole@hermes ~/work/tess/samples $
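An offset of -1 means that component was not found and is absent from the combined file. As far as I recall the 3.0x numbering (treat this as an assumption, not something stated in this thread), type 1 is the unicharset, 3 is inttemp, 4 is pffmtable and 5 is normproto, and all four are needed for a usable traineddata. In the output above only type 1 was picked up. A sketch that flags this from a saved log; `missing_required_types` is a made-up name.

```shell
#!/bin/sh
# missing_required_types LOGFILE   (hypothetical helper)
# Read saved combine_tessdata output and print the type numbers among
# 1, 3, 4 and 5 whose offset is -1. Prints nothing if all are present.
missing_required_types() {
    awk '
        $1 == "Offset" && $3 == "type" && $6 == "-1" &&
        ($4 == 1 || $4 == 3 || $4 == 4 || $4 == 5) { print $4 }
    ' "$1"
}

# Usage:
# combine_tessdata eng. > combine.log 2>&1
# missing_required_types combine.log
```

Run against the log above, this would list 3, 4 and 5, which matches the three files that turn out to be missing the `eng.` prefix in the follow-up comment.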

#
# try 
#

tesseract OCRA_numbers_variety.png output -l eng

jlpoole@hermes ~/work/tess/samples $ cat output.txt
0123456789 Aﬁal

0123456789 Tnnes

0123456789 Courier
0123456789 Courier SWA
0123456789 Palatino

0123456789 Djvu Sans Mono
ULE3H5E?Bq OCRA

ULE3H5b?Bq OCRA-A -Std

jlpoole@hermes ~/work/tess/samples $

#
# Does not being root for the final combination affect the outcome?
# Conclusion: no.
#

jlpoole@hermes ~/work/tess/samples $ su
Password:
hermes samples # /usr/local/bin/combine_tessdata eng.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is -1
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is -1
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
hermes samples #  tesseract OCRA_numbers_variety.png output -l eng
bash: tesseract: command not found
hermes samples # /usr/local/bin/tesseract OCRA_numbers_variety.png output -l eng
Tesseract Open Source OCR Engine v3.02 with Leptonica
hermes samples # cat output.txt
0123456789 Aﬁal

0123456789 Tnnes

0123456789 Courier
0123456789 Courier SWA
0123456789 Palatino

0123456789 Djvu Sans Mono
ULE3H5E?Bq OCRA

ULE3H5b?Bq OCRA-A -Std

hermes samples #

It looks like something went wrong at the MF Training Step 1, as indicated by 
the warnings.
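One possible contributor, which this thread does not confirm, is that the box file holds only a single sample of each digit; the training guidance of that era, as I remember it, suggested several samples per character. A sketch that lists under-represented glyphs in a box file; `undersampled_glyphs` is a hypothetical name and the default threshold of 5 is an assumption.

```shell
#!/bin/sh
# undersampled_glyphs BOXFILE [MIN]   (hypothetical helper)
# Print "glyph count" for every label that occurs fewer than MIN times
# in BOXFILE. MIN defaults to 5, which is an assumed threshold.
undersampled_glyphs() {
    awk -v min="${2:-5}" '
        { count[$1]++ }
        END {
            for (g in count)
                if (count[g] < min)
                    print g, count[g]
        }
    ' "$1" | sort
}

# Usage:
# undersampled_glyphs eng.ocra.exp2.box
```

With the ten-line box file shown earlier, every digit would be reported with a count of 1.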

Original issue reported on code.google.com by jlpool...@gmail.com on 18 Feb 2012 at 7:46

Attachments:

GoogleCodeExporter commented 9 years ago
I neglected to copy three generated files so they have an "eng" prefix:

cp normproto eng.normproto
cp inttemp eng.inttemp
cp pffmtable eng.pffmtable

There was no Microfeat in my directory, so I concluded it is not needed. After 
creating these prefixed files, I reran the combine command.
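The three copies can be wrapped so that a missing file is reported instead of being silently skipped by combine_tessdata. A sketch only: `prefix_training_files` is a made-up helper, and the default file list is just the three files named in this comment (pass other names as extra arguments if a given version produces more).

```shell
#!/bin/sh
# prefix_training_files LANG [FILE...]   (hypothetical helper)
# Copy each FILE in the current directory to LANG.FILE. Defaults to the
# three files named in this comment. Returns non-zero if any is missing.
prefix_training_files() {
    lang=$1
    shift
    [ $# -gt 0 ] || set -- normproto inttemp pffmtable
    status=0
    for f in "$@"; do
        if [ -f "$f" ]; then
            cp "$f" "$lang.$f"
        else
            echo "missing: $f" >&2
            status=1
        fi
    done
    return $status
}

# Usage:
# prefix_training_files eng && combine_tessdata eng.
```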

I also determined that I had to deploy the eng.traineddata to 
/usr/local/share/tessdata (after copying the existing eng.traineddata that came 
with tesseract to preserve a working solution).  After deploying 
eng.traineddata, I got an error as follows: 

jlpoole@hermes ~/work/tess/samples $ tesseract OCRA_numbers_variety.png output 
-l eng
tesseract: unicharmap.cpp:105: bool UNICHARMAP::contains(const char*) const: 
Assertion `*unichar_repr != '\0'' failed.
Aborted
jlpoole@hermes ~/work/tess/samples $
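Since a broken traineddata aborts tesseract outright, it helps to keep the stock file recoverable when deploying. A minimal sketch of the backup-then-install step described above; `install_traineddata` and the `.orig` suffix are hypothetical choices, and the tessdata path is the one from this comment.

```shell
#!/bin/sh
# install_traineddata FILE TESSDATA_DIR   (hypothetical helper)
# Copy FILE into TESSDATA_DIR. If a file of the same name is already
# there and no .orig backup exists yet, save it as NAME.orig first, so
# repeated installs never overwrite the stock backup.
install_traineddata() {
    src=$1
    dir=$2
    dest=$dir/$(basename "$src")
    if [ -f "$dest" ] && [ ! -f "$dest.orig" ]; then
        cp -p "$dest" "$dest.orig" || return 1
    fi
    cp "$src" "$dest"
}

# Usage (as root):
# install_traineddata eng.traineddata /usr/local/share/tessdata
# To roll back:
# cp /usr/local/share/tessdata/eng.traineddata.orig \
#    /usr/local/share/tessdata/eng.traineddata
```

Training under a separate language code, as is done with `num` later in this thread, avoids touching the stock `eng` data at all.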

Original comment by jlpool...@gmail.com on 19 Feb 2012 at 12:18

GoogleCodeExporter commented 9 years ago
Tested under tesseract 3.02; the attached files are self-explanatory.
There are misspellings in the font names in the output text, even though the 
box file contains the correct spelling. I successfully trained Tesseract to 
handle the OCRA font numeric glyphs, but not the English glyphs. I don't know 
whether this fulfills the poster's expectation.

Original comment by withbles...@gmail.com on 26 Feb 2012 at 12:13

Attachments:

GoogleCodeExporter commented 9 years ago
Since Issue #629 embodies the same problem identified in this Issue #627, I'm 
considering this issue closed and am pursuing the matter concerning tesseract 
3.02. [Version 681] in Issue #629.  I updated my version of tesseract to 
today's build and I still had problems.  Reference should be made to Issue #629 
unless someone advises otherwise.  

Thank you.

Original comment by jlpool...@gmail.com on 26 Feb 2012 at 7:37

GoogleCodeExporter commented 9 years ago
Re: "I still had problems": please explain in detail exactly what problems 
remained. I would like to test after downloading the latest version, r-683, 
on WinXP. Please upload sample text from which I can generate the tif/box 
files myself for testing and feedback.

Original comment by withbles...@gmail.com on 27 Feb 2012 at 4:10

GoogleCodeExporter commented 9 years ago
When I tried to run tesseract against a newly built traineddata (build 681), I 
got this error message instead of output:

jlpoole@themis ~/work/tess/samples_b681 $ tesseract num.ocra.exp0.png output -l 
num
tesseract: unicharmap.cpp:105: bool UNICHARMAP::contains(const char*) const: 
Assertion `*unichar_repr != '\0'' failed.
Aborted
jlpoole@themis ~/work/tess/samples_b681 $

Original comment by jlpool...@gmail.com on 27 Feb 2012 at 5:01

GoogleCodeExporter commented 9 years ago
@jlpoole56:

If you still have this problem, please post your files.

Original comment by zde...@gmail.com on 10 May 2012 at 6:30

GoogleCodeExporter commented 9 years ago
I solved my problem in a later bug where I posted a perl script that can be 
used to train.  This bug may be closed.

Original comment by jlpool...@gmail.com on 10 May 2012 at 3:24

GoogleCodeExporter commented 9 years ago

Original comment by zde...@gmail.com on 10 May 2012 at 4:29