0amitkumar0 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Training a new font is obscenely difficult #620

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Look at the training page
2. Gasp

It's really hard to train a new font right now. It's one of the most hacky,
ridiculous processes I've done in a long while. Reminds me of my days trying to 
make X.org work on a laptop back around 2001. That's how bad it was. The Web 
abounds with people having difficulty doing this, and it took me several long, 
boring, frustrating days to get this working properly.

I may be naive, but I see no reason why a system with a font file and imagemagick
can't do almost everything in the training instructions automatically.

From my reading, it's basically the following process:
 - take a font
 - make some training images
 - tell tesseract about the content of those images
 - make a bunch of data files
 - put them in the right place
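For context, those abstract steps map onto roughly the following command sequence
from the 3.0x training wiki (a sketch only: the eng.newfont.exp0.* names are
placeholders, and the optional dictionary/unicharambigs steps are omitted):

    # 1. render or scan a training page -> eng.newfont.exp0.tif
    # 2. let tesseract guess the boxes, then hand-correct the .box file
    tesseract eng.newfont.exp0.tif eng.newfont.exp0 batch.nochop makebox
    # 3. re-run in training mode to produce the .tr feature file
    tesseract eng.newfont.exp0.tif eng.newfont.exp0 box.train
    # 4. build the character set and the classifier data files
    unicharset_extractor eng.newfont.exp0.box
    echo "newfont 0 0 0 0 0" > font_properties   # italic bold fixed serif fraktur
    mftraining -F font_properties -U unicharset -O eng.unicharset eng.newfont.exp0.tr
    cntraining eng.newfont.exp0.tr
    # 5. prefix the outputs with the language code and combine them
    mv inttemp eng.inttemp; mv pffmtable eng.pffmtable; mv normproto eng.normproto
    combine_tessdata eng.
    # 6. copy eng.traineddata into the tessdata directory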

Why can't the process for the user be something like:
 - train_tesseract new-font.ttf -l eng

That would in turn:
 - use imagemagick, the font, and some standard (customizable) training text to make a good training image
 - automatically correct the box file using the text it used to make the training image
 - feed the box file back into Tesseract, make all the correct data files, combine them and then put them in the tessdata directory
 - delete any messes left behind
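Such a wrapper's first step could be as simple as rendering the training text with
ImageMagick (a sketch; new-font.ttf, training_text.txt and the density/point size
are illustrative assumptions, not part of any existing tool):

    # render the standard training text with the new font at roughly 300 dpi
    convert -density 300 -background white -fill black \
            -font ./new-font.ttf -pointsize 12 \
            caption:@training_text.txt eng.newfont.exp0.tif

From there the command sequence sketched above could run unattended, with the box
file corrected against the exact text that was rendered.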

Original issue reported on code.google.com by mliss...@michaeljaylissner.com on 13 Feb 2012 at 7:59

GoogleCodeExporter commented 9 years ago
Wow, it is so easy! Why haven't you programmed it yet?
This is an open-source program, so everybody can contribute. I am waiting
desperately for your training program.

Original comment by zde...@gmail.com on 13 Feb 2012 at 11:17

GoogleCodeExporter commented 9 years ago
Excellent idea. Thanks to Zde, project member, for his support. Yes, I am also
waiting for your wonderful training program, which will be a boon for users. Kindly
start building the program without further delay. Wishing you the best of luck in
your good mission.

Original comment by withbles...@gmail.com on 13 Feb 2012 at 4:30

GoogleCodeExporter commented 9 years ago
@mlissner: there were such attempts, e.g.
http://code.google.com/p/tesseractindic/source/browse/#svn/trunk/tesseract_trainer
- I tried to improve it (see
https://github.com/zdenop/tesseract-auto-training), but I came back to training
from real scans...

IMHO the most difficult part of the current process (as described on the wiki) is
creating GOOD input images (with box files). The other steps can be done with a
simple script (see https://github.com/paalberti/tesseract-dan-fraktur and in
particular
https://github.com/paalberti/tesseract-dan-fraktur/blob/master/swe-frak/buildscript.sh)

Original comment by zde...@gmail.com on 13 Feb 2012 at 8:55

GoogleCodeExporter commented 9 years ago
As a new user approaching Tesseract, I think those scripts look great. Unfortunately, I
didn't find them when I needed them, and they're probably out of date anyway 
since they're not built into Tesseract. Is there a reason we can't provide 
something like these to make the process easier? 

The only step I see that needs human work is adjusting the box files. The rest 
seems like it should be done by a computer, and even adjusting the box files 
can be made pretty easy if we have a script that can merge the locations from 
the box file (roughly) with the letters from an input file.
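A rough sketch of that merge, assuming the recognizer produced exactly one box per
character and a UTF-8 locale (file names are hypothetical; merged or split boxes
would shift the alignment and still need manual fixing):

    # replace the glyph column of each box line with the next character of the
    # known training text (whitespace stripped); the coordinates are kept as-is
    truth=$(tr -d '[:space:]' < training_text.txt)
    i=0
    while read -r glyph coords; do
        printf '%s %s\n' "${truth:$i:1}" "$coords"
        i=$((i + 1))
    done < eng.newfont.exp0.box > eng.newfont.exp0.box.fixed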

Also - if the challenge is to create GOOD input images, that seems like another 
reason to build this into Tesseract itself, so that such images can be created 
by a computer, not by scanning/manipulating in iterative and hacky ways.

Original comment by mliss...@michaeljaylissner.com on 13 Feb 2012 at 10:06

GoogleCodeExporter commented 9 years ago
Lots of times, people don't have the desirable fonts; all they have is some 
scanned images of old documents they want to digitize. If you have the fonts, 
you can use jTessBoxEditor to generate TIFF/Box files suitable for training 
with Tesseract. Once you get a good set of them, you can use train.ps1 to 
automate generation of language data files.

Original comment by nguyen...@gmail.com on 15 Feb 2012 at 1:31

GoogleCodeExporter commented 9 years ago
The reason for scanning images rather than generating them from fonts is that the
classifier needs to be robust to the sorts of distortions that happen when 1) 
text is printed; and 2) when that text is scanned. 

Generating images with such distortions is an extremely under-researched area - 
I was only able to find one (1) paper on the topic. OCRopus had an 
implementation of the techniques presented in that paper, once upon a time, but 
I think it has been rewritten twice since then, and that component was not 
included in either rewrite.

Google have some software to do this, but it's written to target their internal 
facilities and needs to be rewritten to work anywhere else. They are going to 
release it, but there's no definite timeline (I was told about this in 2010). I 
presume it includes a distortion component, but never asked when I had the 
chance. In any case, (IIRC) it was used to generate the language data in the 
3.x series.

ImageJ has a component to generate a distortion model from a pair of images, which
might be useful in the meantime.
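Simpler still, and only as a crude stopgap, a rendered page can be degraded with
ImageMagick before training (a sketch; the skew/blur/noise values are arbitrary and
no substitute for a real print-and-scan degradation model):

    # simulate a little print/scan damage: slight skew, blur, noise, re-binarize
    convert eng.newfont.exp0.tif -rotate 0.3 -blur 0x0.6 \
            -attenuate 0.3 +noise Gaussian -threshold 60% \
            eng.newfont.exp0.degraded.tif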

(Oh, and if you think the tesseract training documentation is scary, don't ever 
look at the opencv documentation :)

Original comment by joregan on 23 Feb 2012 at 11:47

GoogleCodeExporter commented 9 years ago
Balthazar Rouberol created such a tool, as requested by the reporter - see
https://github.com/BaltoRouberol/TesseractTrainer
so I am closing this issue...

Original comment by zde...@gmail.com on 30 Jul 2012 at 11:59