AmitGorvadiya / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Support for multiple dictionaries #181

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I use Tesseract on pages that contain various kinds of information, such as
names and special expressions.
I understand that the three current dictionaries are enough to recognize
simple text in a given language. With multiple dictionaries, however, one
could group words together and, with a simple IsValidWord call, find out what
kind of information a piece of output belongs to.
I implemented that feature for my specific problem, so it is hardcoded, just
as the existing dictionaries are. It is now easy for me to tell first names
and last names apart, for instance, since I have two separate dictionaries
for them.
So a configurable number of dictionaries (either word lists or DAWG trees)
would be great. I would also help, or even code most of it myself, when I
have some time.

Is this an interesting feature, or is there no demand for something like that?
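
The word-group idea described above can be sketched outside Tesseract as follows. This is only an illustration under stated assumptions: `WordGroups`, `add_group`, `groups_for`, and `is_valid_word` are hypothetical names, not Tesseract's API, and plain in-memory sets stand in for the word lists or DAWG files a real implementation would load.

```python
from typing import Dict, Iterable, List, Set


class WordGroups:
    """Hypothetical container for several named word lists (not Tesseract's API)."""

    def __init__(self) -> None:
        self._groups: Dict[str, Set[str]] = {}

    def add_group(self, name: str, words: Iterable[str]) -> None:
        # Store words lower-cased so that lookups are case-insensitive.
        self._groups[name] = {w.lower() for w in words}

    def groups_for(self, word: str) -> List[str]:
        # Return every group containing the word, in the order groups were added.
        key = word.lower()
        return [name for name, words in self._groups.items() if key in words]

    def is_valid_word(self, word: str) -> bool:
        # A word is valid if at least one group contains it.
        return bool(self.groups_for(word))


# Made-up sample data, for illustration only.
groups = WordGroups()
groups.add_group("first_names", ["Anna", "Peter"])
groups.add_group("last_names", ["Meier", "Peter"])

for token in ["Anna", "Meier", "Peter", "xyz"]:
    print(token, groups.groups_for(token))
```

A word may belong to more than one group (here "Peter"), so the lookup returns a list of group names instead of a single label.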

Original issue reported on code.google.com by bkne...@ethz.ch on 9 Jan 2009 at 1:02

GoogleCodeExporter commented 9 years ago
Something like this may be in 3.00 or 3.01.

Original comment by theraysm...@gmail.com on 10 Mar 2009 at 8:48

GoogleCodeExporter commented 9 years ago

withblessings@gmail.com
With reference to: "I get that the three current dictionaries are enough to
recognize a simple text in a certain language" - interested to know which
certain language?

Original comment by withbles...@gmail.com on 20 Jun 2009 at 10:05

GoogleCodeExporter commented 9 years ago
That doesn't matter, does it? English, German, French, etc. What I meant is
that I would rather have word groups than languages. That way, you could
distinguish family names from first names or city names.
Also, you can currently recognize only ONE language at a time, not several
languages in one text, even though mixed-language text is very common. Being
able to distinguish various languages in a single text could be quite useful.

Original comment by bkne...@ethz.ch on 20 Jun 2009 at 1:31

GoogleCodeExporter commented 9 years ago
Very interesting feature. I am curious to know how your program works; would
it be possible to send it to me? I would be thankful. In fact, I wanted to
experiment with it for the Kannada language.

Original comment by withbles...@gmail.com on 20 Jun 2009 at 2:58

GoogleCodeExporter commented 9 years ago
Please have a look at the current svn version (r671). It is now possible to
use multiple language files (and therefore also their dictionaries):

    tesseract eurotext.tif eurotext -l eng+deu+spa+fra
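
When the language set varies per document, the same command line can be assembled from a script. The sketch below is only an illustration: `build_tesseract_command` is a hypothetical helper, and it relies on nothing beyond the `-l lang1+lang2` syntax shown in the command above. It only builds the argument list and does not run Tesseract.

```python
from typing import List, Sequence


def build_tesseract_command(image: str, output_base: str,
                            languages: Sequence[str]) -> List[str]:
    """Hypothetical helper: build the argument list for a multi-language run."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Language codes are joined with '+', as in the command shown above.
    return ["tesseract", image, output_base, "-l", "+".join(languages)]


cmd = build_tesseract_command("eurotext.tif", "eurotext",
                              ["eng", "deu", "spa", "fra"])
print(" ".join(cmd))  # prints: tesseract eurotext.tif eurotext -l eng+deu+spa+fra
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Tesseract and the listed language files are installed.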

Original comment by zde...@gmail.com on 11 Feb 2012 at 9:02