baopham1340 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Bihari Training text not representative #1347

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. See https://code.google.com/p/tesseract-ocr/source/browse/bih/bih.training_text?repo=langdata

What is the expected output? What do you see instead?
The training text contains only names of months, dates, and times.

What version of the product are you using? On what operating system?
latest version on git

Please provide any additional information below.

A more realistic training sample can be taken from http://bh.wikipedia.org/
This is in Bhojpuri, one of the Bihari languages.

(see http://en.wikipedia.org/wiki/Bihari_languages)

Original issue reported on code.google.com by shreeshrii on 19 Oct 2014 at 5:14

GoogleCodeExporter commented 9 years ago
Also see http://www.ntm.org.in/languages/maithili/faqs_ntm.asp
for Maithili, another of the Bihari languages.

Original comment by shreeshrii on 19 Oct 2014 at 5:21

GoogleCodeExporter commented 9 years ago
For a Maithili dictionary, also see:

http://www.videha.com/2009/08/blog-post_19.html

http://www.ignca.nic.in/coilnet/kalyani.htm

Original comment by shreeshrii on 19 Oct 2014 at 5:36

GoogleCodeExporter commented 9 years ago
Attached is a sample wordlist for the Bihari languages that could be used for training.

Original comment by shreeshrii on 27 Oct 2014 at 9:09

Attachments:

GoogleCodeExporter commented 9 years ago
Thanks for pointing that out!

Original comment by theraysm...@gmail.com on 4 Nov 2014 at 10:22

GoogleCodeExporter commented 9 years ago
A larger wordlist for training could be taken from the Bihari dictionary file:

http://sanskritdocuments.org/hindi/hunspell
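
As a rough sketch of how a plain wordlist might be pulled out of a Hunspell dictionary like the one linked above: this assumes the usual `.dic` layout, where the first line is an approximate entry count and each later line is a word optionally followed by `/flags` and morphological fields. The function name `words_from_hunspell_dic` is hypothetical and is not part of Tesseract or Hunspell. It also ignores less common cases such as escaped slashes or entries containing spaces.

```python
def words_from_hunspell_dic(lines):
    """Yield unique base words from the lines of a Hunspell .dic file.

    Assumption: standard .dic layout (count line first, then one entry
    per line as WORD[/FLAGS] [morphological fields]).
    """
    seen = set()
    for index, raw in enumerate(lines):
        line = raw.strip()
        if not line:
            continue
        # The first line of a .dic file is normally an approximate entry count.
        if index == 0 and line.isdigit():
            continue
        # Drop morphological fields (after whitespace), then affix flags (after "/").
        word = line.split()[0].split("/", 1)[0]
        if word and word not in seen:
            seen.add(word)
            yield word


if __name__ == "__main__":
    # Illustrative in-memory sample, not taken from the real dictionary.
    sample = ["3", "घर/A", "पानी", "घर/B"]
    for word in words_from_hunspell_dic(sample):
        print(word)
```

To run it on a real file, the lines could be read with `open(path, encoding="utf-8")`, where the path and the encoding are assumptions to check against the dictionary's `.aff` file.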

Original comment by shreeshrii on 5 Nov 2014 at 4:37

GoogleCodeExporter commented 9 years ago
Moved to github: https://github.com/tesseract-ocr/langdata/pull/11

Original comment by joregan on 14 May 2015 at 12:48