JohnWang0512 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Chinese OCR improvement using character frequency database #721

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. The Chinese OCR success rate is about 90%

What is the expected output?
# Users expect better results.

What version of the product are you using? On what operating system?
# I'm on Android, using the app "OCR Test":
https://play.google.com/store/apps/details?id=edu.sfsu.cs.orange.ocr

Please provide any additional information below.
# Errors may be avoided by combining the OCR output with a character frequency dataset.
# Film-subtitle-based: SUBTLEX-CH (Cai & Brysbaert 2010)
http://expsy.ugent.be/subtlex-ch/
# Newspaper-based: Da (2005), Modern Chinese Character Frequency List
http://lingua.mtsu.edu
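
A minimal sketch of the idea, assuming the recognizer can return several candidate characters with confidences for one glyph. The function name `rerank`, the `weight` parameter and the counts below are hypothetical; the counts are placeholders and are not taken from either dataset. This is not how Tesseract works internally, only an illustration of combining a confidence with a frequency prior.

```python
import math

def rerank(candidates, char_counts, weight=0.3):
    """Pick the candidate with the best combined score.

    candidates: list of (character, ocr_confidence) pairs, confidence in (0, 1].
    char_counts: dict mapping character -> corpus count.
    weight: how strongly the frequency prior influences the result (assumed value).
    """
    total = sum(char_counts.values())
    vocab = len(char_counts)

    def score(item):
        char, conf = item
        # Add-one smoothing so unseen characters keep a small nonzero prior.
        prior = (char_counts.get(char, 0) + 1) / (total + vocab)
        return math.log(conf) + weight * math.log(prior)

    return max(candidates, key=score)[0]

# Placeholder counts for illustration only.
toy_counts = {"的": 1000, "白": 50, "勺": 2}

# The recognizer slightly prefers the rare character; the prior tips the balance.
best = rerank([("勺", 0.40), ("的", 0.38)], toy_counts)
print(best)
```

With `weight=0` the prior is ignored and the raw OCR confidence decides, so the weight would need tuning on real data.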

Original issue reported on code.google.com by hugo....@gmail.com on 18 Jun 2012 at 12:06

GoogleCodeExporter commented 9 years ago
Are those sources ready to release their frequency datasets under the Apache License 
2.0 (or a compatible license)?

Original comment by zde...@gmail.com on 3 Aug 2012 at 6:53

GoogleCodeExporter commented 9 years ago
Agreed that something needs to be done in this area for the non-space-delimited 
languages. That includes Chinese, Japanese, Korean, Thai at least.

Original comment by theraysm...@gmail.com on 4 Nov 2014 at 6:48

GoogleCodeExporter commented 9 years ago
Hi, may I ask a question? Would it work to build the dictionary that 
load_system_dawg loads from a Chinese character frequency list? I ask because a 
single Chinese character is a word by itself, while English words contain several letters.
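
To try this, the frequency list would first have to become a plain word list with one entry per line, which is the kind of input the wordlist2dawg training tool takes as far as I understand. The sketch below only prepares that file. The helper names, the `(character, count)` row format and the output file name are assumptions, and whether a dictionary of single-character words actually improves recognition is untested here.

```python
def top_characters(rows, limit):
    """Return the `limit` most frequent single characters, most frequent first.

    rows: iterable of (text, count) pairs parsed from a frequency list
          (hypothetical format; real datasets need their own parser).
    """
    # Keep only single-character entries, in case the list also holds words.
    singles = [(text, count) for text, count in rows if len(text) == 1]
    singles.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in singles[:limit]]

def write_word_list(chars, path):
    """Write one character per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        for char in chars:
            handle.write(char + "\n")

# Placeholder rows for illustration only.
rows = [("的", 1000), ("我们", 500), ("一", 800), ("是", 900)]
chars = top_characters(rows, limit=2)
write_word_list(chars, "chi_sim.word_list.txt")  # hypothetical file name
```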

Original comment by wenhuac...@gmail.com on 21 Apr 2015 at 8:45