fcheng00 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Add Rupee symbol to Hindi and other Indic languages #1329

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run tesseract with the Hindi traineddata on an image that contains the 
Indian Rupee symbol.

What is the expected output? What do you see instead?
This is a new symbol and is not recognized.

What version of the product are you using? On what operating system?
Latest version from git,
built with MSYS2 on Windows 8.

Please provide any additional information below.

"On 10 August 2010, the Unicode Technical Committee accepted the proposed code 
position U+20B9 ₹ INDIAN RUPEE SIGN (HTML: &#8377;). The character has been 
encoded in Unicode 6.0, and named distinctly from the existing character 
U+20A8 ₨ RUPEE SIGN (HTML: &#8360;), which will continue to be available as 
the generic rupee sign."
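
To make the distinction between the two characters concrete, here is a minimal Python sketch that prints the code point, Unicode name and decimal HTML entity for each sign, and counts how often each appears in a piece of text such as OCR output. The helper names `describe` and `count_rupee_signs` are illustrative only and are not part of Tesseract.

```python
import unicodedata

INDIAN_RUPEE_SIGN = "\u20b9"   # the new sign, encoded in Unicode 6.0
GENERIC_RUPEE_SIGN = "\u20a8"  # the older, generic rupee sign


def describe(ch):
    """Return (code point, Unicode name, decimal HTML entity) for one character."""
    code_point = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch, "<unknown>")
    entity = f"&#{ord(ch)};"
    return code_point, name, entity


def count_rupee_signs(text):
    """Count occurrences of each rupee sign in text (e.g. OCR output)."""
    return {
        "indian": text.count(INDIAN_RUPEE_SIGN),
        "generic": text.count(GENERIC_RUPEE_SIGN),
    }


if __name__ == "__main__":
    for sign in (INDIAN_RUPEE_SIGN, GENERIC_RUPEE_SIGN):
        print(sign, *describe(sign))
```

Running `count_rupee_signs` on the text produced for a test image is one way to check whether the new sign survived recognition or was dropped.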

Attached is a sample Devanagari text file with the rupee symbol.

Original issue reported on code.google.com by shreeshrii on 8 Oct 2014 at 3:01


GoogleCodeExporter commented 9 years ago
It probably also needs to be added to English as part of the currency symbols; 
it is supported on Windows in all English fonts.

Original comment by shreeshrii on 8 Oct 2014 at 3:12

GoogleCodeExporter commented 9 years ago
Created a desired_characters file and added it to each of:
asm ben bih hin mar nep san bod dzo guj kan mal ori pan sin tam tel
That doesn't mean we can train in all of them, but if/when we can, the rupee 
will be there!
In particular, the Hindi training process is currently broken, so the sign 
will most likely not be in 3.04 for Hindi.
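
For anyone who wants to do the same in their own training data, here is a rough Python sketch of that step. It assumes a layout of `<root>/<lang>/desired_characters` with one character per line; that layout and the function name `add_rupee` are assumptions for illustration, not a description of the actual langdata tooling.

```python
from pathlib import Path

RUPEE = "\u20b9"  # INDIAN RUPEE SIGN

# Language codes listed in the comment above.
LANGS = "asm ben bih hin mar nep san bod dzo guj kan mal ori pan sin tam tel".split()


def add_rupee(root, langs=LANGS):
    """Ensure <root>/<lang>/desired_characters contains the rupee sign.

    The directory layout is hypothetical. Returns the list of language
    codes whose file was created or modified.
    """
    changed = []
    for lang in langs:
        path = Path(root) / lang / "desired_characters"
        path.parent.mkdir(parents=True, exist_ok=True)
        existing = path.read_text(encoding="utf-8") if path.exists() else ""
        if RUPEE in existing:
            continue  # already listed, leave the file untouched
        # Keep one entry per line: add a newline only if the file lacks one.
        sep = "" if existing == "" or existing.endswith("\n") else "\n"
        path.write_text(existing + sep + RUPEE + "\n", encoding="utf-8")
        changed.append(lang)
    return changed
```

The function is idempotent, so running it a second time leaves the files unchanged and returns an empty list.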

Original comment by theraysm...@gmail.com on 4 Nov 2014 at 9:15

GoogleCodeExporter commented 9 years ago
Disappointed that it may not be in 3.04 for Hindi.

It should also be added to English - along with all other currency symbols.

http://en.wikipedia.org/wiki/Sinhala_language 
http://en.wikipedia.org/wiki/Sri_Lankan_rupee

The Sri Lankan Rupee is different from the Indian Rupee. So the sign could be 
added as part of support for all currency signs, like $ etc.

Original comment by shreeshrii on 5 Nov 2014 at 3:54

GoogleCodeExporter commented 9 years ago
Similarly for Nepali

http://en.wikipedia.org/wiki/Nepalese_rupee

Original comment by shreeshrii on 5 Nov 2014 at 6:17

GoogleCodeExporter commented 9 years ago
dzo - 
http://en.wikipedia.org/wiki/Dzongkha
http://en.wikipedia.org/wiki/Bhutanese_rupee
http://en.wikipedia.org/wiki/Bhutanese_ngultrum

For nep, sin and dzo (languages of countries neighboring India), the 
Indian Rupee symbol can be included if it is supported in their fonts.

--
Sorry for the confusion caused because of my use of 'Indic' when I should have 
said 'Indian'.

Original comment by shreeshrii on 5 Nov 2014 at 6:56