List of languages in development

rkcosmos commented 4 years ago

I will update this issue to track the development of new languages. The current list is:

Group 1 (Arabic script)

Arabic (DONE, August 5, 2020)
Uyghur (DONE, August 5, 2020)
Persian (DONE, August 5, 2020)
Urdu (DONE, August 5, 2020)

Group 2 (Latin script)

Serbian-latin (DONE, July 12, 2020)
Occitan (DONE, July 12, 2020)

Group 3 (Devanagari)

Hindi (DONE, July 24, 2020)
Marathi (DONE, July 24, 2020)
Nepali (DONE, July 24, 2020)
Rajasthani (NEED HELP)
Awadhi, Haryanvi, Sanskrit (if possible)

Group 4 (Cyrillic script)

Russian (DONE, July 29, 2020)
Serbian-cyrillic (DONE, July 29, 2020)
Bulgarian (DONE, July 29, 2020)
Ukrainian (DONE, July 29, 2020)
Mongolian (DONE, July 29, 2020)
Belarusian (DONE, July 29, 2020)
Tajik (DONE, April 20, 2021)
Kyrgyz (NEED HELP)

Group 5

Telugu (DONE, November 17, 2020)
Kannada (DONE, November 17, 2020)

Group 6 (Languages that don't share characters with others)

Tamil (DONE, August 10, 2020)
Hebrew (ready to train)
Malayalam (ready to train)
Bengali + Assamese (DONE, August 23, 2020)
Punjabi (ready to train)
Abkhaz (ready to train)

Group 7 (Improvement and possible extra models)

Japanese version 2 (DONE, March 21, 2021) + vertical text
Chinese version 2 (DONE, March 21, 2021) + vertical text
Korean version 2 (DONE, March 21, 2021)
Latin version 2 (DONE, March 21, 2021)
Math + Greek?
Number+symbol only

Guideline for new language request

To request support for a new language, please send a PR with the following 2 files:

In folder easyocr/character, we need 'yourlanguagecode_char.txt', which contains a list of all characters. Please see the format/examples in other files in that folder.
In folder easyocr/dict, we need 'yourlanguagecode.txt', which contains a list of words in your language. On average we have ~30000 words per language, with more than 50000 words for popular ones. More is better in this file.

If your language has unique elements (for example, 1. Arabic: characters change form when attached to each other, and text is written from right to left; 2. Thai: some characters sit above the line and some below), please educate me to the best of your ability and/or give useful links. It is important to take care of the details to achieve a system that really works.

Lastly, please understand that my priority has to go to popular languages or sets of languages that share most of their characters (also tell me if your language shares a lot of characters with others). It takes me at least a week to work on a new model. You may have to wait a while for a new model to be released.

yossibiton commented 4 years ago

Why won't you share the training code, so people could train the model by themselves?

madrugado commented 4 years ago

For group 4 you could add Ukrainian, Bulgarian, and maybe Mongolian; although it is not Slavic, it uses Cyrillic script.

edloginova commented 4 years ago

Do you plan to only work with human languages? It would be amazing to add a model to recognize mathematical formulas.

manohar-cyber commented 4 years ago

I guess Tamil and Telugu can be added to one group because they belong to a language family called 'Dravidian', meaning they relate to each other in terms of grammar and word arrangement. Two other languages popular in India that belong to that family can also be added to that group: Kannada and Malayalam (for further info: https://en.m.wikipedia.org/wiki/Dravidian_languages). Moreover, Telugu and Kannada share some common letters and words. I will be adding the alphabet and words of the Kannada language for the language request. Great project, keep it up👍

bgmastermind commented 4 years ago

For Group 4, Bulgarian: dict bg.txt, char bg_char.txt

upadhyayprakash commented 4 years ago

I'd highly recommend supporting the Devanagari script (Wiki - https://en.wikipedia.org/wiki/Devanagari), which is the fourth most widely adopted writing system in the world. Please go through the Wikipedia link to understand its widespread usage across many languages, including Sanskrit, Hindi, Marathi, Awadhi and Haryanvi.

I see you have included "Hindi" as a target language, which of course, is the most spoken language in the Indian Subcontinent.

If you could let me know what's the current word-count you have (maybe share the "dict" & "alphabets" directory), I can continue with the research to share more details about the Language as it's my First Language.

Hindi has 47 primary alphabets (including 14 Vowels & 33 Consonants).

You can contact me @ prakash.upadhyay93@yahoo.com

arashjafari commented 4 years ago

Can I help with the Persian (Farsi) language? I can supply some popular words and characters.

@rkcosmos

junaidgirkar commented 4 years ago

Can I contribute in any way? I am fluent in Hindi as well as English. I may also be of help on the programming side: I know Python, C and Java, and I am good at front-end work with HTML, CSS and JavaScript (basic).

manmeet3591 commented 4 years ago

I recommend adding the Punjabi language, which is the 10th most spoken language in the world. pb_char.txt

rkcosmos commented 4 years ago

@edloginova After doing human language, we can explore math as well.

@upadhyayprakash The lists are here: easyocr/character and easyocr/dict

@arashjafari It looks like we already have both words and chars. You can recheck whether everything is alright.

@junaidgirkar Sounds good, I'll keep that in mind. I may call on you for help.

Vijayabhaskar96 commented 4 years ago

> Why won't you share the training code, so people could train the model by themselves?

I agree with this. If the training code and a sample dataset are provided, many people can train the model for their language. With free GPU services like Google Colab and Kaggle Kernels, anyone can train them online and contribute much faster.

rkcosmos commented 4 years ago

> Why won't you share the training code, so people could train the model by themselves?

@yossibiton @Vijayabhaskar96 Because the training process is still not straightforward. Even I have to think carefully when creating a model. I will share it later when it's clean. Please don't pressure me; I am doing a lot of work for free.

rahilwazir commented 4 years ago

@rkcosmos Can we add support for Urdu? It is very similar to Persian and Arabic (though without many of the complexities of Arabic).

Vijayabhaskar96 commented 4 years ago

> > Why won't you share the training code, so people could train the model by themselves?
>
> @yossibiton @Vijayabhaskar96 Because the training process is still not straightforward. Even I have to think carefully when creating a model. I will share it later when it's clean. Please don't pressure me; I am doing a lot of work for free.

Sorry I made you feel this way; I didn't mean to pressure you. I just wanted to help. Take your time, you're doing great work!

fnasim commented 4 years ago

@rkcosmos For Group 1, could you please add Urdu to that group? Urdu is very similar to Arabic and Persian and I've just submitted the PR for the character list and a dictionary. So it should be ready to go!

cc: @rahilwazir

loayamin commented 4 years ago

This might help for Arabic:

https://github.com/OSINTAI/Arabic_Words

sardasumit commented 4 years ago

I added the Marathi character and dictionary dataset files. Please train it. mr.txt

rkcosmos commented 4 years ago

> I added the Marathi character and dictionary dataset files. Please train it. mr.txt

@sardasumit Did you forget a link for mr_char.txt?

sardasumit commented 4 years ago

> > I added the Marathi character and dictionary dataset files. Please train it. mr.txt
>
> @sardasumit Did you forget a link for mr_char.txt?

@rkcosmos It is the same as the Hindi character list. mr_char.txt

nishad commented 4 years ago

@rkcosmos Malayalam (https://en.wikipedia.org/wiki/Malayalam), belongs to Group 6. https://github.com/JaidedAI/EasyOCR/pull/143 This PR contains character and word lists.

imvladikon commented 4 years ago

Hi! Thanks for your work. Some notes about Hebrew: some letters have final forms, meaning the letter changes shape when it is placed at the end of a word (https://en.wikipedia.org/wiki/Final_form). There are also diacritical signs (https://en.wikipedia.org/wiki/Niqqud) that are used to represent vowels or to distinguish between alternative pronunciations of letters. (Arabic also has final forms, among others, and diacritical signs.) I didn't provide the diacritical signs; I assume it's better to train without them first, since everyday writing consists of plain letters without diacritics.

imvladikon commented 4 years ago

I remembered an important thing. Hebrew has a cursive script (https://en.wikipedia.org/wiki/Cursive_Hebrew), and people sometimes mix it with regular writing, even in printed matter. The letters (chars) are the same, but it is effectively another font (e.g. https://opensiddur.org/wp-content/uploads/fonts/display-font-charmap.php?fnt=DorianCLM-Italic). Maybe it's better not to implement this immediately; I don't know.

rkcosmos commented 4 years ago

@nishad Malayalam and Tamil are both Dravidian but do not use the same script, so I have to build two models. @imvladikon OK, I will try to keep this in mind when building the Hebrew model.

rkcosmos commented 4 years ago

Question for speakers of Indian languages: I'm looking into the Hindi char and dict files, and there are a lot of chars that appear in the word list but not in the char list. Examples are ['ा', '्', 'ि', 'ी', 'ं', 'ो', 'ु', 'ँ', 'ू', 'ड़', 'ै']. What are these symbols?
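One way to investigate this kind of mismatch is to list every character that occurs in the word list but not in the character list, then look up its Unicode name and category. The sketch below is only illustrative: `missing_chars` and `describe` are hypothetical helpers, not part of EasyOCR, and the inputs are tiny in-memory stand-ins for the real hi.txt and hi_char.txt files. Unicode categories `Mn` and `Mc` denote combining marks (dependent vowel signs, virama, and so on) that attach to a preceding consonant.

```python
import unicodedata


def missing_chars(words, charset):
    """Return the sorted characters used in `words` that are absent from `charset`."""
    used = set("".join(words))
    return sorted(used - set(charset))


def describe(ch):
    """Return (code point, Unicode category, Unicode name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "UNKNOWN"))


# Toy stand-ins for the real files (assumption: the actual lists are much larger).
words = ["\u0915\u093e", "\u0915\u093f"]   # KA + vowel sign AA, KA + vowel sign I
charset = ["\u0915"]                       # KA only

for ch in missing_chars(words, charset):
    print(describe(ch))
```

Characters reported with category `Mn` or `Mc` are not standalone letters, which would explain why a list of "letters" omits them even though every written word needs them.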

Vijayabhaskar96 commented 4 years ago

@rkcosmos Those are signs that attach to existing letters; when combined, they create a new letter form. I think the technical term is grapheme? I'm not sure. I would like to know whether they render fine or whether something goes wrong as it did with Tamil.

rkcosmos commented 4 years ago

@Vijayabhaskar96 So far, Devanagari doesn't have any problem. It is well supported in Unicode.

imvladikon commented 4 years ago

Another addition about Hebrew ;) and it's important. Some diacritic signs matter, like geresh and gershayim. Geresh is used with ג ז צ to write sounds such as j and ch that are not represented in the alphabet. The double geresh (gershayim) is used in widely used abbreviations (kitsur); the most famous is תנ"ך (Tanakh). Sometimes people use ordinary quotation marks or apostrophes instead of typing geresh or gershayim (e.g. תנ''ך).
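If the word list mixes ASCII quote marks with real geresh/gershayim, one option is to normalize them before building the character list. The following is a minimal sketch under that assumption; `normalize_geresh` is a hypothetical helper, and whether the training data should be normalized this way at all is a separate decision that this thread does not settle.

```python
import re

HEB = "\u05d0-\u05ea"      # Hebrew letters alef..tav (range for a regex class)
GERESH = "\u05f3"          # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05f4"       # HEBREW PUNCTUATION GERSHAYIM


def normalize_geresh(text):
    """Map ASCII quote marks used inside Hebrew words to geresh/gershayim."""
    # Two apostrophes or one double quote between Hebrew letters -> gershayim.
    text = re.sub(rf"(?<=[{HEB}])(?:''|\")(?=[{HEB}])", GERSHAYIM, text)
    # A single apostrophe directly after a Hebrew letter -> geresh.
    text = re.sub(rf"(?<=[{HEB}])'", GERESH, text)
    return text


# Both ASCII spellings of the abbreviation collapse to the same string.
a = normalize_geresh("\u05ea\u05e0\"\u05da")
b = normalize_geresh("\u05ea\u05e0''\u05da")
print(a == b)
```

Quote marks that are not attached to Hebrew letters (for example in Latin text) are left untouched, so the same pass can run over mixed-language input.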

omprakash-jena commented 4 years ago

Can you please add the Odia language, which is also a classical language of India and would come under group 6? My mail ID: jena.omprakash@gmail.com. I can provide you with datasets for the Odia language.

rkcosmos commented 4 years ago

@omprakash-jena You can create a pull request to add the files, or you can attach the files in a comment here.

rkcosmos commented 4 years ago

@Vijayabhaskar96 @sardasumit @junaidgirkar @upadhyayprakash

Question for speakers of Indian languages: I'm testing the Devanagari model with this image (hi1). The result is ['50', '40', 'बसझरकर', 'SPEED', 'मािकट', 'LIMIT', 'BASRURKAR', 'MARKET'].

The problem is with मािकट. It doesn't look like what is written in the original image. But when I run for c in 'मािकट': print(c), I get म ा ि क ट, which looks about right. What's going on here? Is it just the way Python renders Hindi?

nishad commented 4 years ago

@rkcosmos

Both Devanagari strings are read incorrectly from the image. They should be बसरूरकर and मार्किट. These characters join (र + ् + क + ि) and render as र्कि.
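Printing the code points of both strings makes the difference easier to see than printing the glyphs. In Unicode, the vowel sign ि (U+093F) is stored after the consonant it belongs to, even though it is drawn to the left of it. The sketch below uses only the standard library; `codepoints` is a hypothetical helper, and the two strings are the model output and the reading quoted above.

```python
import unicodedata


def codepoints(text):
    """List (character, code point, Unicode name) in stored (logical) order."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


model_output = "\u092e\u093e\u093f\u0915\u091f"            # string returned by the model
expected = "\u092e\u093e\u0930\u094d\u0915\u093f\u091f"    # reading given in the reply

for label, text in (("model", model_output), ("expected", expected)):
    print(label)
    for _, cp, name in codepoints(text):
        print(f"  {cp}  {name}")
```

Read this way, the model output differs in two respects: the sequence र + ् (RA + VIRAMA) is missing, and the vowel sign I is stored before KA (its visual position) rather than after it.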

rkcosmos commented 4 years ago

@nishad Wow, this is really hard. It means the OCR needs to understand how to combine characters in a very complex way.

nishad commented 4 years ago

@rkcosmos, this is complex and I am not knowledgeable in explaining this. @santhoshtr could you please share your expertise ?

Vijayabhaskar96 commented 4 years ago

@rkcosmos I don't speak Hindi, but this is interesting. I think the problem is that hi_char.txt doesn't have all the chars. For example, कि is not there, but क and ि are present, while क + ि = कि. If you look at the Tamil chars, all the letters with those extra parts are present, instead of just the root letter and the extra parts separately. I.e. க + ா = கா; both க and கா are present in ta_char.txt but not ா. I think this is the right way to add characters to the lang_char.txt file. There are many combinations here which, along with many others, I think should be added as unique characters, but just 85 chars exist in hi_char.txt. I think this depends on what people call a character, but if कि exists in hi_char.txt, it will be easier for the network to simply use it instead of figuring out the order of all the parts that constitute the letter, right?

I might be wrong here about Hindi alphabets as I don't speak the language, do correct me if I'm wrong @nishad

rkcosmos commented 4 years ago

@Vijayabhaskar96 It depends on how many combinations there are for each language. It's possible to do it both ways. For example, I did Thai before with separated characters. We have something like ท + ี + ่ = ที่. The first three characters are in the list but the combined form is not. For Thai, I use separated characters because the number of possible combinations is just too large to imagine. For Tamil, you have 325 combined forms, which I think should be doable. Now for Hindi, they have र्कि, which is a combined form of 4 separate characters! I would guess the number of possible combined forms is extremely large, so we might have to go the separated-char way. I just hope that my current neural network settings can learn such complexity. I will let everyone know in a few days whether it works or not.
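The trade-off between the two approaches can be estimated from a word list by counting distinct code points (separated) against distinct combined forms (combined). The sketch below is a deliberately simplified clustering rule, not the full Unicode grapheme-cluster algorithm and not what EasyOCR does internally: a character joins the previous cluster if it is a combining mark, or if the previous cluster ends in the Devanagari virama. `clusters` is a hypothetical helper.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari virama (halant); joins consonants into a conjunct


def clusters(text):
    """Group each base character with its following marks and virama-joined consonants."""
    out = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        joins_previous = bool(out) and out[-1].endswith(VIRAMA)
        if out and (is_mark or joins_previous):
            out[-1] += ch
        else:
            out.append(ch)
    return out


word = "\u092e\u093e\u0930\u094d\u0915\u093f\u091f"   # the 7-code-point word from above
separated = sorted(set(word))             # symbols needed by the separated approach
combined = sorted(set(clusters(word)))    # symbols needed by the combined approach
print(len(separated), separated)
print(len(combined), combined)
```

Running the same count over a full dictionary file would give a rough idea of how large the output alphabet becomes under the combined approach for a given script.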

Vijayabhaskar96 commented 4 years ago

According to the link in my previous comment, there are about 500 combinations in Hindi. Even counting the combinations that aren't in that link, I don't think it will exceed 2000 characters, and the Telugu list already has 2000+ chars. So try what works best; findings from this experiment may simplify the training approach for other languages.

imvladikon commented 4 years ago

Is Belarusian ready? (commit) The list probably needs an update.

rkcosmos commented 4 years ago

example3

Version 1.1.5 now supports Devanagari script. Please test, give feedback and spread the word to your community.

sardasumit commented 4 years ago

Hello, please send me the training code and I will look into this issue.

Thank you

On Tue, 21 Jul 2020 at 7:01 pm, Vijayabhaskar <notifications@github.com> wrote:

> @rkcosmos I don't speak Hindi, but this is interesting. I think the problem is hi_char.txt doesn't have all the chars. For example: र् is not there but र and ् are present while र + ् = र् If you look at Tamil chars all the alphabets with those extra parts are present instead of just including the root alphabet and the extra parts separately. i.e க + ா = கா, both க and கா are present in ta_char.txt but not ா

zoghal commented 4 years ago

Hello dear, can I help with the Persian (Farsi) language? I can supply some popular words and characters. I also work in the field of Persian typeface design and libre Persian fonts.

rkcosmos commented 4 years ago

@zoghal please add more words to easyocr/character/fa.txt

rkcosmos commented 4 years ago

Cyrillic script (Russian, Serbian (cyrillic), Belarusian, Bulgarian, Mongolian, Ukrainian) is available for testing in the latest update (not on pip yet). Please test and give feedback.

imvladikon commented 4 years ago

Belarusian, Russian, Ukrainian: https://colab.research.google.com/drive/1Sy1endzzbommR2b5pyw7YIoa8CIV-jkN?usp=sharing The quality is not bad, but not the best: in the Colab, the wiki test for Ukrainian and Belarusian is far from perfect, and the Belarusian model sometimes fails to split words correctly. In Ukrainian there is an additional sign, the apostrophe ’, which marks softness or, on the contrary, emphasizes a hard consonant (as in ім'я), depending on phonetic rules and cases.

rkcosmos commented 4 years ago

@imvladikon Thanks for the analysis, it's very useful. I'll keep the special character issue in mind for the next fine-tuning. For low-resolution images, you might need to change some parameters to get better results. For example, I would try mag_ratio = 1.2 (that's zooming in by 20%). One thing I don't understand in your Colab is the expected result for ulica_be.jpg. It's the phrase 'лінгвістичний'. Is it a typo, or does your language have a special rule for combining characters?
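To see whether zooming helps on a particular image, the same image can be read at a few magnification ratios and the results compared side by side. This is a small sketch, not a recommended tuning procedure: `read_at_zoom_levels` is a hypothetical helper, the ratio values are arbitrary choices, and `reader` is assumed to behave like an `easyocr.Reader` whose `readtext` accepts `detail` and `mag_ratio`.

```python
def read_at_zoom_levels(reader, image, mag_ratios=(1.0, 1.2, 1.5)):
    """Run OCR on one image at several magnification ratios and collect the text.

    `reader` is expected to behave like an easyocr.Reader; `detail=0` asks
    readtext for plain strings instead of (box, text, confidence) tuples.
    """
    return {ratio: reader.readtext(image, detail=0, mag_ratio=ratio)
            for ratio in mag_ratios}


# Usage (requires easyocr and a model download, so it is left as a comment):
#   import easyocr
#   reader = easyocr.Reader(['be'])
#   for ratio, texts in read_at_zoom_levels(reader, 'ulica_be.jpg').items():
#       print(ratio, texts)
```

Passing the reader in as an argument keeps the model loaded once across all ratios, which matters because constructing a reader is the slow step.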

imvladikon commented 4 years ago

@rkcosmos Yeah, it's a typo :) For Belarusian it should be "лінгвістычны"; I fixed it. The word was accidentally written in Ukrainian, "лінгвістичний" ("linguistic").

rkcosmos commented 4 years ago

Arabic is now supported in v1.1.6. To use the latest version, you need to uninstall first: pip uninstall easyocr

and then install directly from source: pip install git+git://github.com/jaidedai/easyocr.git

Please try it and give feedback.

Vijayabhaskar96 commented 4 years ago

You can do that in a single command: pip install git+git://github.com/jaidedai/easyocr.git --upgrade. There is no need to uninstall manually.

Vijayabhaskar96 commented 4 years ago

@rkcosmos Thanks for adding support for Tamil. I tested a few images and it worked well; some images required fiddling with the args for better results. Anyway, great job overcoming the Unicode issues. May I know what you did? And if you used pyvips as I suggested, how did it go? Can you explain briefly?

fnasim commented 4 years ago

@rkcosmos You marked Urdu as done (great news!). Can I try it in the latest version?

rahilwazir commented 4 years ago

@fnasim @rkcosmos I've been testing Urdu with multiple variations. It's not completely there yet, but it's close. There are still some subtle differences.

The بھائی is recognized as میں ,بجال is recognized as لکھیں ,ئیں recognized as اکھیں etc. I will continue to test more and share the results.

JaidedAI / EasyOCR

List of languages in development #91