PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
Apache License 2.0

Multilingual OCR Development Plan #1048

Open D-DanielYang opened 3 years ago

D-DanielYang commented 3 years ago
| model name | description | model size | download | Update Date |
| --- | --- | --- | --- | --- |
| ch | Chinese and English | 3.71M | inference model / trained model | 2020.9.22 |
| ch_tra | Traditional Chinese | 5.63M | inference model / trained model | 2021.1.21 |
| en | English | 2.56M | inference model / trained model | 2020.9.22 |
| fr | French | 2.65M | inference model / trained model | 2021.9.22 |
| ar | Arabic | 2.53M | inference model / trained model | 2021.1.21 |
| es | Spanish | 2.53M | inference model / trained model | 2021.1.21 |
| pt | Portuguese | 2.63M | inference model / trained model | 2021.1.21 |
| ru | Russian | 2.63M | inference model / trained model | 2021.1.21 |
| ge | German | 2.65M | inference model / trained model | 2020.9.22 |
| kr | Korean | 3.9M | inference model / trained model | 2020.9.22 |
| jp | Japanese | 4.23M | inference model / trained model | 2020.9.22 |
| it | Italian | 2.53M | inference model / trained model | 2021.1.21 |
| hi | Hindi | 2.63M | inference model / trained model | 2021.1.21 |
| ug | Uyghur | 2.63M | inference model / trained model | 2021.1.21 |
| fa | Persian | 2.63M | inference model / trained model | 2021.1.21 |
| ur | Urdu | 2.63M | inference model / trained model | 2021.1.21 |
| oc | Occitan | 2.53M | inference model / trained model | 2021.1.21 |
| mr | Marathi | 2.63M | inference model / trained model | 2021.1.21 |
| ne | Nepali | 2.63M | inference model / trained model | 2021.1.21 |
| rs_cyrillic | Serbian (Cyrillic) | 2.63M | inference model / trained model | 2021.1.21 |
| rs_latin | Serbian (Latin) | 2.53M | inference model / trained model | 2021.1.21 |
| bg | Bulgarian | 2.63M | inference model / trained model | 2021.1.21 |
| uk | Ukrainian | 2.63M | inference model / trained model | 2021.1.21 |
| be | Belarusian | 2.63M | inference model / trained model | 2021.1.21 |
| te | Telugu | 2.63M | inference model / trained model | 2021.1.21 |
| kn | Kannada | 2.63M | inference model / trained model | 2021.1.21 |
| ta | Tamil | 2.63M | inference model / trained model | 2021.1.21 |
| mg | Mongolian | -- | Ongoing | |
| bg | Bangla | -- | Need dict and corpus | |
| bm | Burmese | -- | Need dict and corpus; call for contribution | |
| ku_cent | Central Kurdish | -- | PR8347; call for contribution | |
| od | Odia | -- | PR6348; call for contribution | |
| th | Thai | -- | PR6719, issue chat; call for contribution | |

More TBC

Guideline for new language requests

To request support for a new language, open a PR containing the following two files:

  1. In the folder ppocr/utils/dict, submit the dictionary file, named {language}_dict.txt, containing a list of all characters. See the other files in that folder for the format.

  2. In the folder ppocr/utils/corpus, submit the corpus file, named {language}_corpus.txt, containing a list of words in your language. At least 50,000 words per language are probably needed, and the more, the better.

We are also calling for contributions to add new language support to PaddleOCR. For anyone interested in training a new language model, guidance on training the model is provided.

If your language has unique elements, please let me know in advance by any means, such as useful links, Wikipedia pages and so on.
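
As a rough illustration of the two files described above, the following Python sketch derives both from a block of raw text. It is only a sketch under assumptions: the one-entry-per-line layout is assumed (check the existing files in ppocr/utils/dict for the actual format), and the function name `build_dict_and_corpus` and its parameters are hypothetical, not part of PaddleOCR.

```python
from collections import Counter
from pathlib import Path
from typing import Tuple


def build_dict_and_corpus(raw_text: str, language: str, out_dir: Path) -> Tuple[Path, Path]:
    """Write {language}_dict.txt and {language}_corpus.txt from raw text.

    Assumption: both files hold one entry per line (one character per line
    in the dict, one word per line in the corpus).
    """
    words = raw_text.split()
    # Unique words, most frequent first; equally frequent words keep
    # the order in which they first appeared.
    unique_words = [word for word, _ in Counter(words).most_common()]
    # Every distinct non-whitespace character, sorted by code point.
    chars = sorted({ch for word in unique_words for ch in word})

    out_dir.mkdir(parents=True, exist_ok=True)
    dict_path = out_dir / f"{language}_dict.txt"
    corpus_path = out_dir / f"{language}_corpus.txt"
    dict_path.write_text("\n".join(chars) + "\n", encoding="utf-8")
    corpus_path.write_text("\n".join(unique_words) + "\n", encoding="utf-8")
    return dict_path, corpus_path
```

Calling `build_dict_and_corpus(text, "xx", Path("out"))` would produce `out/xx_dict.txt` and `out/xx_corpus.txt`. Splitting on whitespace is a simplification; languages written without spaces between words need a proper word segmenter.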

saheya commented 3 years ago

Traditional Mongolian

omar16100 commented 3 years ago

I would love to work on "Bangla"

levanpon98 commented 3 years ago

I would be very happy if you did that for Vietnamese.

HusseinYoussef commented 3 years ago

How about Arabic? That would be great.

Hieung28 commented 3 years ago

I've found that the PaddleOCR algorithm cannot recognize some special characters (such as comma, semicolon, or dot) when the language is English. Is there any way I can fix this problem?
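
One thing that may be worth checking here (a possibility, not a confirmed diagnosis) is whether the missing punctuation appears in the character dictionary used by the recognition model, since a recognizer can only output characters that are listed in its dictionary. The sketch below assumes a dictionary file with one character per line; `missing_characters` is a hypothetical helper, not a PaddleOCR function.

```python
from pathlib import Path
from typing import List


def missing_characters(dict_path: Path, text: str) -> List[str]:
    """Return the non-whitespace characters of `text` absent from the dictionary.

    Assumption: the dictionary file lists one character per line.
    """
    known = {line for line in dict_path.read_text(encoding="utf-8").splitlines() if line}
    return sorted({ch for ch in text if not ch.isspace() and ch not in known})
```

If the helper reports `,`, `;` or `.` as missing for the dictionary in use, no amount of tuning at inference time will make the model emit them; a model trained with a dictionary that includes them would be needed.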

GmGniap commented 3 years ago

I would like to contribute by adding the Burmese language. Is it only necessary to submit two text files, dict and corpus? What further steps do we need to take?

xeron56 commented 3 years ago

Adding "Bangla" will be great for the people in South Asia.

giranntu commented 3 years ago

Adding "Traditional Chinese (zh-TW)" support would be great.

Ru-Van commented 3 years ago

Do you have a pretrained Russian recognition model?

SasiAravind commented 3 years ago

Hi, adding the "Tamil" language would be much appreciated.

Tamil_dict.txt Tamil_corpus.txt

If you need more help, please refer to this issue: https://github.com/JaidedAI/EasyOCR/issues/39

fcakyon commented 3 years ago

I can help with the Turkish language.

krzynio commented 3 years ago

I can help with the Polish language.

xmy0916 commented 3 years ago

@GmGniap Hello, can you provide the corpus file for the Burmese language?

xmy0916 commented 3 years ago

@shahidul56 Hello, can you provide the corpus file for the Bangla language?

azmat21 commented 3 years ago

All models updated on 2021.1.21 fail to download with the following error: { code: "NoSuchKey", message: "The specified key does not exist.", requestId: "aa1bfeff-f572-40aa-8935-6129b1533ed1" }

D-DanielYang commented 3 years ago

All models updated on 2021.1.21 fail to download with the following error: { code: "NoSuchKey", message: "The specified key does not exist.", requestId: "aa1bfeff-f572-40aa-8935-6129b1533ed1" }

Sorry for the invalid links. All of them have been fixed now, so please try again.

redcinelli commented 3 years ago

I would be very happy if you did that for Vietnamese.

#1847 seems to be ongoing.

xmy0916 commented 3 years ago

@redcinelli Thank you very much. The Vietnamese model is in training and will be available soon~

fcakyon commented 3 years ago

model name description model size download Update Date
ch Chinese and English 3.71M inference model / trained model 2020.9.22
cht chinese traditional 5.63M inference model / trained model 2021.1.21
en English 2.56M inference model / trained model 2020.9.22
fr French 2.65M inference model / trained model 2021.9.22
ar Arabic 2.53M inference model / trained model 2021.1.21
xi Spanish 2.53M inference model / trained model 2021.1.21
pu Portuguese 2.63M inference model / trained model 2021.1.21
ru Russia 2.63M inference model / trained model 2021.1.21
ge german 2.65M inference model / trained model 2020.9.22
kr Korean 3.9M inference model / trained model 2020.9.22
jp Japanese 4.23M inference model / trained model 2020.9.22
it Italian 2.53M inference model / trained model 2021.1.21
hi Hindi 2.63M inference model / trained model 2021.1.21
ug Uyghur 2.63M inference model / trained model 2021.1.21
fa Persian 2.63M inference model / trained model 2021.1.21
ur Urdu 2.63M inference model / trained model 2021.1.21
rs Serbian(latin) 2.53M inference model / trained model 2021.1.21
oc Occitan 2.53M inference model / trained model 2021.1.21
mr Marathi 2.63M inference model / trained model 2021.1.21
ne Nepali 2.63M inference model / trained model 2021.1.21
rsc Serbian(cyrillic) 2.63M inference model / trained model 2021.1.21
bg Bulgarian 2.63M inference model / trained model 2021.1.21
uk Ukranian 2.63M inference model / trained model 2021.1.21
be Belarusian 2.63M inference model / trained model 2021.1.21
te Telugu 2.63M inference model / trained model 2021.1.21
ka Kannada 2.63M inference model / trained model 2021.1.21
ta Tamil 2.63M inference model / trained model 2021.1.21
mg Mongolian -- Ongoing
bg Bangla -- Need dict and corpus
vi Vietnamese -- Need dict and corpus
bm Burmese -- Need dict and corpus
tk Turkish -- Need dict and corpus
po polish -- Need dict and corpus
More TBC

@grasswolfs The model name for Turkish should be "tr" instead of "tk"; "tr" is the widely used abbreviation for Turkish.

fcakyon commented 3 years ago

I have also opened a PR for the Turkish dict and corpus: https://github.com/PaddlePaddle/PaddleOCR/pull/1856

tink2123 commented 3 years ago

Thanks @habout632 for adding Southeast Asian languages via #1896

yumeliu commented 3 years ago

Here is a dictionary for Greek. el_dict.txt

alenma04 commented 3 years ago

Hi, do we have a model that detects all English characters along with special characters such as . , " ( )?

Jane-Ding commented 3 years ago

Hi, thank you for the great work! I just wonder whether you will add Traditional Chinese to the general model. Right now, the general model supports Simplified Chinese, English and numbers.

JITESH11989 commented 3 years ago

Hi, can we use text lines longer than 50 characters (max_char_length) for training? After training the rec model with a maximum length of 25 characters and again with 50 characters, I found that the 25-character data gives lower loss and good accuracy, while the 50-character data gives higher loss and lower accuracy. Please find sample Devanagari data below:

train_img/0022_BindiyaKiAathmakatha_Img_300_Org_Page_0001_crop_9.jpg बीत गया । असमय के इस बुढ़ापे की देहली पर बैठी, मौत की
train_img/0022_BindiyaKiAathmakatha_Img_300_Org_Page_0001_crop_10.jpg प्रतीक्षा कर रही हूँ । पर लगाता है उसने भी सबों के साथ-साथ
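
To see how many training labels actually exceed a given length limit, a small check along the following lines can help. It is a sketch under assumptions: each label line is taken to be an image path followed by whitespace and the text, as in the sample above, and `too_long_labels` and `max_len` are hypothetical names, not PaddleOCR settings.

```python
from typing import Iterable, List, Tuple


def too_long_labels(lines: Iterable[str], max_len: int) -> List[Tuple[str, int]]:
    """Return (image_path, text_length) for every label longer than max_len.

    Assumption: each line is "<image_path><whitespace><text>".
    Length is measured in Unicode code points, so a Devanagari consonant
    plus a vowel sign counts as two, not as one visible character.
    """
    over = []
    for line in lines:
        parts = line.strip().split(None, 1)  # split at the first whitespace run
        if len(parts) != 2:
            continue  # skip blank lines and lines without text
        path, text = parts
        if len(text) > max_len:
            over.append((path, len(text)))
    return over
```

Because combining marks are counted separately, Devanagari lines reach a code-point limit sooner than their visible length suggests, which is worth keeping in mind when comparing the 25 and 50 character settings.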

MANISH007700 commented 2 years ago

After downloading the inference and trained models, how can I use them? Can anyone point me to resources or code for testing and evaluating with these models?

Thanks

wuye9036 commented 2 years ago

Is there a plan to develop a unified model that supports recognizing images containing mixed text in multiple languages? Thank you.

ESWZY commented 2 years ago

Traditional Mongolian 👀

thongvhoang commented 2 years ago

mg Mongolian -- Ongoing
bg Bangla -- Need dict and corpus
vi Vietnamese -- Ongoing
bm Burmese -- Need dict and corpus
tr Turkish -- Need corpus
po polish -- Need dict and corpus
More TBC

Hi, thank you for the great work! I am sending you a corpus for Vietnamese; the file is attached below: vietnamese_dict.txt. The file comes from this research: https://github.com/VinAIResearch/dict-guided. You can evaluate on the VinText dataset, a scene text detection dataset for Vietnamese, which can be downloaded from the same GitHub repository. Thank you.

@inproceedings{m_Nguyen-etal-CVPR21,
      author = {Nguyen Nguyen and Thu Nguyen and Vinh Tran and Triet Tran and Thanh Ngo and Thien Nguyen and Minh Hoai},
      title = {Dictionary-guided Scene Text Recognition},
      year = {2021},
      booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition (CVPR)},
}

dynamicguy commented 2 years ago

Please add the Bangla language. Here are the dict and corpus:

dict corpus

tinamore commented 2 years ago

Hi team, please update Vietnamese support. I'm very excited about this project. Thanks very much.

maksudcs commented 2 years ago

@grasswolfs & @xmy0916 I have already shared the dict and corpus files for Bangla; please check. I have also attached them here.

bg_dict.txt bg_corpus.txt

mingjun1120 commented 2 years ago

Is there support for the Malay language? Malay is the main language of Malaysia.

erfaneshrati commented 2 years ago

Suppose we have an image with text in multiple languages. How do you approach this problem? One way is to run the models for all languages and take the most confident result, but this turns out to be very inaccurate because of confidence miscalibration. Can't we train a single recognition model for all languages, or at least for a group of them? I think it would be a very helpful model for applications where we don't know the language beforehand or where an image contains multiple languages.
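
The "run every language and keep the most confident result" approach described above can be sketched as follows. This is an illustration only: the recognizers are hypothetical callables that return a text and a confidence score, standing in for per-language models, and none of the names are PaddleOCR API.

```python
from typing import Callable, Dict, Tuple

# A recognizer takes an image (any object in this sketch) and returns
# (recognized_text, confidence).
Recognizer = Callable[[object], Tuple[str, float]]


def pick_most_confident(image: object, recognizers: Dict[str, Recognizer]) -> Tuple[str, str, float]:
    """Run every per-language recognizer and keep the highest-confidence result."""
    if not recognizers:
        raise ValueError("at least one recognizer is required")
    best_lang, best_text, best_conf = "", "", float("-inf")
    for lang, recognize in recognizers.items():
        text, conf = recognize(image)
        if conf > best_conf:
            best_lang, best_text, best_conf = lang, text, conf
    return best_lang, best_text, best_conf
```

The weakness mentioned in the comment is visible in the selection rule itself: scores from separately trained models are compared directly, so a model that is overconfident on text it cannot read will win over the correct model whenever its score happens to be higher.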

todhm commented 2 years ago

Suppose we have an image with text in multiple languages. How do you approach this problem? One way is to run the models for all languages and take the most confident result, but this turns out to be very inaccurate because of confidence miscalibration. Can't we train a single recognition model for all languages, or at least for a group of them? I think it would be a very helpful model for applications where we don't know the language beforehand or where an image contains multiple languages.

Strong upvote for this opinion.

babraham123 commented 2 years ago

Hi @grasswolfs, thanks so much for all the work you've put in. I've included a PR for the Amharic language, which is spoken by over 60 million people. https://github.com/PaddlePaddle/PaddleOCR/pull/4882

One potential issue is that Amharic words take a number of prefixes and suffixes to indicate the object, number of items, tense, gender, negation and so on. Thus, a single verb may morph in a number of ways that are not all included in the dictionary.

babraham123 commented 2 years ago

Hi @grasswolfs, I also submitted a PR for the Tigrinya language, which is similar to Amharic and spoken by over 10 million people. https://github.com/PaddlePaddle/PaddleOCR/pull/4881

It has the same mutation issue as Amharic. Also, Arabic numerals are commonly used even though the language has its own numeral system.

skoetje commented 2 years ago

Hi @grasswolfs, I've submitted a PR for the Dutch language here: https://github.com/PaddlePaddle/PaddleOCR/pull/5161

kangshilei commented 2 years ago

Hi. Besides the links above, are there newer language models? I see the official description says that more than 80 languages are supported.

ejatjon commented 2 years ago

ug_dict.txt uyghur_corpus.txt

Uyghur recognition is very poor, or the text is not recognized at all. I hope the model can be improved. Thank you very much 🙏

hw-coding commented 2 years ago

Hello, can it recognize Norwegian?

hw-coding commented 2 years ago

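Hello, I hope Norwegian can be supported. I only found the folder from step 1 (ppocr/utils/dict) and could not find the folder from step 2 (ppocr/utils/corpus). I am new to OCR. How do I add these two files?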

Evezerest commented 2 years ago

Hi. Besides the links above, are there newer language models? I see the official description says that more than 80 languages are supported.

All multilingual models can be found here.

JareelSkaj commented 2 years ago

Is there any tutorial on how I can train my own model from my own corpus and sample images?

Evezerest commented 2 years ago

Is there any tutorial on how I can train my own model from my own corpus and sample images?

Thanks for the attention, the multilingual model training tutorial will be released next week!

calibretaliation commented 2 years ago

Hi team, thank you so much for the great work. I'm very excited about the Vietnamese dict, corpus and models. Could you please update the Vietnamese language soon? Again, thank you so much and congrats on the great work.

Bellman281 commented 2 years ago

I can help with these languages: Turkish (tr), Azerbaijani (az), Farsi (fa), Afghani (af).

Bhavin-Prydan commented 2 years ago

How can I use multiple languages, such as English, Urdu, and Tamil, in one PaddleOCR instance with Python?

xuhuaren commented 1 year ago

Please add the Thai language. Much appreciated!

thai_dict.txt

thai_corpus.txt

topatsaya commented 1 year ago

How about Lao characters? I can help.