JaidedAI / EasyOCR

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
https://www.jaided.ai
Apache License 2.0

Arabic Language #25

Closed MohamedAliRashad closed 4 years ago

MohamedAliRashad commented 4 years ago

I would like to help add the Arabic language. What needs to be done?

rkcosmos commented 4 years ago

Hi, thanks for offering to help, and welcome to EasyOCR. To add Arabic, we need 2 additional files.

  1. In the folder easyocr/character, we need 'ar_char.txt', which contains a list of all Arabic characters. Please see the format/examples in the other files in that folder.
  2. In the folder easyocr/dict, we need 'ar.txt', which contains a list of all common Arabic words. Think of words in a dictionary or Arabic words from Wikipedia. Normally, I just scrape a lot of Arabic Wikipedia pages and keep only the words made of Arabic characters.
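
A minimal sketch of the filtering step in item 2, not the script actually used for EasyOCR. The function name `extract_arabic_words` is hypothetical, and treating "Arabic character" as the basic letter range U+0621 to U+064A is an assumption.

```python
import re

# Assumption: an "Arabic word" is a run of basic Arabic letters (U+0621..U+064A).
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")


def extract_arabic_words(text):
    """Return the unique runs of Arabic letters in `text`, in first-seen order."""
    seen = set()
    words = []
    for word in ARABIC_WORD.findall(text):
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


# Mixed Arabic/English input: only the Arabic runs are kept.
print(extract_arabic_words("\u0633\u0648\u0642 test \u0633\u0648\u0642"))
```

Writing one word per line to a file would then give something in the shape of the dictionary files described above.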
MohamedAliRashad commented 4 years ago

#31 could be a good start.

rkcosmos commented 4 years ago

Nice, #31 merged. Can we have more words? As many as possible.

MohamedAliRashad commented 4 years ago

@rkcosmos What is the average number of words needed for a robust OCR?

rkcosmos commented 4 years ago

Other languages have around 30000. Popular languages have around 50000.

mohdimran043 commented 4 years ago

Any update on the Arabic OCR?

MohamedAliRashad commented 4 years ago

@mohdimran043 We still need more words, if you can help.

rkcosmos commented 4 years ago

A question about left-to-right writing and merged character shapes. If I have this image:

image

Is "تتطلب test القارة عمود" what the OCR should read, or did I mess up the order?

MohamedAliRashad commented 4 years ago

Well, first of all, the image is unreadable (except for "test", of course). This is mainly because the characters are ordered left-to-right when they should be ordered right-to-left.

Secondly, the sentence has no actual meaning. It translates to "demanded test pole continent".

And to answer your question: yes, it would be great if the OCR could salvage these words from such an image.

MohamedAliRashad commented 4 years ago

BTW, I managed to get more than 1.2 million Arabic words, but they may need deduplication and I have no time for that right now. Do you want me to add them in a new PR, or wait until I can clean them?

rkcosmos commented 4 years ago

Can you give me one sensible sentence with both Arabic and English words? I want to see whether I can generate the image correctly.

About the word list, I can wait. Is this 1.2 million the file someone posted to the pinned issue? It would be great if you could combine them, but it is also fine if not, because 1.2M is already more than enough.

MohamedAliRashad commented 4 years ago

@rkcosmos GitHub comments are rendered left-to-right, so try to view these examples in a right-to-left editor:

"كنا فى اشد الاحتياج الى test cases خاصة بالمنتج النهائى"
"كتاب Simon الأخير كان رائع"
"لماذا تفعل هذا يا Thomas ?"
"هل نكتب know ام now ?"

Yeah, it was in a comment on the pinned issue. Its only problem is that it includes the individual Arabic characters as example words, and I think that can hurt performance if left as is.

rkcosmos commented 4 years ago

Screenshot from 2020-07-15 15-46-20

Test my understanding again.

The RTL line is what you normally see. Whenever the OCR sees text like the RTL line, it should output the text as in the GitHub comment line (assuming I use a normal LTR text editor). Do I understand correctly?

MohamedAliRashad commented 4 years ago

Yes, the RTL format is correct except for one thing: ? should have been ؟ (but that is my mistake, I didn't use it in the test cases).

AhmedBytesBits commented 4 years ago

Hi @rkcosmos. I am ready to help as well.

rkcosmos commented 4 years ago

Just got additional words from #173. @MohamedAliRashad Can you estimate when the 1.2M words will be cleaned? I just finished the Hindi model and want to train Arabic soon. @ahmedrshdy thanks, we are almost ready.

MohamedAliRashad commented 4 years ago

@rkcosmos The problem is that I am going through exams right now, so my time is limited.

I have seen the PR and think it is a good starting point for initial training. We can continue training when the 1.2M dataset is ready.

rkcosmos commented 4 years ago

@MohamedAliRashad Can you upload that 1.2M file? I may be able to clean it.

MohamedAliRashad commented 4 years ago

@rkcosmos ar.txt

I already removed all English characters from the original dataset. What is left to be done:

  1. Remove the duplicates.
  2. Remove one-character words, as they have no actual meaning and I suspect they will hurt performance.
loayamin commented 4 years ago

Here is the cleaned version of the above 1.2M list with the ar-wiki corpus list. I removed the duplicates, one character words, and diacritics. ar.txt
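
The three cleaning steps mentioned here (duplicates, one-character words, diacritics) can be sketched as follows. This is an illustration only, not the script that produced the attached file; `clean_word_list` is a hypothetical name, and removing every Unicode nonspacing mark (category "Mn") is an assumption about what counts as a diacritic.

```python
import unicodedata


def clean_word_list(words):
    """Strip diacritics, then drop one-character words and duplicates."""
    cleaned = []
    seen = set()
    for word in words:
        # Assumption: Arabic diacritics (tashkeel) are nonspacing marks, category "Mn".
        bare = "".join(ch for ch in word.strip() if unicodedata.category(ch) != "Mn")
        if len(bare) < 2 or bare in seen:
            continue
        seen.add(bare)
        cleaned.append(bare)
    return cleaned
```

Stripping diacritics before deduplicating means that a vocalized and an unvocalized spelling of the same word collapse into one entry.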

MohamedAliRashad commented 4 years ago

@loayamin Thanks for your effort. Please open a PR with the new word dictionary so it can be merged.

rkcosmos commented 4 years ago

I updated the file. I have one more question, about punctuation. According to Wikipedia, you have 2 special punctuation marks: the reversed question mark ⟨؟⟩ and the reversed comma ⟨،⟩. Other than these 2 and what we have for every language,

https://github.com/JaidedAI/EasyOCR/blob/d518dede3e5962ca05166876ede9ebd6a49145cb/easyocr/easyocr.py#L41-L42

are there other marks used commonly for your language?

loayamin commented 4 years ago

The reversed semicolon ⟨؛⟩ and the guillemets marks ⟨« »⟩.

I am not sure about this but I think the diacritics should be either added or ignored. They may cause problems recognizing the characters when they are closely attached to them.

MohamedAliRashad commented 4 years ago

@rkcosmos we also have our own numbering system

 ar_number = '٠١٢٣٤٥٦٧٨٩' 
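
If downstream code needs ASCII digits, for example to compare OCR output with a transcript, the digits listed above can be mapped with a translation table. A minimal sketch; the helper name `to_ascii_digits` is hypothetical and is not part of EasyOCR.

```python
# Arabic-Indic digits U+0660..U+0669, the same ten characters as ar_number above.
ar_number = "".join(chr(0x0660 + i) for i in range(10))
TO_ASCII = str.maketrans(ar_number, "0123456789")


def to_ascii_digits(text):
    """Replace Arabic-Indic digits with ASCII digits; leave other text untouched."""
    return text.translate(TO_ASCII)
```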
rkcosmos commented 4 years ago

These four, لالآلألإ, are in the character list but never appear in the word list. Should I ignore them? And the same for اً in Persian?

MohamedAliRashad commented 4 years ago

@rkcosmos These characters are redundant, as they can be formed by combining أ إ آ ا with ل. We just keep them on our keyboards for convenience.

loayamin commented 4 years ago

No, they should be in the list. The Arabic characters ل and ا have different shapes in most fonts when combined.

rkcosmos commented 4 years ago

Arabic is now supported in v1.1.6. To use the latest version, you need to uninstall first:

    pip uninstall easyocr

and then install directly from the source code:

    pip install git+git://github.com/jaidedai/easyocr.git

Please try it and give feedback.

loayamin commented 4 years ago

I just tried using easyocr on a low contrast scanned page, here is the result (replaced the line breaks with spaces):

Transcript:

سوق الجاءة إلى الجنوب من الجامع الكبير. ابتنى المدرسة جلال الدين ابن محمد بن أبي بكر السيري سنة 815هـ/ 1412م، كما جاء في كتابة على عتب خشبي لباب بيت الصلاة. وقد شهدت هذه المدرسة مجالس العلم التي تولى أمرها عدد من العلماء في علوم اللغة والدين. ولمسجد الجلالية مئذنة بديعة البناء ما تزال عامرة سامقة جميلة، وتزدان بزخارف جميلة، وهي قطعة فنية من المعمار اليمني المتميز. ويرجح أنها بنيت في عهد المهدي العباس في القرن العاشر الهجري/ السادس عشر الميلادي. وتعد القرى المحيطة بمدينة إب امتدادا لها، أهمها (جرافة) حيث كان بها مدرسة وجامع جميل البناء يرجع إلى القرن التاسع الهجري/الخامس عشر الميلادي. وقد بلغت شهرة المدرسة والجامع مبلغا عظيما. كما أن بها سداً يعرف بسد جرافة. ومن القرى التابعة (أبلان) وبها سد للماء أهمل الآن. وصارت كلتا القريتين اليوم حيين في المدينة الواسعة.

EasyOCR:

الجامع إلى الجذوب من ا لحاءة سوق الدين الدرسة جلال ابتنى الكبير السيري أبي بسن 5ه-1 بسن في جداء كم } 1412م) 5 1 8 ه| س نة بيت لباب خشي ع تب على كتا رة مرة المدر وقد شهدت هدء الصلاة . عل د ها أمر العلم التي تولى مجالس الدين * اللغة و علوم العلماء من المنا ء بذيعة لمسجد الجلالية مئذنة نزال عامرة سامقة جميلة وهي قطعة جميلة وتزدان بزخارف المتمز. اليمني المعمار فنية من في ءهد ااهدي ويرجح أنها بنيت الهجري | ثمر في القرن الءا العباس اليلا دي . عصر دس السا', 'المحيطة بمدينة إب القرى وتع د <ءت ( جرافة ) أهمها متدادا ها جميل البناء وجامع كان بها مدرسة ي | اذجر يرجع إلى القرن التاسع الميلادي. وفد بلغت عثر مس الذ ا مباغا عظيما والجامع شهرة الدرسة افة جر كما أن :4 سداً يعرف بسد التابعة (أبلان وبها سد القرى من الآن. وصارت كلتا أهمسل للماء المدينة في حيءن اليوم ريتين القر الوا سعة

Using Kraken with the Arabic Generalized Model:

سوق الجاءة إلى الجموب من الجامع الكبير. ابتنى المدرسة جلال الدين بن محمد بن ابي بكر السيري سنة 15 2ه 1)1م ، كما جاء في كتابة على عتب خشي لباب بيت لصلاة . وقد شهدت هه المدرسة مجالس العلم التي تولى أمرها عدد من العلماء في علوم اللغة والدين. ولمسجد الجلالية مئذنة يديعة البناء مـا تـزال عـامرة سامـقـة جمـيـلـة ، وتزدان بزخارف جميلة ، وهي قطعة فنية من المعمار اليمني المتسميز. زيرجح أتها بنيت في عهد المهدي - لعباس في القرن العاشر امجري / السادس عشر الميلادي . وتعد القرى المحيطة بمدينة إب امتدادا صا ، أهمها (جرافة) حيث كان بها مدرسة وجامع جميل البناء يرجع إلى القرن التاسع الجري / لخامس عثر الميلادي. وقد بلغت شهرة المدرسة والجامع مبلغا عظيما . -ا كما ان بها سدا يعرف بسد جرافة. ومن القرى التابعة (أبلان) وبها سد لمـاء أهما الآن . وصـارت كلتا القريتين اليوم حيي في المديتة الواسعة .

Image used:

text

rkcosmos commented 4 years ago
  1. I just remembered that Arabic is right-to-left, so my paragraph argument is giving the wrong order.
  2. I have no idea how to read this, and I find it hard to compare by eye because the word ordering is very confusing to me. Can you comment on what is happening here? Which part is the most problematic? Can you also try raising the contrast parameters, e.g. contrast_ths=0.4 and adjust_contrast=0.7?
loayamin commented 4 years ago

Yes, the original text is right-to-left.

The resulting EasyOCR text is unfortunately unreadable (~20% readable), and the recognized words seem to be in the wrong order (I am not sure whether this is because of how they are recognized or because they are in the wrong position on each line). For example, سوق is the first word of this text, but in the EasyOCR output it comes in the 7th position.

Here is a comparison between the transcript (green) and the EasyOCR result (red, this time with contrast_ths=0.4 and adjust_contrast=0.9):

image
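
For a rough numeric version of this kind of comparison, word-level similarity can be computed with Python's standard difflib. This is only a sketch and not the tool used for the image above; `word_similarity` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def word_similarity(reference, hypothesis):
    """Ratio in [0, 1] of matching words between two texts, order-sensitive."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    # autojunk=False keeps frequent words from being ignored in long texts.
    return SequenceMatcher(None, ref_words, hyp_words, autojunk=False).ratio()
```

Because the matcher is order-sensitive, output whose words are recognized correctly but emitted in the wrong order also scores low, which matches the ordering problem described in this comment.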

rkcosmos commented 4 years ago

[[[[19, 24], [373, 24], [373, 648], [19, 648]], 'سوق ا لحاءة إلى الجذوب من الجامع الكبير ابتنى الدرسة جلال الدين بسن 5ه-1 بن أي السيري س نة 5 1 8 ه| 1412م) كم } جداء في كتا رة على ع تب حثي لباب بيت الصلاة . وقد شهدت هدء الدر مرة مجالس العلم التي تولى أمر ها عل د من العلماء علوم اللغة و الدين . لمسجد الجلالية مئذنة بديمة المنا نزال عامرة سامقة جميلة وتزدان بزخارف جميلة وهي قطعة فنية من المعمار اليمني المتمز. ويرجح أنها بنيت في عهد الهدي العباس في القرن الءا ائر اهجري | السا دس عصر اليلا دي .'], [[[23, 691], [373, 691], [373, 1140], [23, 1140]], 'وتعد القرى المحيطة بمدينة إب متدادا ها أهمها ( جرافة ) حيت كان ها مدرسة وجامع جميل البناء يرجع إلى القرن التاسع اذجر ي | الذ ا مس عثر الميلادي. وفد بلغت شهرة الدرسة والجامع مباغا عظيما كما أن :4 سداً يعرف بسد جر افة من القرى التابعة (أبلان وبها سد للماء أهمسل الآن. وصارت كلتا القر ريتين اليوم حيءن في المدينة الوا سعة']]

Is this better? I am trying to change the combining logic to RTL.

loayamin commented 4 years ago

Yes, much better.

image

By the way, I'm using this tool for comparison.

rkcosmos commented 4 years ago

OK, the code is updated. You can get the above result by running with detail=1, contrast_ths=0.4, paragraph=True.
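
As a sketch, those settings can be wrapped in a small helper that takes an existing reader object. The helper name `read_paragraphs` and the file name in the usage comment are placeholders; the keyword arguments are the ones quoted in this comment.

```python
def read_paragraphs(reader, image):
    """Call readtext on an existing reader with the settings quoted above."""
    return reader.readtext(image, detail=1, contrast_ths=0.4, paragraph=True)


# Typical use (requires EasyOCR to be installed; 'page.png' is a placeholder path):
#   import easyocr
#   reader = easyocr.Reader(['ar'])
#   for box, text in read_paragraphs(reader, 'page.png'):
#       print(box, text)
```

With paragraph=True, each returned item is a bounding box and the combined text, as in the output pasted earlier in the thread.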

loayamin commented 4 years ago

I edited the original image using Scan Tailor to try EasyOCR on one of my usual use cases. Here. The problems with the resulting text are:

  1. Additional spaces between letters of the same word. For example, الجامع becomes الجا مع with EasyOCR, and the number comes out as 8 1 5.
  2. The common / separator for dates is replaced with the less common ⟨|⟩ symbol.
  3. The Arabic letter و is recognized as the number 9. This only appeared once, though.
  4. Multiple letters are either replaced by others or omitted.
  5. The numbers are in the wrong order.

سوق الجاءة إلى الجذوب من الجا مع الكبير . ابتنى الدرسة جلال الدين ابن مح مد بن أي بكر السبري سنة 5 1 8 ه| 2 1 4 1م كما جاء في كتابة على عتب خشبي لباب بيت الصلاة . وقد شهدت هذء الدرسة مجالس العلم التي توى أمرها عدد من العلماء في علوم اللغة 9 الدين . ولسجد الجلالية مئذنة بديمة البناء ما نزال عامرة سامقة جميلة وتزدان بزخارف جميلة وهي قطعة فنية من المعمار اليمغي استمز . ويرجح أنها بنيت فيءهد الهدي المباس في القرن الما شر اغجري | السا دس عشر اليلا دي . وتعد القمرى الميطة بمدينة إب امتدادا ها أهمها ( جر افة ) حيث كان بها مدرسة وجامع جميل البناء يرجع إى القرن التا سع اشجري / /خا مس عثر الميلادي , وقد بلغت شهرة الدرمة والجامع مباغا عظيما . كما أن 4 سداً بدرف ب سل جر ا فة , ومن القرى التابعة (أبلان وبها سد للماء أهمسل الآنء وصارت كلتا القريتين اليوم حيين فسي المدينة الو سعة .

image

rkcosmos commented 4 years ago

Thanks,

  1. Space problem. There is no easy fix yet; I will have to think about it.

  2.+3. This is about the probability of each character. I will think about how we should fix this systematically.

  4. So far, Arabic is the hardest language to train due to text direction and complex reshaping. One thing that can help in this low-resolution case is setting mag_ratio > 1 (for example mag_ratio=1.5).

  5. This originates from the space problem.

hahmad2008 commented 3 years ago

Hi @rkcosmos, thank you for sharing such a great OCR, which is doing really well for Arabic. I have a question regarding the line separator (newline). I expected to receive lines of words based on what is extracted from the image (as is the case for English, where EasyOCR is able to return lines very well).

However, I received each word as an individual item in the list. I checked the paragraph=True option, which is not what I am looking for.

Is there any way to receive lines as they appear in the image for Arabic OCR?

Thanks

rkcosmos commented 3 years ago

We don't have support for a line separator. You can do that by analyzing the individual items in the result list.
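
One way to do that analysis is sketched below; this is not a built-in EasyOCR feature. It assumes the default result shape of (box, text, confidence) items, where each box is four [x, y] corner points. The name `group_into_lines` and the `y_tolerance` heuristic are hypothetical.

```python
def group_into_lines(results, rtl=True, y_tolerance=0.5):
    """Group (box, text, confidence) items into text lines.

    A box joins the current line when its vertical centre is within
    `y_tolerance` times the height of that line's first box.
    """
    items = []
    for box, text, _conf in results:
        xs = [p[0] for p in box]
        ys = [p[1] for p in box]
        items.append(((min(ys) + max(ys)) / 2, max(ys) - min(ys), min(xs), text))

    lines = []  # each entry: [reference_centre, reference_height, [(x, text), ...]]
    for y_centre, height, x_left, text in sorted(items, key=lambda it: it[0]):
        if lines and abs(y_centre - lines[-1][0]) < y_tolerance * lines[-1][1]:
            lines[-1][2].append((x_left, text))
        else:
            lines.append([y_centre, height, [(x_left, text)]])

    # Right-to-left scripts read from the largest x to the smallest.
    return [
        " ".join(text for _x, text in sorted(words, key=lambda w: w[0], reverse=rtl))
        for _yc, _h, words in lines
    ]
```

The tolerance is a simple heuristic and would likely need tuning for skewed scans or pages with mixed font sizes.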

abdoelsayed2016 commented 2 years ago

@rkcosmos I face some problems when I combine Arabic with English, as in this image: image

import easyocr
from PIL import Image

img = Image.open('data/full_eval/3_0.tif')

# this needs to run only once to load the model into memory
reader = easyocr.Reader(['en','ar'])
result = reader.readtext(img, detail = 0)

print(result)

The result is ['؟٦ ٨٧٨٥٥٣ ٥٣٨٥٣ ٧٥٧٣']. When I remove ar from the Reader call, the output is right. I need Arabic and English because some lines contain both languages.

Do you have any solution for that? Thanks.