Belval / TextRecognitionDataGenerator

A synthetic data generator for text recognition
MIT License

add dicts for many languages #164

Closed rkcosmos closed 4 years ago

rkcosmos commented 4 years ago

Hello,

First of all, thanks a lot for creating this text generator. We used it for the EasyOCR project, and I think it's time we gave a small contribution back to your project. This PR contains dictionaries for 50+ languages. It's not only my effort but rather the work of a community trying to create an OCR system that works for their languages.

Thanks a lot, Rakpong

Belval commented 4 years ago

I didn't know EasyOCR used my project. I am very glad that you got some use out of it!

As for the dictionaries, that's great! Thank you!

GokulNC commented 3 years ago

@rkcosmos Can you please explain whether you made any changes to the trdg code to get it working properly for all non-Latin-based languages? Also, which fonts did you use for each language (if any)?

It'd be really helpful if you could briefly mention that 🙂

Belval commented 3 years ago

@GokulNC you can find fonts for pretty much all languages on Google Fonts: https://fonts.google.com/

Is there a particular language that you are having trouble with?

GokulNC commented 3 years ago

@Belval Yes. For example, I am trying to generate data for Hindi (Devanagari script).

This is the text:

देवतात्मों नति जोहारो पोहना मालिंके

And this is the output: image

My environment:

Any pointers on how to fix this would be great!

Edit:

I checked the source code, and it seems I had to enable the --word_split flag. It worked after that. 👍 Please mention in the README that this flag has to be enabled for abugida scripts (like Indic languages). Thanks.
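A rough illustration of why splitting on words matters for an abugida: the sketch below does not use trdg at all. It assumes that, without --word_split, the text is handed to the renderer one code point at a time, and the helper names (`split_by_code_point`, `split_by_word`, `orphaned_marks`) are made up for this example. Splitting per code point leaves vowel signs and other combining marks on their own, detached from the consonant they belong to, while splitting per word keeps every cluster intact.

```python
import unicodedata


def split_by_code_point(text):
    """Pieces as a per-character renderer would see them (assumed behaviour)."""
    return list(text)


def split_by_word(text):
    """Pieces as a per-word renderer would see them; spaces stay as pieces."""
    pieces = []
    for i, word in enumerate(text.split(" ")):
        if i > 0:
            pieces.append(" ")
        if word:
            pieces.append(word)
    return pieces


def orphaned_marks(pieces):
    """Pieces that start with a combining mark (Unicode category M*), i.e. a
    vowel sign, virama or anusvara cut off from its base consonant."""
    return [p for p in pieces if unicodedata.category(p[0]).startswith("M")]


if __name__ == "__main__":
    sample = "नति जोहारो"  # two words from the Hindi text above
    for name, splitter in [("per code point", split_by_code_point),
                           ("per word", split_by_word)]:
        pieces = splitter(sample)
        print(f"{name}: {len(pieces)} pieces, "
              f"{len(orphaned_marks(pieces))} start with a combining mark")
```

Any piece reported by `orphaned_marks` would be drawn without its base consonant, so the font cannot form the proper cluster or ligature for it.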

iknoorjobs commented 3 years ago

Hi @GokulNC

This does not solve the complete problem. There is an issue with the Hindi matra (मात्रा): it gets displaced in many words. See the examples:

सिलिएट_2 Label: सिलिएट

खरिदवाएँगीं_5 Label: खरिदवाएँगीं

Compare the label with what actually appears in the image. I have tried different Devanagari fonts, but the issue still persists.

But when I try the same word with the same font on Google, it renders fine (see here link)

Any idea @Belval @GokulNC why this is happening? Thanks

iknoorjobs commented 3 years ago

Update: libraqm solves this issue. 🙂
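For anyone landing here with the same symptom, a quick check along these lines may help. It is only a sketch and assumes the images are rendered through Pillow, which can use Raqm for complex text layout when it is built with that support. The function name `raqm_available` is made up for this example; `PIL.features.check("raqm")` is the Pillow call that reports whether Raqm support is present.

```python
def raqm_available():
    """Report whether the installed Pillow can use Raqm for text layout.

    Returns True or False when Pillow is importable, and None when Pillow
    is not installed at all.
    """
    try:
        from PIL import features
    except ImportError:
        return None
    try:
        return bool(features.check("raqm"))
    except ValueError:
        # Very old Pillow releases do not know the "raqm" feature name.
        return False


if __name__ == "__main__":
    status = raqm_available()
    if status is None:
        print("Pillow is not installed")
    elif status:
        print("Raqm support found: complex scripts can be shaped properly")
    else:
        print("No Raqm support: matras may be drawn in the wrong place")
```

If this reports no Raqm support, installing libraqm (and a Pillow build that picks it up) is the fix described in the comment above.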