Training for Arabic - GitHub Issues

m0beenali commented 4 years ago

Hi @Shreeshrii! I've been working with Tesseract for a month now and have been following your work since then. I have some questions about training that I couldn't find answers to. With due respect, here they are:

I've seen your training text files, and they are huge, around 4,000 lines or more, whereas I trained with only about 200 lines of text. How many lines are enough to create a traineddata file?
How long did training take with the ara.minusnew.training_text file, and what machine do you use?
Also, should I use the tesstrain files you provide in this repo (build/tesstrain_multi.sh)? With the default tesstrain.sh, I get Arabic numerals reversed after fine-tuning the tessdata_best/ara.traineddata file.

Shreeshrii commented 4 years ago

For Tesseract 4 (the neural net version), the training text has to be very large; see the comments by Ray Smith in the training wiki.

For some reason, he has not updated the Arabic training_text in the langdata_lstm repo, so it does not reflect the actual text he used.
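To compare a small training text with a larger one, it can help to look at more than the raw line count. The following is a minimal sketch, not part of the Tesseract tooling: `summarize_training_text` is a hypothetical helper that reports the number of non-empty lines, the number of distinct characters, and the least frequent characters, which are the ones most likely to be under-represented during training.

```python
from collections import Counter


def summarize_training_text(text: str, rarest: int = 5) -> dict:
    """Summarize a training text (hypothetical helper, not a Tesseract API).

    Returns the count of non-empty lines, the count of distinct
    non-whitespace characters, and the `rarest` least frequent characters.
    """
    # Ignore blank lines; they add nothing to the training data.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Count every non-whitespace character across all lines.
    counts = Counter(ch for ln in lines for ch in ln if not ch.isspace())
    # Sort by frequency, ascending, to surface under-represented characters.
    least_common = sorted(counts.items(), key=lambda kv: kv[1])[:rarest]
    return {
        "lines": len(lines),
        "distinct_chars": len(counts),
        "rarest": least_common,
    }
```

To run it on a file such as ara.minusnew.training_text, read the file with `open(path, encoding="utf-8").read()` and pass the result to the function.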

I don't have notes or logs from training the Arabic traineddata that I uploaded. But since these were only fine-tuning runs, it would have taken a few hours, or days at most.

Yes, there is a problem with getting Arabic numerals and punctuation correct. They are treated as LTR in an RTL language and have probably not been handled correctly in the code.
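Until the numerals are handled correctly in training, one possible workaround is to fix them in the OCR output. The following is a minimal sketch of that idea, not something proposed in this thread or provided by Tesseract: `reverse_digit_runs` is a hypothetical post-processing helper that reverses each run of digits, assuming every number in the recognized text came out in reversed order.

```python
import re

# A run of digits, optionally joined by '.', ',' or '/' (e.g. 12.5 or 26/11/2019).
# In Python 3, \d also matches Arabic-Indic digits (U+0660 to U+0669).
_DIGIT_RUN = re.compile(r"\d+(?:[.,/]\d+)*")


def reverse_digit_runs(text: str) -> str:
    """Reverse each digit run in `text` (hypothetical helper).

    Assumption: every number in the OCR output is reversed. Applying this
    to output whose numbers are already correct would corrupt them.
    """
    return _DIGIT_RUN.sub(lambda m: m.group(0)[::-1], text)
```

For example, `reverse_digit_runs("abc 9102")` returns `"abc 2019"`. Check a sample of the output by hand before applying this to a whole corpus, because the function cannot tell a reversed number from a correct one.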

I ran these trainings as an experiment (I do not know Arabic). You can clone the repo and run the bash scripts. You may have to set the parameters at the top of each script correctly to run all phases of training.

m0beenali commented 4 years ago

I'll run the scripts and see whether that resolves my issue, and I'll post the outcome so it may help others. Thanks a lot!

Shreeshrii / tessdata_arabic

Training for Arabic #1