Shreeshrii / tessdata_shreetest

finetuned traineddata files for tesseract 4.0.0 for testing

the fonts fas traineddata #10

Open TAQBIBT opened 5 years ago

TAQBIBT commented 5 years ago

Thanks for uploading this trained model - could you possibly provide some info about the training data?

Specifically the fonts used and the text used for fas-script-float

Thanks!

Shreeshrii commented 5 years ago

It has been a while since I ran that training and I don't have the files saved.

Going by the commits in the git repo, i.e.:

https://github.com/Shreeshrii/tessdata_shreetest/commit/67b9593bd6ee010031c8bb42fbb1fcbdd212e05a

https://github.com/Shreeshrii/tessdata_shreetest/commit/c50e3a36ff6c5519c69cd771497a722f9b9c3123

https://github.com/Shreeshrii/tessdata_shreetest/commit/4e706d1df3527902243cb124ec36f18558d508a8

I think it was based on fine-tuning (for impact) the tessdata_best/script/Arabic model. I had added the Arabic comma and other punctuation to the training_text and had not included the English letters [a-zA-Z] in the unicharset. The font used was most probably Arial Unicode MS.
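As a rough illustration of the "no [a-zA-Z]" point above, here is a minimal sketch that reports which ASCII Latin letters occur in a training text, so they can be removed before training. It assumes the unicharset is derived from the characters present in the training text; the function name and the sample string are hypothetical and are not taken from the original training run.

```python
import string

# ASCII Latin letters, i.e. the [a-zA-Z] range mentioned in the comment above.
ASCII_LETTERS = set(string.ascii_letters)


def find_latin_letters(text: str) -> list[str]:
    """Return the sorted, de-duplicated ASCII letters found in `text`.

    Hypothetical helper: an empty result means the text contains no
    [a-zA-Z] characters.
    """
    return sorted(set(text) & ASCII_LETTERS)


if __name__ == "__main__":
    # Hypothetical sample line: Persian/Arabic text with an Arabic comma
    # (U+060C) and a stray Latin word.
    sample = "سلام\u060C دنیا abc"
    print(find_latin_letters(sample))
```

An empty list suggests the text is free of English letters; anything else lists the characters to clean up first.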

Shreeshrii commented 5 years ago

Please see https://github.com/tesseract-ocr/tessdata/issues/70

Possibly I used the fonts recommended on that page: Roya, Nazanin, etc.

anergui commented 5 years ago

Thanks Shreeshrii. I cannot train the Arabic language with the OCRD-train fork that you proposed at this link: https://github.com/Shreeshrii/ocrd-train. Should the tiff and gt.txt files be prepared the same way as for LTR languages, or not? Can I start from one of the traineddata files you have provided, for example fas-script-float?

Sorry for the inconvenience

Shreeshrii commented 5 years ago

fas-script-float is for Persian/Farsi. The numerals for Farsi and Arabic are different. But it is a float model, similar to the tessdata_best models, and can be used as a base for further training.
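For reference, the difference in numerals can be seen at the Unicode level: Arabic text typically uses the Arabic-Indic digits (U+0660 to U+0669), while Persian/Farsi uses the Extended Arabic-Indic digits (U+06F0 to U+06F9). The sketch below only uses Python's standard `unicodedata` module; the variable and function names are illustrative.

```python
import unicodedata

# Arabic-Indic digits (commonly used for Arabic): U+0660..U+0669
ARABIC_DIGITS = [chr(cp) for cp in range(0x0660, 0x066A)]
# Extended Arabic-Indic digits (used for Persian/Farsi): U+06F0..U+06F9
PERSIAN_DIGITS = [chr(cp) for cp in range(0x06F0, 0x06FA)]


def describe(digits):
    """Return (code point, Unicode name, numeric value) for each digit."""
    return [
        (f"U+{ord(d):04X}", unicodedata.name(d), unicodedata.digit(d))
        for d in digits
    ]


if __name__ == "__main__":
    # Print the two digit sets side by side.
    for arabic, persian in zip(describe(ARABIC_DIGITS), describe(PERSIAN_DIGITS)):
        print(arabic, persian)
```

Both sets carry the same numeric values but are distinct code points, so a model trained on one set does not automatically cover the other.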

Regarding ocrd-train, I only have a fork of the project, with a suggested change to the Makefile to use the 'wordstrbox' option for creating box files for complex scripts.

However, I have not personally tried it for Arabic, as I do not know the language/script, so it is difficult for me to ascertain that it is working correctly.