Open TAQBIBT opened 5 years ago
It has been a while since I ran that training and I don't have the files saved.
Going by the commits in the git repo, i.e.
https://github.com/Shreeshrii/tessdata_shreetest/commit/67b9593bd6ee010031c8bb42fbb1fcbdd212e05a
https://github.com/Shreeshrii/tessdata_shreetest/commit/c50e3a36ff6c5519c69cd771497a722f9b9c3123
https://github.com/Shreeshrii/tessdata_shreetest/commit/4e706d1df3527902243cb124ec36f18558d508a8
I think it was based on fine-tuning (for impact) the tessdata_best/script/Arabic model. I had added the Arabic comma and other punctuation to the training_text and had not included the English letters [a-zA-Z] in the unicharset. The font used was most probably Arial Unicode MS.
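As a rough illustration of the idea of keeping [a-zA-Z] out of the character set, here is a minimal Python sketch. It is not the script that was used for the original training (those files are not saved); the function names are hypothetical, and it only shows one way to strip ASCII letters from training text and list the characters that remain.

```python
import re

# ASCII letters that should stay out of the character set.
LATIN_LETTERS = re.compile(r"[a-zA-Z]")


def strip_latin_letters(line: str) -> str:
    """Remove ASCII letters from a line and collapse leftover whitespace."""
    cleaned = LATIN_LETTERS.sub("", line)
    return " ".join(cleaned.split())


def charset(lines):
    """Return the sorted set of non-space characters found in the lines."""
    return sorted({ch for line in lines for ch in line if not ch.isspace()})


if __name__ == "__main__":
    # Hypothetical sample line mixing Latin and Arabic text with an Arabic comma.
    sample = ["OCR \u0646\u0635\u060c \u0639\u0631\u0628\u064a test"]
    filtered = [strip_latin_letters(line) for line in sample]
    print(filtered)
    print([f"U+{ord(ch):04X}" for ch in charset(filtered)])
```

Running a check like this over the training_text before training makes it easy to confirm that the Arabic comma (U+060C) is present and that no English letters slipped in.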
Please see https://github.com/tesseract-ocr/tessdata/issues/70
Possibly I used the fonts recommended on that page - Roya, Nazanin etc.
Thanks Shreeshrii. I cannot train the Arabic language with the OCRD-train that you proposed at this link: https://github.com/Shreeshrii/ocrd-train. Should the tiff and gt.txt files be prepared the same way as for LTR languages or not? Can I start from one of the traineddata files that you have proposed, for example fas-script-float?
Sorry for the inconvenience
fas-script-float is for Persian/Farsi. The numerals for Farsi and Arabic are different. But it is a float model, similar to those in tessdata_best, and can be used as a base for further training.
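To make the numeral difference concrete: Arabic text normally uses the Arabic-Indic digits U+0660 to U+0669, while Persian uses the Extended Arabic-Indic digits U+06F0 to U+06F9. The Python sketch below counts both sets in a line of ground truth and maps Persian digits to their Arabic counterparts. The helper names are made up for illustration and are not part of Tesseract or ocrd-train.

```python
# Arabic-Indic digits (U+0660..U+0669), used in Arabic text.
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))
# Extended Arabic-Indic digits (U+06F0..U+06F9), used in Persian/Farsi text.
PERSIAN = "".join(chr(0x06F0 + i) for i in range(10))

# Translation table from Persian digits to Arabic-Indic digits.
PERSIAN_TO_ARABIC = str.maketrans(PERSIAN, ARABIC_INDIC)


def count_digit_sets(text: str) -> dict:
    """Count how many digits of each set appear in the text."""
    return {
        "arabic_indic": sum(ch in ARABIC_INDIC for ch in text),
        "persian": sum(ch in PERSIAN for ch in text),
    }


def to_arabic_digits(text: str) -> str:
    """Replace Persian digits with the matching Arabic-Indic digits."""
    return text.translate(PERSIAN_TO_ARABIC)


if __name__ == "__main__":
    line = "\u06f1\u06f2\u06f3"  # Persian digits 1, 2, 3
    print(count_digit_sets(line))
    print(to_arabic_digits(line))
```

A count like this on the ground truth shows which digit set a base model would need to learn during further training.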
Regarding ocrd-train, I only have a fork of the project, with a suggested change to the Makefile to use the 'wordstrbox' option for creating box files for complex scripts.
However, I have not personally tried it for Arabic, as I do not know the language/script, and so it is difficult for me to ascertain that it is working correctly.
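Independently of the script direction, one thing that can be checked before running the Makefile is that every line image has a matching transcription. The sketch below assumes the `name.tif` / `name.gt.txt` naming convention for ground-truth pairs; the folder path and the function name are hypothetical, so adjust them to your setup.

```python
from pathlib import Path


def find_unpaired(filenames):
    """Return (images without a .gt.txt, .gt.txt files without an image)."""
    images = {n[: -len(".tif")] for n in filenames if n.endswith(".tif")}
    texts = {n[: -len(".gt.txt")] for n in filenames if n.endswith(".gt.txt")}
    return sorted(images - texts), sorted(texts - images)


if __name__ == "__main__":
    folder = Path("data/ground-truth")  # hypothetical location of the pairs
    if folder.is_dir():
        names = [p.name for p in folder.iterdir() if p.is_file()]
        missing_text, missing_image = find_unpaired(names)
        print("images without transcription:", missing_text)
        print("transcriptions without image:", missing_image)
```

This only verifies the pairing of files; it says nothing about whether the text inside each gt.txt is in the order the training tools expect for a right-to-left script.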
Thanks for uploading this trained model. Could you possibly provide some info about the training data?
Specifically, the fonts and the text used for fas-script-float.
Thanks!