Harryoung / Test


Best practice for training an Arabic recognition model #8

Open Harryoung opened 10 months ago

Harryoung commented 10 months ago

Suppose I want to train an Arabic recognition model. What is the best practice for creating a customized Arabic dictionary?

Several things make this challenging:

1. Arabic letters change shape depending on their position in the word. For example, the letter alif has 4 forms, and each one has its own Unicode code point. Should I include every possible shape in the dictionary, or only the single base letter from the alphabet?
2. Follow-up on 1: if I include only the single base letter, how is the model trained so that it can recognize the different shapes of that letter? It sounds like a one-to-many mapping. Can the model handle that?
3. Arabic is cursive, which means that adjacent letters merge when they are joined; this is called a ligature. How can I take this into account when creating the dictionary?
4. In what order does PaddleOCR read text during recognition? Arabic is a right-to-left language, so if PaddleOCR reads text from left to right, should I be concerned, and are there any files I need to change?
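To make questions 1 to 3 concrete, here is a minimal sketch of the "single base letter" option. It uses only Python's standard `unicodedata` module, not any PaddleOCR API. It assumes the training labels may contain characters from the Arabic Presentation Forms blocks and that the dictionary is simply the set of distinct characters left after normalization. The helper names `to_base_letters` and `build_dictionary` and the sample labels are hypothetical. Beh is used for the demonstration because Unicode encodes all four of its positional forms separately.

```python
import unicodedata


def to_base_letters(text: str) -> str:
    """Fold Arabic presentation forms into base letters via NFKC.

    NFKC also decomposes presentation-form ligatures (e.g. lam-alef)
    into their component letters. Note that it rewrites other
    compatibility characters too, so check it suits your data.
    """
    return unicodedata.normalize("NFKC", text)


def build_dictionary(labels):
    """Return the sorted distinct characters of the normalized labels.

    Hypothetical helper: one entry per base character.
    """
    chars = set()
    for label in labels:
        chars.update(to_base_letters(label))
    return sorted(chars)


# The four positional forms of beh: isolated, final, initial, medial.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

for form in BEH_FORMS:
    base = to_base_letters(form)
    print(unicodedata.name(form), "->", unicodedata.name(base))

# Hypothetical labels: the same word written with presentation forms
# and with base letters. Both reduce to the same two dictionary entries.
labels = ["\uFE91\uFE8E\uFE8F", "\u0628\u0627\u0628"]
print(build_dictionary(labels))
```

With this kind of normalization, the four shapes in the image all map to one label character, so the dictionary stays small. Whether that works better for a given model than listing every shape is exactly what the questions above ask, and the sketch does not settle it.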