Opus-MT shows great potential with the Arabic language, though I have noticed an issue with the Arabic model caused by the removal of Arabic diacritics during preprocessing.
Arabic diacritics carry several kinds of information, such as "who" (the subject), "gender", "command", "time" (tense), and more.
Removing the diacritics, whether before training or before prediction, strips much of this information from the sentence. It can create conflicts between feature representations or even destroy the meaning.
who example:
ذهبتُ إلى المدرسة
I went to the school
who+gender example:
ذهبتْ إلى المدرسة
she went to the school
command + gender "male" + time "future" example:
إفتعِل شيئاً
you do something
gender "male" + time "past" example:
إفتعَلَ شيئاً
he did something
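To make the collision concrete, here is a minimal sketch of what diacritic stripping does to the examples above. The function name `strip_diacritics` and the code point range U+064B to U+0652 are my assumptions for illustration; I have not checked which marks the actual Opus-MT preprocessing removes.

```python
import re

# Arabic short-vowel, tanween, shadda and sukun marks: U+064B..U+0652.
# Assumption: this is the set of marks a "remove diacritics" step would drop.
HARAKAT = re.compile("[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    """Remove the marks above and leave the base letters untouched."""
    return HARAKAT.sub("", text)

FATHA, DAMMA, KASRA, SUKUN = "\u064E", "\u064F", "\u0650", "\u0652"

# "I went" vs "she went": identical letters, one different final mark.
i_went = "ذهبت" + DAMMA
she_went = "ذهبت" + SUKUN

# "do (command)" vs "he did": identical letters, different vowel marks.
base = "إفتع"
do_command = base + KASRA + "ل"
he_did = base + FATHA + "ل" + FATHA

print(i_went == she_went)                                      # False
print(strip_diacritics(i_went) == strip_diacritics(she_went))  # True
print(do_command == he_did)                                    # False
print(strip_diacritics(do_command) == strip_diacritics(he_did))  # True
```

After stripping, each pair becomes a single string, so nothing downstream of preprocessing can tell the two meanings apart.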
How removing the diacritics creates conflicts between feature representations, or even destroys the meaning:
Simply put, during training the model tries to build an understanding of the words: their correlations, importance, ordering, dimensions, and so on.
The model learns a word by seeing it in many different sentences. When the Arabic diacritics are removed, the meaning changes and a lot of information is lost (time "past, present, future", "gender", "command", and more), so forms that mean different things now look identical. The model then builds its representations of meaning and correlation on wrong assumptions. As a result, words that should be far apart end up close together, sentences are structured wrongly, and the meaning is "lost in translation".
A single diacritic can carry all of this information within one word!
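As a rough illustration of the training-side conflict, the sketch below groups two of the example sentences by their source side, with and without diacritics. The helper `targets_by_source` and the tiny `pairs` list are hypothetical and only meant to show the effect, not how Opus-MT data is actually processed.

```python
import re
from collections import defaultdict

# Assumption: preprocessing drops the marks in U+064B..U+0652.
HARAKAT = re.compile("[\u064B-\u0652]")

# Two training pairs that differ by one mark on the verb (damma vs sukun).
pairs = [
    ("ذهبت\u064F إلى المدرسة", "I went to the school"),
    ("ذهبت\u0652 إلى المدرسة", "she went to the school"),
]

def targets_by_source(pairs, strip):
    """Map each source sentence to the set of translations it is paired with."""
    grouped = defaultdict(set)
    for src, tgt in pairs:
        key = HARAKAT.sub("", src) if strip else src
        grouped[key].add(tgt)
    return dict(grouped)

with_marks = targets_by_source(pairs, strip=False)
without_marks = targets_by_source(pairs, strip=True)

print(len(with_marks))     # 2 distinct sources, one translation each
print(len(without_marks))  # 1 source, now paired with two different translations
```

With the marks kept, each source has exactly one target. Once they are stripped, the same input string is paired with both "I went" and "she went", which is the kind of conflicting signal described above.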