Helsinki-NLP / Opus-MT

Open neural machine translation models and web services
MIT License

Arabic Diacritics "the meaning is lost in translation" #41

Closed seekingdeep closed 3 years ago

seekingdeep commented 3 years ago

Opus-MT shows great potential with the Arabic language, but I have noticed an issue with the Arabic model caused by the removal of Arabic diacritics during preprocessing. Arabic diacritics carry various kinds of information, such as "who", "gender", "command", "time", and much more. Removing them, whether before training or before prediction, strips a lot of information and meaning from the sentence. It can create conflicts between the feature representations or even destroy the meaning altogether.

who example: ذهبتُ إلى المدرسة I went to the school

who+gender example: ذهبتْ إلى المدرسة she went to the school

command + gender "male" + time "future" example: إفتعِل شيئاً you do something

gender "male" + time "past" example: إفتعَلَ شيئاً he did something
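
To make the first two examples concrete, here is a minimal sketch of what diacritic stripping does to them. It assumes the preprocessing simply deletes the code points U+064B to U+0652 (tanween, fatha, damma, kasra, shadda, sukun). The function name `strip_diacritics` is hypothetical and this is not the actual Opus-MT preprocessing code, only an illustration of the effect.

```python
import re

# Assumption: "removing diacritics" means deleting U+064B..U+0652
# (tanween, fatha, damma, kasra, shadda, sukun).
HARAKAT = re.compile("[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    """Hypothetical preprocessing step that deletes Arabic diacritics."""
    return HARAKAT.sub("", text)

DAMMA = "\u064F"  # final damma: "I went"
SUKUN = "\u0652"  # final sukun: "she went"

i_went = "ذهبت" + DAMMA
she_went = "ذهبت" + SUKUN

# The two verbs are different strings before stripping...
print(i_went != she_went)
# ...and the same string afterwards, so the model can no longer tell them apart.
print(strip_diacritics(i_went) == strip_diacritics(she_went))
```

Both comparisons hold, which means "I went" and "she went" reach the model as one and the same token sequence once the marks are removed.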

How does the removal of the diacritics create conflicts between the feature representations, or even destroy the meaning? Simply put, when the model is trained, it tries to build an understanding of the words: their correlations, importance, sequences, dimensions, and so on. It does this by seeing the same word in many different sentences. The issue is that removing the Arabic diacritics changes the meaning and removes a lot of information: time (past, present, future), gender, command, and much more. Words that differ only in their diacritics become the same string, so the model builds its basis for meaning and correlation on wrong concepts. As a result, words that are supposed to be far from each other end up closer together, sentences are structured wrongly, and the meaning is "lost in translation". A single diacritic can convey a lot of information within one word.