CAMeL-Lab / camel_tools

A suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi.
MIT License

[QUESTION] Handling of Dagger Alef in MSA and CA #114

Closed apbernhard closed 1 year ago

apbernhard commented 1 year ago

Dear CAMeL Tools team,

First of all, thanks for providing this very useful toolset.

Context: I'm working on tafsirs, where Quranic orthography stands next to text in MSA orthography. I'd like to use CAMeL Tools for POS tagging, lemmatization, and root identification.

Problem: Dagger alef is not covered by normalize_alef_ar and is treated by dediac_ar like any other tashkeel. It is therefore stripped rather than converted to the MSA spelling that would allow MLEDisambiguator to identify the root correctly.

Example: ٱلسَّمَٰوَٰتِ. The normalization produces السموت instead of السموات, so the root س.م.و cannot be identified (س.م.ت is given instead).
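The السموت result can be reproduced without CAMeL Tools. The snippet below is only a standard-library stand-in for the dediacritization step, not the library's own code: it assumes that dropping every Unicode nonspacing mark and mapping alef wasla to plain alef approximates what dediac_ar followed by normalize_alef_ar does to this word. The names `strip_marks` and `WORD` are made up for the illustration.

```python
import unicodedata

DAGGER_ALEF = "\u0670"  # ARABIC LETTER SUPERSCRIPT ALEF
ALEF_WASLA = "\u0671"   # ٱ
ALEF = "\u0627"         # ا

# ٱلسَّمَٰوَٰتِ written out code point by code point to avoid encoding ambiguity
WORD = ("\u0671\u0644\u0633\u0651\u064E\u0645\u064E\u0670"
        "\u0648\u064E\u0670\u062A\u0650")


def strip_marks(text):
    """Drop every nonspacing mark (category Mn); dagger alef is one of them."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Dagger alef is classified as a combining mark, like fatha or shadda
print(unicodedata.category(DAGGER_ALEF))  # Mn

# Stripping marks removes both dagger alefs, leaving no alef after م or و
bare = strip_marks(WORD).replace(ALEF_WASLA, ALEF)
print(bare)  # السموت
```

Because the dagger alef is removed together with the other marks, the long vowel it encodes leaves no trace in the bare form.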

Still, in other cases where MSA has retained the orthographic particularity (like هٰذا or أولٰئك), it shouldn't be converted, as MLEDisambiguator expects that input.

Do you have an idea how I could tackle this issue? Thank you very much in advance.

System:
OS Windows 10
Python 3.9.4
CAMeL Tools 1.5.2

slkh commented 1 year ago

Thanks for your question!

Dagger alef is considered a diacritic in Arabic orthography and is modeled as such, so it is handled by the dediacritization utility. Currently, there is no utility that normalizes Quranic orthography into standard orthography. There is more to it than dagger alef (e.g. الصلوة, رحمت, ءالاء), and it is an involved process. All the resources used in CAMeL Tools deal with standard spelling, including the fossilized spellings of the forms you mentioned and others such as رحمن and إسحق.
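For the dagger alef case alone, one possible preprocessing step is sketched below. It is not part of CAMeL Tools and is an assumption-laden sketch, not a Quranic-to-MSA normalizer. The function `expand_dagger_alef` and the exception set `KEEP_AS_IS` are hypothetical names, and the set lists only the fossilized forms named in this thread, so it is far from complete. Prefixed forms and the other Quranic spellings mentioned above are not handled. Note also that a blanket replacement turns ٱلسَّمَٰوَٰتِ into the fully spelled-out السماوات rather than السموات.

```python
import re

DAGGER_ALEF = "\u0670"  # ARABIC LETTER SUPERSCRIPT ALEF
ALEF = "\u0627"

# Tashkeel from fathatan to sukun (U+064B-U+0652) plus dagger alef
_DIACRITICS_RE = re.compile("[\u064B-\u0652\u0670]")

# Hypothetical, incomplete list of words whose standard spelling omits the alef
KEEP_AS_IS = {
    "\u0647\u0630\u0627",              # هذا
    "\u0623\u0648\u0644\u0626\u0643",  # أولئك
    "\u0631\u062D\u0645\u0646",        # رحمن
    "\u0625\u0633\u062D\u0642",        # إسحق
}


def expand_dagger_alef(token, keep=KEEP_AS_IS):
    """Replace dagger alef with a full alef unless the bare word is an exception."""
    bare = _DIACRITICS_RE.sub("", token)
    if bare in keep:
        # Fossilized spelling: leave the token alone so that later
        # dediacritization yields the form the analyzer already knows.
        return token
    return token.replace(DAGGER_ALEF, ALEF)
```

The idea would be to apply this to each token before calling dediac_ar, so that the long vowel survives dediacritization. Whether the resulting spelling is analyzed as intended still depends on the entries in the morphological database.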

It is also worth noting that roots, like other lexical features, are not generated on the fly; all features are predefined in the morphological analyzer database. In the example you gave, you got the root س.م.ت because the database contains an entry for the lemma سَمْت, which has the plural سُمُوت and the root س.م.ت.

For more accurate tagging results in general, I suggest using BERTUnfactoredDisambiguator if you have the resources.