apbernhard closed 1 year ago
Thanks for your question!
Dagger alef is considered a diacritic in Arabic orthography and is modeled as such, so it is handled by the dediacritization utility. Currently, there is no utility that normalizes Quranic orthography into standard orthography. There is more to it than just dagger alef, e.g. الصلوة, رحمت, ءالاء, etc. This is an involved process.
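As a rough illustration of a pre-processing step for the dagger alef case only, here is a minimal sketch in plain Python. It is not part of CAMeL Tools. The function names and the KEEP_DEFECTIVE allowlist are hypothetical, and the allowlist is deliberately incomplete. The sketch expands dagger alef to a full alef before stripping diacritics, unless the bare word is on the allowlist of spellings that standard orthography keeps without the alef. For ٱلسَّمَٰوَٰتِ it yields the fully expanded السماوات. It does not handle forms such as الصلوة or رحمت, and it will miss allowlisted words that carry clitics.

```python
import re

DAGGER_ALEF = "\u0670"  # superscript (dagger) alef
ALEF = "\u0627"         # plain alef
ALEF_WASLA = "\u0671"   # alef wasla, common in Quranic text

# Short vowels, tanween, shadda, sukun (U+064B..U+0652) plus dagger alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Hypothetical, incomplete allowlist of bare forms whose standard spelling
# omits the alef, so dagger alef must NOT be expanded in them.
KEEP_DEFECTIVE = {
    "\u0647\u0630\u0627",              # هذا
    "\u0647\u0630\u0647",              # هذه
    "\u0630\u0644\u0643",              # ذلك
    "\u0623\u0648\u0644\u0626\u0643",  # أولئك
    "\u0644\u0643\u0646",              # لكن
    "\u0627\u0644\u0644\u0647",        # الله
}


def bare(word):
    """Strip diacritics and map alef wasla to plain alef."""
    return DIACRITICS.sub("", word).replace(ALEF_WASLA, ALEF)


def normalize_word(word):
    """Expand dagger alef unless the bare word is allowlisted, then strip."""
    if bare(word) not in KEEP_DEFECTIVE:
        word = word.replace(DAGGER_ALEF, ALEF)
    return bare(word)


def normalize_text(text):
    """Apply normalize_word to each whitespace-separated token."""
    return " ".join(normalize_word(w) for w in text.split())
```

The output of normalize_text could then be passed to a disambiguator in place of the dediacritized text. Whether the expanded spelling matches an entry in the analyzer database still has to be checked word by word.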
All our resources that are used in CAMeL Tools deal with standard spelling, including the fossilized spellings of the forms you mentioned and others like رحمن and إسحق.
It is also worth noting that roots, among other lexical features, are not generated on the fly; all features are predefined in the morphological analyzer database. In the example you gave, you got the root س.م.ت because there exists an entry in the database for the lemma سَمْت, which happens to have the plural سُمُوت and the root س.م.ت.
For more accurate tagging results in general, I suggest using BERTUnfactoredDisambiguator if you have the resources.
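A minimal sketch of how the roots could be read from a disambiguator's output follows. The helper names top_roots and load_bert_disambiguator are hypothetical. The sketch assumes the disambiguate method returns one item per token with a word field and a list of scored analyses, each holding an analysis dictionary with a "root" key. Loading the pretrained model requires the camel-tools package and its downloaded data.

```python
def top_roots(disambiguator, tokens):
    """Return (token, root) pairs from the top-scoring analysis of each token.

    `disambiguator` is any object exposing a CAMeL Tools style
    `disambiguate(tokens)` method; tokens without an analysis get None.
    """
    pairs = []
    for dw in disambiguator.disambiguate(tokens):
        root = dw.analyses[0].analysis.get("root") if dw.analyses else None
        pairs.append((dw.word, root))
    return pairs


def load_bert_disambiguator():
    """Load the pretrained BERT disambiguator (needs camel-tools and its data)."""
    # Imported lazily so this module can be loaded without camel-tools.
    from camel_tools.disambig.bert import BERTUnfactoredDisambiguator
    return BERTUnfactoredDisambiguator.pretrained()


# Usage, assuming the model data is installed and `sentence` is already
# normalized to standard orthography:
#   bert = load_bert_disambiguator()
#   for token, root in top_roots(bert, sentence.split()):
#       print(token, root)
```

Because top_roots only relies on the disambiguate interface, the same helper should also work with MLEDisambiguator, which makes it easy to compare the two on the same input.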
Dear CAMeL Tools team,
First of all, thanks for providing this very useful toolset.
Context: I'm working on tafsirs. Quranic orthography stands next to text in MSA orthography. I'd like to use CAMeL Tools for POS tagging, lemmatization and root identification.
Problem: Dagger alef is not covered by normalize_alef_ar and is treated by dediac_ar like any other tashkeel. Therefore it's not transposed to the orthographic variant that would allow MLEDisambiguator to properly identify its root etc.
Example: ٱلسَّمَٰوَٰتِ: the normalization produces السموت instead of السموات, and therefore the root س.م.و cannot be identified properly (instead, س.م.ت is given).
Still, in other cases where the orthographic particularity has been retained in MSA (like هٰذا or أولٰئك) it shouldn't be transposed, as the MLEDisambiguator expects that input.
Do you have an idea how I could tackle this issue? Thank you very much in advance.