CAMeL-Lab / camel_tools

A suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi.
MIT License

MA for verbs broken (only active voice + unknown mood) #102

Closed by mirkovogel 1 year ago

mirkovogel commented 2 years ago

In several setups (see below), the morphological analysis (MA) of verbs is broken: only a single analysis with vox=a and mod=u is generated, and the analyses for passive voice and for indicative, subjunctive, and jussive mood are missing. (I wonder which code base the CALIMA Star analyzer web interface at https://calimastar.abudhabi.nyu.edu/analyzer/ uses, because it displays six analyses, as expected.)

Example

For the input "يوظف", the following single analysis is returned:

{'diac': 'يُوَظِّف', 'lex': 'وَظَّف', 'bw': 'يُ/IV3MS+وَظِّف/IV', 'gloss': 'he;it+hire;employ', 'pos': 'verb', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': '3', 'asp': 'i', 'vox': 'a', 'mod': 'u', 'stt': 'na', 'cas': 'na', 'enc0': '0', 'rat': 'n', 'source': 'lex', 'form_gen': 'm', 'form_num': 's', 'd3seg': 'يُوَظِّف', 'caphi': 'y_u_w_a_dh._dh._i_f', 'd1tok': 'يُوَظِّف', 'd2tok': 'يُوَظِّف', 'pos_logprob': -1.023208, 'd3tok': 'يُوَظِّف', 'd2seg': 'يُوَظِّف', 'pos_lex_logprob': -5.22446, 'num': 's', 'ud': 'VERB', 'gen': 'm', 'catib6': 'VRB', 'root': '#.ظ.ف', 'bwtok': 'يُ+_وَظِّف', 'pattern': 'يُوَ2ِّ3', 'lex_logprob': -5.22446, 'atbtok': 'يُوَظِّف', 'atbseg': 'يُوَظِّف', 'd1seg': 'يُوَظِّف', 'stem': 'وَظِّف', 'stemgloss': 'hire;employ', 'stemcat': 'IV_yu'} 

Test setup

I got this behavior with the following setups:

from camel_tools.morphology.database import MorphologyDB
from camel_tools.morphology.analyzer import Analyzer

# First, we need to load a morphological database.
# Here, we load the default database which is used for analyzing
# Modern Standard Arabic. 
db = MorphologyDB.builtin_db()

analyzer = Analyzer(db)

analyses = analyzer.analyze('يوظف')

for analysis in analyses:
    print(analysis, '\n')
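To make the gap easier to see than in the raw dictionaries, the voice and mood features of the returned analyses can be tallied. The following is only a sketch: `voice_mood_counts` is a hypothetical helper, `observed` is a trimmed copy of the single analysis quoted above rather than live analyzer output, and the assumption that the six expected analyses are the two voices (`a`, `p`) crossed with the three moods (`i`, `s`, `j`) is mine.

```python
from collections import Counter
from itertools import product

# Trimmed copy of the single analysis quoted in this issue; in a real run
# this list would be the result of analyzer.analyze('يوظف').
observed = [
    {'diac': 'يُوَظِّف', 'lex': 'وَظَّف', 'pos': 'verb',
     'asp': 'i', 'vox': 'a', 'mod': 'u'},
]


def voice_mood_counts(analyses):
    """Count verb analyses per (vox, mod) pair (hypothetical helper)."""
    return Counter(
        (a.get('vox'), a.get('mod'))
        for a in analyses
        if a.get('pos') == 'verb'
    )


# Assumption: the six expected analyses are active/passive voice
# crossed with indicative/subjunctive/jussive mood.
expected = set(product(('a', 'p'), ('i', 's', 'j')))

counts = voice_mood_counts(observed)
missing = sorted(expected - set(counts))

print('observed (vox, mod) pairs:', dict(counts))
print('missing (vox, mod) pairs: ', missing)
```

With the analysis quoted above, the only observed pair is active voice with unknown mood, so all six assumed combinations are reported as missing.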

owo commented 1 year ago

Unfortunately, CALIMA Star uses a database that is not open source. If you have access to SAMA 3.1, I can provide instructions on how to get access to the database.