CAMeL-Lab / camel_tools

A suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi.
MIT License

[BUG] When running the tagger, some words are missing features like 'lex' and 'diac' for lev and glf pretrained models #142

Open fadhleryani opened 7 months ago

fadhleryani commented 7 months ago

For lev, for a word like 'حقوق', for example, when I run the following:

BERTUnfactoredDisambiguator.pretrained(model_name='lev').tag_sentence('يشسي'.split())

it returns:

[{'pos': 'noun',
  'prc3': '0',
  'prc2': '0',
  'prc1': '0',
  'prc0': 'Al_det',
  'per': 'na',
  'asp': 'na',
  'vox': 'na',
  'mod': 'no',
  'form_gen': 'm',
  'form_num': 's',
  'stt': 'no',
  'cas': 'no',
  'enc0': '0',
  'enc1': '0',
  'enc2': '0'}]

Even for words with no analysis, the expected behavior is to back off to the original word, so this is definitely a bug, right?

For glf, try the word 'شئ' and you'll get something without lex and diac.
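
To make the problem easier to spot across a whole sentence, here is a minimal sketch that scans the tagger output for entries lacking 'lex' or 'diac', and optionally fills them with the surface word. Both helpers (`find_missing`, `fill_with_backoff`) are hypothetical and not part of camel_tools, and filling with the surface word is only my assumption about what the back-off should look like. The sample dictionary is a shortened stand-in for the output above, so the snippet runs without downloading any model.

```python
REQUIRED = ("lex", "diac")  # the features reported missing in this issue


def find_missing(words, analyses, required=REQUIRED):
    """Return (word, missing_keys) pairs for analyses lacking a required feature."""
    report = []
    for word, analysis in zip(words, analyses):
        missing = [key for key in required if key not in analysis]
        if missing:
            report.append((word, missing))
    return report


def fill_with_backoff(words, analyses, required=REQUIRED):
    """Hypothetical workaround: copy each analysis and set any missing
    required feature to the surface word (assumed back-off behavior)."""
    filled = []
    for word, analysis in zip(words, analyses):
        patched = dict(analysis)  # copy so the tagger output is left untouched
        for key in required:
            patched.setdefault(key, word)
        filled.append(patched)
    return filled


# Shortened stand-in for the tagger output shown above (not real model output).
words = ["يشسي"]
analyses = [{"pos": "noun", "prc0": "Al_det", "enc0": "0"}]

print(find_missing(words, analyses))
print(fill_with_backoff(words, analyses)[0]["lex"])
```

With camel_tools and the pretrained models installed, `analyses` can instead be the list returned by the `tag_sentence(words)` call above; the helpers only assume one feature dictionary per input word.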

Desktop (please complete the following information):