CAMeL-Lab / camel_morph

Camel Morph’s goal is to build large open-source morphological models for Arabic and its dialects across many genres and domains.
MIT License

Empty features `ud` and `catib` #2

Open mirkovogel opened 1 month ago

mirkovogel commented 1 month ago

The following observation concerns the LREC-Coling 2024 release (camel_morph/official_releases/lrec-coling2024_release/databases/camel-morph-msa):

The features catib6 and ud are always empty, e.g. in the following analysis of "فبسبب":

{
  'bw': 'فَ/CONJ+بِ/PREP+سَبَب/NOUN+ِ/CASE_DEF_GEN',
  'ud': '',
  'catib6': ''
}

The expected values are:

{
  'ud': 'CCONJ+ADP+NOUN',
  'catib6': 'PRT+PRT+NOM'
}

mirkovogel commented 1 month ago

Comment from @christios by mail:

As you've rightly pointed out, ud and catib are missing as we did not include those in the release (it was not our focus). But you are right they should be included in the next release. It should not be very difficult, probably just a mapping between the CAPHI POS (or Catib) and UD.

mirkovogel commented 1 month ago

I am currently transitioning my pipeline from the r13 morphological db to Camel Morph MSA and need both the catib6 and ud tags downstream, so I'd volunteer to help with this if I can.

Is there perhaps already code that converts the "native" POS tags of the database (https://camel-tools.readthedocs.io/en/latest/reference/camel_morphology_features.html?) to other tag sets, which I could use in the meantime?
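
As a stopgap, both features could probably be derived from the bw feature, which is populated. Below is a minimal sketch of that idea, not code from camel_morph or camel_tools: the helper name `fill_ud_catib6`, the lookup table, and the rule that case segments are skipped are all my own assumptions. Only the CONJ, PREP and NOUN rows are backed by the example above; the other rows are guesses that would need checking against the real tag sets.

```python
# Illustrative and incomplete mapping from the tags found in the `bw`
# feature to (UD, CATiB6) pairs. Only CONJ, PREP and NOUN are confirmed by
# the example in this issue; the remaining rows are assumptions.
BW_TO_UD_CATIB6 = {
    "CONJ": ("CCONJ", "PRT"),
    "PREP": ("ADP", "PRT"),
    "NOUN": ("NOUN", "NOM"),
    "NOUN_PROP": ("PROPN", "PROP"),
    "ADJ": ("ADJ", "NOM"),
    "PUNC": ("PUNCT", "PNX"),
}

# Assumption: inflectional segments such as case endings contribute no tag.
IGNORED_PREFIXES = ("CASE_",)


def fill_ud_catib6(analysis):
    """Return a copy of `analysis` with `ud` and `catib6` derived from `bw`."""
    ud_tags, catib6_tags = [], []
    for segment in analysis["bw"].split("+"):
        # Each segment looks like "<form>/<TAG>"; keep what follows the last slash.
        _, _, tag = segment.rpartition("/")
        if tag.startswith(IGNORED_PREFIXES):
            continue
        # A KeyError here means the table has no entry for this tag yet.
        ud, catib6 = BW_TO_UD_CATIB6[tag]
        ud_tags.append(ud)
        catib6_tags.append(catib6)
    return {**analysis, "ud": "+".join(ud_tags), "catib6": "+".join(catib6_tags)}
```

Applied to the analysis of "فبسبب" quoted above, this fills in the two expected values. Unmapped tags raise a KeyError on purpose, so gaps in the table show up immediately instead of silently producing wrong tags.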