New mode required

hectoralos commented 4 years ago

A new Chuvash grammar textbook is being prepared on the basis of a 3M+ word corpus and our morphological analysis. The author is asking for a composite output of the chv-morph and chv-segment modes in which he could more easily search for specific surface forms of morphemes.

For instance, we currently have these two analyses for ачисен:

$ echo "ачисен" | apertium -d . chv-morph
^ачисен/ача<n><px3sp><pl><gen>$^./.<sent>$

$ echo "ачисен" | apertium -d . chv-segment
^ачисен/ач>и>се>н$^./.$

He is asking for something like this:

^ачисен/ача<n>и<px3sp>се<pl>н<gen>$

This request seems reasonable and could probably be useful for other people and languages.

Could this more or less easily be done?

jonorthwash commented 4 years ago

I can think of two ways to do this, and both are hard:

1. Get the output of chv-morph and chv-segment and match things programmatically. This will be difficult with one-to-many mappings of tags to morphemes, though (e.g., <p3><sg>). Perhaps it could be done with a list of morphemes that have multiple tags (and tags that have multiple morphemes?).
2. Rewrite the transducer and add phonological processing to the analysis side. So for instance, we could have a line like this in lexc:
```
ӗ<px3sp>:%{ӗ%}   PLURAL ;
```
But we'd need to have the phonology make the ӗ on the left into и, given your example, which would require an extra twol transducer intersected with the analysis side of the transducer, and it would have to do weird things like look through the <n> tag.
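
To make the first option concrete, here is a rough Python sketch of the matching step. It is only an illustration under several assumptions that are not part of the actual tools: each unit has exactly one analysis, the first tag is the part of speech and stays with the lemma, and `MULTI_TAG_MORPHEMES` is a hypothetical hand-maintained table of tag sequences that a single morpheme realises (the `("p3", "sg")` entry just mirrors the example above). All function names are made up for this sketch.

```python
import re

# Hypothetical table: tag sequences that are realised by ONE morpheme.
# The single entry mirrors the <p3><sg> example; a real list would be curated.
MULTI_TAG_MORPHEMES = [("p3", "sg")]


def parse_morph(unit):
    """Split '^surface/lemma<t1><t2>$' into (surface, lemma, [tags])."""
    m = re.fullmatch(r"\^([^/]+)/([^<$]+)((?:<[^>]+>)*)\$", unit)
    if m is None:
        raise ValueError(f"unsupported morph unit: {unit!r}")
    surface, lemma, tags = m.groups()
    return surface, lemma, re.findall(r"<([^>]+)>", tags)


def parse_segment(unit):
    """Split '^surface/seg1>seg2>seg3$' into (surface, [segments])."""
    m = re.fullmatch(r"\^([^/]+)/([^$]+)\$", unit)
    if m is None:
        raise ValueError(f"unsupported segment unit: {unit!r}")
    surface, segments = m.groups()
    return surface, segments.split(">")


def group_tags(tags, table):
    """Group tags so that each group corresponds to one morpheme."""
    groups, i = [], 0
    while i < len(tags):
        for seq in table:
            if tuple(tags[i:i + len(seq)]) == seq:
                groups.append(list(seq))
                i += len(seq)
                break
        else:  # no multi-tag entry starts here: the tag stands alone
            groups.append([tags[i]])
            i += 1
    return groups


def combine(morph_unit, segment_unit, table=MULTI_TAG_MORPHEMES):
    """Interleave affix surface forms with their tags, or return None."""
    surface, lemma, tags = parse_morph(morph_unit)
    seg_surface, segments = parse_segment(segment_unit)
    if surface != seg_surface or not tags:
        return None
    groups = group_tags(tags[1:], table)  # assumption: tags[0] is the POS tag
    affixes = segments[1:]                # assumption: segments[0] is the stem
    if len(groups) != len(affixes):
        return None                       # cannot align; needs manual handling
    body = f"{lemma}<{tags[0]}>"
    for affix, group in zip(affixes, groups):
        body += affix + "".join(f"<{t}>" for t in group)
    return f"^{surface}/{body}$"


print(combine("^ачисен/ача<n><px3sp><pl><gen>$", "^ачисен/ач>и>се>н$"))
```

The sketch handles one lexical unit at a time, so the trailing `^./.<sent>$` unit and ambiguous analyses would have to be split off or dealt with separately. It also keeps the lemma (ача) rather than the surface stem (ач), as in the requested output.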

I note that you don't entirely apply phonology in the example, though (you have ача, not ач), so the questions become:

what are the exact requirements?, and
could something a little less close to the requirements but a lot easier be okay?

hectoralos commented 4 years ago

I copy-paste Artem Fedorinqyk's answer (Artem is the person who asked for this enhancement):

Maybe ^ачисен/ача<n>и<px3sp>сен<pl><gen>$ will be easier? Let it be no one-to-many mapping, but at least at the formal level we certainly can get radical "ача" and affixes "и" and "сен" and give each of them some meaning.
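
One possible reading of this relaxed format, sketched in Python: pair affixes with tags from left to right and let the last affix absorb any leftover tags. This is only an assumption about what is being asked for. The function name `relaxed_align` is hypothetical, the first tag is assumed to be the part of speech, and the coarser segmentation into "и" and "сен" is supplied by hand here rather than produced by chv-segment.

```python
def relaxed_align(lemma, tags, affixes):
    """Attach tags to affixes left to right; the last affix takes the rest.

    Returns the analysis body (without the '^surface/' and '$' wrapper),
    or None when there are more affixes than tags to distribute.
    """
    if not tags:
        raise ValueError("expected at least a part-of-speech tag")
    pos, rest = tags[0], tags[1:]
    if not affixes:
        # Nothing to align: all tags stay on the lemma.
        return lemma + "".join(f"<{t}>" for t in tags)
    if len(affixes) > len(rest):
        return None
    out = f"{lemma}<{pos}>"
    for i, affix in enumerate(affixes):
        is_last = i == len(affixes) - 1
        group = rest[i:] if is_last else rest[i:i + 1]
        out += affix + "".join(f"<{t}>" for t in group)
    return out


print(relaxed_align("ача", ["n", "px3sp", "pl", "gen"], ["и", "сен"]))
```

The trade-off is that the result is always well formed, but the grouping of leftover tags on the last affix is purely positional and carries no linguistic guarantee.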

hectoralos commented 4 years ago

Here is another example. If possible, Artem would like something similar to this output:

^юратӑвӗ/юрату<n>ӗ<px3sp>$

apertium / apertium-chv

New mode required #32