Ensure "FST output style" is consistent, even if the FST has no output

UAlbertaALTLab / morphodict

Plains Cree Intelligent Dictionary

https://itwewina.altlab.app/

Apache License 2.0


Ensure "FST output style" is consistent, even if the FST has no output #815

Open eddieantonio opened 3 years ago

eddieantonio commented 3 years ago

There are a few things here:

the database importer seems to want to create a valid analysis string for every entry
entries to be imported may not be analyzable by the FST (e.g., pê-)
the importer synthesizes an analysis from its declared wordclass

In the default case, the "FST output style" Title Cases all tags that are not noun or verb word classes:

https://github.com/UAlbertaALTLab/cree-intelligent-dictionary/blob/5b7ffa5f9ac1c649d2658e6b05b93714862d7a77/src/CreeDictionary/utils/enums.py#L102-L103

Is this... a good assumption? Should we change this assumption? Should we be synthesizing an analysis at all? Will the WordClass enum be scrapped in @andrewdotn's language generalization port?

Related: #814 — +Ipv was being generated here, although crkeng.xml has it as "IPV". This resulted in a failing test case.
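To make the mismatch concrete, here is a minimal Python sketch of the kind of synthesis described above. It is not the code at the linked lines: the function name `synthesize_tag` and the tag shape used for noun and verb classes are assumptions made for illustration. Only the Title Casing of the remaining classes and the `+Ipv` versus `IPV` discrepancy come from this issue.

```python
# Assumption: noun and verb word classes can be recognized by their first letter.
NOUN_OR_VERB_PREFIXES = ("N", "V")


def synthesize_tag(wordclass: str) -> str:
    """Hypothetical stand-in for the importer's "FST output style" conversion."""
    if wordclass.startswith(NOUN_OR_VERB_PREFIXES):
        # Assumed shape for nouns and verbs, e.g. "VTA" -> "+V+TA".
        return f"+{wordclass[0]}+{wordclass[1:]}"
    # Default case described in this issue: Title Case every other tag.
    return "+" + wordclass.title()


declared = "IPV"  # spelling used in crkeng.xml, per #814
synthesized = synthesize_tag(declared)

print(synthesized)                          # +Ipv
print(synthesized.lstrip("+") == declared)  # False
```

Under these assumptions, any comparison between the synthesized tag and the declared word class has to be case-insensitive, or the synthesis has to go away entirely.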

aarppe commented 3 years ago

[Switched pointer to paradigms to be (specific) wordclass]

Couldn't we just have as a possible value in the (FST) analysis field something like NULL, i.e. no analysis?

The inflectional category is a property of the dictionary database, not the FST. If the application of the FST to an entry head is successful, that gives us a lemma and an analysis; if not, we have neither analysis nor lemma for the entry head. However, all (non-generated) entry heads in the dictionary will have a specific wordclass (and an inflectional category; the two look similar in CW but should be considered distinct), even when we cannot analyze them with the FST. Some specific wordclasses are associated with a dynamic paradigm, others aren't. Some entry heads are associated with a static paradigm (in which case one doesn't need to try to figure out a dynamic paradigm); others aren't, in which case we try to resolve the dynamic paradigm based on the specific wordclass.
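The selection logic in the paragraph above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's data model: the `Entry` class, the `choose_paradigm` function, the contents of `DYNAMIC_PARADIGM_WORDCLASSES`, and the static paradigm name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption: an illustrative set of specific wordclasses that have a dynamic paradigm.
DYNAMIC_PARADIGM_WORDCLASSES = {"NA", "VAI", "VTA"}


@dataclass
class Entry:
    head: str
    wordclass: str                         # a property of the dictionary database
    analysis: Optional[str] = None         # None when the FST cannot analyze the head
    lemma: Optional[str] = None            # None whenever analysis is None
    static_paradigm: Optional[str] = None  # set only for some entry heads


def choose_paradigm(entry: Entry) -> Optional[Tuple[str, str]]:
    """Return ("static", name), ("dynamic", wordclass), or None."""
    if entry.static_paradigm is not None:
        # A static paradigm wins; no need to work out a dynamic one.
        return ("static", entry.static_paradigm)
    if entry.wordclass in DYNAMIC_PARADIGM_WORDCLASSES:
        return ("dynamic", entry.wordclass)
    return None


preverb = Entry(head="pê-", wordclass="IPV")  # unanalyzable, yet has a wordclass
with_static = Entry(head="hypothetical-head", wordclass="Pron",
                    static_paradigm="hypothetical-static-paradigm")

print(choose_paradigm(preverb))      # None
print(choose_paradigm(with_static))  # ('static', 'hypothetical-static-paradigm')
```

Note that nothing here reads the analysis field: the wordclass alone drives the choice, which is why it can stay NULL for entries like pê-.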

eddieantonio commented 3 years ago

Couldn't we just have as a possible value in the (FST) analysis field something like NULL, i.e. no analysis?

Yeah, honestly, I'm not sure why the current dictionary importer insists on creating an analysis string for every entry (@andrewdotn 👀 👀 👀 ). I can't think of why it would be necessary.

The inflectional category is a property of the dictionary database, not the FST. If the application of the FST to an entry head is successful, that gives us a lemma and an analysis; if not, we have neither analysis nor lemma for the entry head. However, all (non-generated) entry heads in the dictionary will have an inflectional category, even when we cannot analyze them with the FST. Some inflectional categories are associated with a dynamic paradigm, others aren't. Some entry heads are associated with a static paradigm (in which case one doesn't try to figure out a dynamic paradigm); others aren't, in which case we try to resolve the dynamic paradigm based on the inflectional paradigm.

This is good to know!

andrewdotn commented 3 years ago

To answer this conclusively, I’d have to spend more time with the code figuring out where exactly the analysis is used. But I do agree that, conceptually, it seems totally reasonable to leave the analysis field blank when the FST can’t analyze a wordform.

aarppe commented 3 years ago

To follow up with a conclusive opinion: we should record an FST analysis only when the normative analyzer provides one. If the FST is not able to analyze the head of an entry, which by design is the case for morphemes (like preverbs) and (mostly) phrases (when using a normative FST), then there should not be an analysis, and there should be no attempt at generating one - the analysis should be NULL. If there is no analysis (the analysis field is NULL), there is no (FST-)lemma (the corresponding field is NULL), and there should be no attempt at dynamic paradigm generation based on the analysis lemma and the specific word class associated with it (of course, static paradigm generation might still be allowed).

If the current code expects an analysis for every entry, that is categorically incorrect and misguided behavior - I do not see any linguistic or other reason why that should need to be the case - so that expectation should simply be purged, and the rest of the code revised to be happy with NULL analyses.
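As a rough sketch of what an importer that is "happy with NULL analyses" might look like, assuming Python's `None` stands in for NULL: every name below (`ImportedEntry`, `import_entry`, `can_generate_dynamic_paradigm`, `fake_analyzer`) and the sample analyzer data are hypothetical and do not come from the existing code.

```python
from typing import Callable, NamedTuple, Optional, Tuple

# Assumption: an analyzer returns (lemma, analysis), or None when it has no analysis.
Analyzer = Callable[[str], Optional[Tuple[str, str]]]


class ImportedEntry(NamedTuple):
    head: str
    wordclass: str
    analysis: Optional[str]
    lemma: Optional[str]


def import_entry(head: str, wordclass: str, analyze: Analyzer) -> ImportedEntry:
    """Store what the FST returns; never synthesize an analysis from the wordclass."""
    result = analyze(head)
    if result is None:
        # No analysis means no FST lemma either: both fields stay None (NULL).
        return ImportedEntry(head, wordclass, None, None)
    lemma, analysis = result
    return ImportedEntry(head, wordclass, analysis, lemma)


def can_generate_dynamic_paradigm(entry: ImportedEntry) -> bool:
    """Analysis-based paradigm generation needs both the analysis and the lemma."""
    return entry.analysis is not None and entry.lemma is not None


def fake_analyzer(head: str) -> Optional[Tuple[str, str]]:
    """Toy analyzer with made-up data; it knows one head and nothing else."""
    known = {"example-word": ("example-lemma", "example-lemma+Tag")}
    return known.get(head)


analyzed = import_entry("example-word", "VTA", fake_analyzer)
preverb = import_entry("pê-", "IPV", fake_analyzer)

print(can_generate_dynamic_paradigm(analyzed))  # True
print(preverb.analysis, preverb.lemma)          # None None
print(can_generate_dynamic_paradigm(preverb))   # False
```

Callers that display or index entries would then need to check for `None` rather than assume an analysis string is always present.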