Clean up the "wordform" model's properties

eddieantonio commented 3 years ago

The Wordform model represents a wordform in the source language.

A wordform should have the following fields or properties:

[ ] is_head: this wordform has explicit definitions. is_head == True iff the wordform has one or more definitions
[ ] word_class or specific_word_class — which could be a foreign key that references a SpecificWordClass model

A wordform should no longer have the following fields:

[ ] as_is
[ ] pos

TBD for the following field:

[ ] is_lemma: a lemma is the representative wordform for a lexeme. The codebase as of 2021-05-19 does not use the is_lemma field to mean this; however, each wordform's lemma field is a foreign key that references the representative Wordform; lemma representative wordforms reference themselves (self reference). Django magic allows you to do my_wordform.inflections to get all stored inflections of the lexeme!
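
To make the self-reference concrete, here is a minimal plain-Python sketch (not the actual Django model). It assumes a `lemma` attribute standing in for the self-referencing foreign key and an `inflections` list standing in for Django's reverse relation; the `text` and `definitions` attribute names and the placeholder definition string are hypothetical, and the example uses the alumnus lexeme from the comment below.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Wordform:
    text: str
    # Stand-in for the self-referencing foreign key: points at the
    # representative wordform; the representative points at itself.
    lemma: Optional["Wordform"] = field(default=None, repr=False)
    # Hypothetical stand-in for the related definitions.
    definitions: List[str] = field(default_factory=list)
    # Stand-in for Django's reverse relation (my_wordform.inflections).
    inflections: List["Wordform"] = field(default_factory=list, repr=False)

    def __post_init__(self) -> None:
        if self.lemma is None:
            self.lemma = self  # self reference
        # Every wordform, including the representative itself, shows up
        # in the representative's inflections.
        self.lemma.inflections.append(self)

    @property
    def is_head(self) -> bool:
        # True iff the wordform has one or more definitions.
        return len(self.definitions) > 0


alumnus = Wordform("alumnus", definitions=["<placeholder definition>"])
alumna = Wordform("alumna", lemma=alumnus)
alumnae = Wordform("alumnae", lemma=alumnus)
alumni = Wordform("alumnī", lemma=alumnus)

print([w.text for w in alumnus.inflections])
print(alumnus.is_head, alumna.is_head)
```

Because the representative wordform references itself, it also appears in its own `inflections`, which matches how a reverse lookup on a self-referencing foreign key would behave.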

@dwhieb please confirm that my linguistic terminology is accurate!

@andrewdotn, does this make sense? We can pair on it if you'd like!

eddieantonio commented 3 years ago

@dwhieb oh no, what is the name of the representative wordform of a lexeme?

e.g., for the lexeme with members alumna, alumnae, alumnī, what is the term for alumnus?

dwhieb commented 3 years ago

In lexicography, the representative wordform for a lexeme is called either a lemma or a head(word|phrase|morpheme). The two terms are synonyms in that context.

BUT, in computational linguistics, lemma essentially means 'linguistic stem'.

To avoid this ambiguity, @aarppe and I propose that we always call the representative wordform for a lexeme the head (or variations on that), and reserve the term lemma for the FST lemma specifically.

I also realize you'd already settled on the term head for something like this a while back (see here and here), so this proposal really just reaffirms that earlier decision, improves the definition a bit ("representative wordform for a lexeme" is perfect), and clarifies the use of lemma in this project.

I haven't had a chance to present this proposal to y'all and get your feedback yet, so this is still open to discussion. Obviously this would require everybody's approval.

If we adopt this terminology, we should adjust the code base to align with it. Doing so might require changing the names of the is_lemma and/or is_head properties to something else. One suggestion might be major entry (has definitions) and minor entry (does not have definitions), but that's just an idea.

with hope that I didn't just complicate things further,

always & forever, love sincerely,

Danny

aarppe commented 3 years ago

Almost all of the above is in line with our, and my, previous conceptions of the nature and contents of an entry (specifically for individual word-forms), with the exception of stem (linguistic, or FST, or alone).

As far as I've observed, and how our FSTs and dictionary sources (for Plains Cree, Tsuut'ina, and other languages we work with) organize and present their content:

For a computational morphological analyzer (of the FST sort, but generally), the lemma (or alternatively base/basic form) is one of the multiple inflected wordforms that constitute its (inflectional) paradigm (sometimes referred to collectively as a lexeme), selected to represent that lexeme, i.e. the representative word-form for a lexeme as @dwhieb noted above. Crucially, a lemma can occur independently as a word-form, requiring no further affixation.

For single-word dictionary entries, the head(word) would equal the lemma (for FST purposes), when the head is a lemma (i.e. the designated basic/base form for a lexeme). For single-word dictionary entries where the head(word) is not equal to any lemma, the associated lemma is the lemma for the lexeme of which this head(word) is a member (which can be derived via the FST-morphological analysis of the headword, when the lemma yielded by the analysis is not equal to the analyzed wordform).

For an FST-style computational model, the stem is the "internal" form to which (inflectional) affixes (prefixes, suffixes, infixes) are affixed/inserted. Often, the stem may be equal to the lemma, but sometimes this is not the case, which may vary from language to language (by convention, or linguistic characteristics) and also language-internally. Thus, the stem is not necessarily a free-standing, complete word-form, in comparison to the lemma, which is. Note that a lexeme can have more than one stem.

To confuse matters, early computational attempts at establishing a common label for a lexeme (and associated wordforms) implemented stemming rather than proper lemmatization, effectively stripping off the suffixes (and maybe prefixes), which might have resulted in a free-standing word-form (presumably the lemma), or not (potentially being the stem, but not necessarily so).
To confuse matters further, some lexicographical resources designate the stem as the head of an entry for a lexeme (e.g. in the Cree-to-English direction for the printed version of Arok's CW). Moreover, while all of the FSTs we have been creating produce a lemma (or basic/baseform), I think I've seen a few morphological analyzers where the stem is produced instead - computational linguists may not have been fully consistent here.
Even more unfortunately and confusingly, sometimes base or base form (at least in English lexicography) refers to the stem upon which not only inflectional but derivational morphemes are affixed. Thus, lemma in the CL sense would be the safer choice.

To complicate matters even further, in lexical databases some linguists (like Arok) leave out (effectively) derivational('ish) morphemes when presenting the stem, showing just the "innermost", minimal stem. Sometimes this minimal stem is referred to as the root of the lexeme/lemma.

In order for the FST to be able to undertake affixation properly, we need to specify the full stem (with all derivational morphemes included), which we have called the fststem.
We have lacked a definition for stem in our glossary, but the definition for root is pretty much as described above, and extended to morphologically more complex words (with multiple constituent non-inflectional morphemes) would apply to stem as well.

To concretize the above, hopefully faithfully to what we have discussed recently with @dwhieb:

a. Entry/head: atâhk --> lemma: atâhk --> stem: atâhkw- (--> root: atâhkw- --> morphemes: /atâhkw-/)
b. Entry/head: acâhkos --> lemma: acâhkos --> stem: acâhkos- (--> root: atâhkw- --> morphemes /atâhkw-/ + /-is/)

c. Entry/head: nimîw --> lemma: nimîw --> stem: nimî-- (--> root: nimî-)
d. Entry/head: nimînâniwan --> lemma: nimîw --> stem: nimî-- (--> root: nimî-)

e. Entry/head: apiw --> lemma: apiw --> stem: api- (--> root: api-)
f. Entry/head: ay-apiw --> lemma: ay-apiw --> stem: ay-api- (--> root: api-) [reduplication]

g. Entry/head: nipâw --> lemma: nipâw --> stem: nipâ- (--> root: nipâ-)
h. Entry/head: mâci-nipâw --> lemma: mâci-nipâw --> stem: mâci-nipâ- (--> root: nipâ-) [preverb]

To me, we could consequently use is_lemma and is_head as follows (which I believe is in line with @eddieantonio's notes above):

is_lemma=TRUE, if the headword for a single-word dictionary entry is equal to any lemma.
is_lemma=FALSE, if the headword for a single-word dictionary entry is not equal to any lemma (which would conveniently apply to phrases and morphemes as well, that are not lemmas).
is_head=TRUE, if a word-form (in the morphodict-internal database) is the head(word) of a dictionary entry (which may be a lemma or some other wordform), in which case there should be a human-produced definition/translation. Also, is_head=TRUE for multiword phrases as well as morphemes (for which is_lemma=FALSE).
is_head=FALSE, if a word-form (in the morphodict-internal database) is not the head(word) of a dictionary entry. This would apply to all the (inflected) word-forms that are generated with an FST for all the lemma-head entries, at the importation of the dictionary content.

For the above examples a-h, is_lemma=TRUE for a-c and e-h and is_lemma=FALSE for d, and is_head=TRUE for all cases a-h.
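
As a quick check of the rules above, here is a small Python sketch that derives both flags for examples a-h. The head/lemma pairs are copied from the examples; the function and variable names are hypothetical, and the lookup sets merely stand in for what the database and the FST analysis would provide.

```python
# (label, head, lemma) triples copied from examples a-h above.
EXAMPLES = [
    ("a", "atâhk", "atâhk"),
    ("b", "acâhkos", "acâhkos"),
    ("c", "nimîw", "nimîw"),
    ("d", "nimînâniwan", "nimîw"),
    ("e", "apiw", "apiw"),
    ("f", "ay-apiw", "ay-apiw"),
    ("g", "nipâw", "nipâw"),
    ("h", "mâci-nipâw", "mâci-nipâw"),
]

# Stand-ins for "all lemmas" and "all dictionary heads" in the database.
LEMMAS = {lemma for _, _, lemma in EXAMPLES}
HEADS = {head for _, head, _ in EXAMPLES}


def is_lemma(wordform: str) -> bool:
    # TRUE if the wordform is equal to any lemma.
    return wordform in LEMMAS


def is_head(wordform: str) -> bool:
    # TRUE if the wordform is the head(word) of a dictionary entry.
    return wordform in HEADS


flags = {label: (is_lemma(head), is_head(head)) for label, head, _ in EXAMPLES}

for label, (lemma_flag, head_flag) in flags.items():
    print(f"{label}: is_lemma={lemma_flag} is_head={head_flag}")
```

Only d comes out with is_lemma=FALSE, since nimînâniwan is a head whose lemma (nimîw) is a different wordform. A wordform generated by the FST that is not itself an entry would get is_head=FALSE under the same check.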

In sum, then, for the purposes of the morphodict database and LEXC generation for our FSTs:

a. lemma would be a complete, free-standing word-form (basic [word]form) that is selected to represent a lexeme (and all its inflected wordforms included in its paradigm). This would correspond to the Finnish perusmuoto and Swedish grundform.

b. fststem would be the string that represents the final stage in word-formation, upon which inflectional affixation is applied. This would correspond to the Finnish (sana)vartalo and Swedish (ord)stam.

Currently, the fststem is determined in the language-specific dictionary database, and the only thing our intelligent dictionaries should do is to show the fststem, when asked. One should note that not all dictionary entries necessarily have an fststem, and for some languages stems, and hence fststems, might be lacking entirely.
We will eventually want to incorporate and import also some combination of the roots, constituent morphemes, and "enhanced" stems with the morphophonologically special characters included, but that is for later.
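
A minimal Python sketch of the "show the fststem, when asked" behaviour, assuming the stem values in examples a-h are what would be stored as the fststem. The `Entry` class, the `show_fststem` helper, and the placeholder entry without a stem are all hypothetical and only illustrate that the field has to be optional.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    head: str
    # Not every entry has an fststem, so the field is optional.
    fststem: Optional[str] = None


def show_fststem(entry: Entry) -> str:
    # The dictionary only displays what the source database supplies;
    # it does not try to derive a stem itself.
    if entry.fststem is None:
        return f"{entry.head}: (no fststem)"
    return f"{entry.head}: {entry.fststem}"


with_stem = Entry(head="acâhkos", fststem="acâhkos-")
without_stem = Entry(head="<entry without a stem>")

print(show_fststem(with_stem))
print(show_fststem(without_stem))
```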

In the above I've tried to present a synthesis of the usage I've observed, which has tried to incorporate our many older and recent discussions about the matter, including exposure to Algonquian (Plains Cree) and Dene (e.g. Tsuut'ina) linguistic usage as we have done for years now, as far as I'm aware and have hopefully understood correctly. What is our misfortune here is that linguists (ourselves and myself) haven't been entirely consistent on these terms, either within individual languages or cross-linguistically. As you may note from above, there appear to be a multitude of terms for a multitude of related concepts, for which neither the terms nor the concepts seem to have been well defined or consistently used, and many seeming synonyms do not entirely match as to their denotation or connotation. On my part, I must confess the influence of Finnish (and Swedish/Nordic) lexicography here, though the relevance of that tradition to morphologically complex languages and computational morphology is not entirely to be dismissed.

UAlbertaALTLab / morphodict

Clean up the "wordform" model's properties #809