UAlbertaALTLab / morphodict

Plains Cree Intelligent Dictionary
https://itwewina.altlab.app/
Apache License 2.0

Clean up the "wordform" model's properties #809

Open eddieantonio opened 3 years ago

eddieantonio commented 3 years ago

The Wordform model represents a wordform in the source language.

A wordform should have the following fields or properties:

A wordform should no longer have the following fields:

TBD for the following field:

@dwhieb please confirm that my linguistic terminology is accurate!

@andrewdotn, does this make sense? We can pair on it if you'd like!

eddieantonio commented 3 years ago

@dwhieb oh no, what is the name of the representative wordform of a lexeme?

e.g., for the lexeme with members alumna, alumnae, alumnī, what is the term for alumnus?

dwhieb commented 3 years ago

In lexicography, the representative wordform for a lexeme is called either a lemma or a head(word|phrase|morpheme). The two terms are synonyms in that context.

BUT, in computational linguistics, lemma essentially means 'linguistic stem'.

To avoid this ambiguity, the proposal from @aarppe and me is that we should always call the representative wordform for a lexeme the head (or variations on that), and reserve the term lemma for the FST lemma specifically.

I also realize you'd already settled on the term head for something like this a while back (see here and here), so this proposal really just reaffirms that earlier decision, improves the definition a bit ("representative wordform for a lexeme" is perfect), and clarifies the use of lemma in this project.

I haven't had a chance to present this proposal to y'all and get your feedback yet, so this is still open to discussion. Obviously this would require everybody's approval.

If we adopt this terminology, we should adjust the code base to align with it. Doing so might require changing the names of the is_lemma and/or is_head properties to something else. One suggestion might be major entry (has definitions) and minor entry (does not have definitions), but that's just an idea.

with hope that I didn't just complicate things further,

always & forever, love sincerely,

Danny

aarppe commented 3 years ago

Almost all of the above is in line with our, and my, previous conceptions of the nature and contents of an entry (specifically for individual word-forms), with the exception of stem (linguistic, or FST, or alone).

As far as I've observed, this is how our FSTs and dictionary sources (for Plains Cree, Tsuut'ina, and other languages we work with) organize and present their content:

  1. For a computational morphological analyzer (of the FST sort, but also more generally), the lemma (or alternatively base/basic form) is the one wordform, out of the multiple inflected wordforms that constitute an (inflectional) paradigm (sometimes referred to collectively as a lexeme), that is selected to represent that lexeme, i.e. the representative word-form for a lexeme, as @dwhieb noted above. Crucially, a lemma can occur independently as a word-form, requiring no further affixation.
  2. For an FST-style computational model, the stem is the "internal" form to which (inflectional) affixes (prefixes, suffixes, infixes) are attached or into which they are inserted. Often the stem is identical to the lemma, but sometimes it is not; this may vary from language to language (by convention or because of linguistic characteristics) and also language-internally. Thus, unlike the lemma, the stem is not necessarily a free-standing, complete word-form. Note that a lexeme can have more than one stem.
  3. To complicate matters even further, in lexical databases some linguists (like Arok) leave out (effectively) derivational('ish) morphemes when presenting the stem, showing just the "innermost", minimal stem. Sometimes this minimal stem is referred to as the root of the lexeme/lemma.

To concretize the above, hopefully faithfully to what we have discussed recently with @dwhieb:

a. Entry/head: atâhk --> lemma: atâhk --> stem: atâhkw- (--> root: atâhkw- --> morphemes: /atâhkw-/)
b. Entry/head: acâhkos --> lemma: acâhkos --> stem: acâhkos- (--> root: atâhkw- --> morphemes: /atâhkw-/ + /-is/)

c. Entry/head: nimîw --> lemma: nimîw --> stem: nimî-- (--> root: nimî-)
d. Entry/head: nimînâniwan --> lemma: nimîw --> stem: nimî-- (--> root: nimî-)

e. Entry/head: apiw --> lemma: apiw --> stem: api- (--> root: api-)
f. Entry/head: ay-apiw --> lemma: ay-apiw --> stem: ay-api- (--> root: api-) [reduplication]

g. Entry/head: nipâw --> lemma: nipâw --> stem: nipâ- (--> root: nipâ-)
h. Entry/head: mâci-nipâw --> lemma: mâci-nipâw --> stem: mâci-nipâ- (--> root: nipâ-) [preverb]

To me, we could consequently use is_lemma and is_head as follows (which I believe is in line with @eddieantonio's notes above):

For the above examples a-h, is_lemma=TRUE for a-c and e-h and is_lemma=FALSE for d, and is_head=TRUE for all cases a-h.
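
The rule above can be sketched in Python. This is only an illustration, not morphodict's actual Wordform model: the `Entry` class and its field names are hypothetical, and it assumes that `is_lemma` can be derived by comparing an entry's head with the lemma of its lexeme, and that anything with its own entry counts as a head.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a dictionary entry (not the real model)."""

    label: str  # the letter used for the example in this comment
    head: str   # the wordform shown as the entry's head
    lemma: str  # the representative wordform of the lexeme

    @property
    def is_head(self) -> bool:
        # Assumption: every wordform that has its own entry is a head.
        return True

    @property
    def is_lemma(self) -> bool:
        # Assumption: a head is a lemma exactly when it equals the lemma
        # of the lexeme it belongs to.
        return self.head == self.lemma


# Head and lemma values copied from examples a-h above.
EXAMPLES = [
    Entry("a", "atâhk", "atâhk"),
    Entry("b", "acâhkos", "acâhkos"),
    Entry("c", "nimîw", "nimîw"),
    Entry("d", "nimînâniwan", "nimîw"),
    Entry("e", "apiw", "apiw"),
    Entry("f", "ay-apiw", "ay-apiw"),
    Entry("g", "nipâw", "nipâw"),
    Entry("h", "mâci-nipâw", "mâci-nipâw"),
]

for entry in EXAMPLES:
    print(entry.label, entry.head, entry.is_lemma, entry.is_head)
```

Under these assumptions, only example d (nimînâniwan, whose lemma is nimîw) comes out with `is_lemma` false, and all eight come out with `is_head` true.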

In sum, then, for the purposes of the morphodict database and LEXC generation for our FSTs:

a. lemma would be a complete, free-standing word-form (basic [word]form) that is selected to represent a lexeme (and all its inflected wordforms included in its paradigm). This would correspond to the Finnish perusmuoto and Swedish grundform.

b. fststem would be the string that represents the final stage in word-formation, upon which inflectional affixation is applied. This would correspond to the Finnish (sana)vartalo and Swedish (ord)stam.
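
As a rough sketch of how these two terms could sit side by side in one record: the `LexemeRecord` class, its field names, and the trailing-hyphen check are all assumptions made for illustration. They follow the notation used in the examples in this comment (stems written with a trailing hyphen, lemmas without) and are not taken from the morphodict code base. `fststems` is a tuple because, as noted above, a lexeme can have more than one stem.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LexemeRecord:
    """Hypothetical record pairing a lemma with its FST stem(s)."""

    lemma: str                 # complete, free-standing wordform
    fststems: Tuple[str, ...]  # string(s) that inflectional affixes attach to

    def __post_init__(self) -> None:
        # Assumption: in this notation a trailing hyphen marks a bound form,
        # so a lemma must not end in one.
        if self.lemma.endswith("-"):
            raise ValueError(f"lemma must be free-standing: {self.lemma!r}")
        if not self.fststems:
            raise ValueError("a lexeme needs at least one FST stem")
        for stem in self.fststems:
            # Assumption: stems are always written with a trailing hyphen.
            if not stem.endswith("-"):
                raise ValueError(f"stem should end in a hyphen: {stem!r}")


# Lemma/stem pairs copied from examples a and e above.
atahk = LexemeRecord(lemma="atâhk", fststems=("atâhkw-",))
apiw = LexemeRecord(lemma="apiw", fststems=("api-",))
```

Passing a stem where a lemma is expected, for example `LexemeRecord(lemma="atâhkw-", fststems=("atâhkw-",))`, raises a `ValueError`, which keeps the two notions from being mixed up silently.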

In the above I've tried to present a synthesis of the usage I've observed, incorporating our many older and recent discussions on the matter, including our years of exposure to Algonquian (Plains Cree) and Dene (e.g. Tsuut'ina) linguistic usage, as far as I'm aware and have hopefully understood correctly. Our misfortune here is that linguists (myself included) haven't been entirely consistent with these terms, either within individual languages or cross-linguistically. As you may note from the above, there is a multitude of terms for a multitude of related concepts; neither the terms nor the concepts have been well defined or consistently used, and many seeming synonyms do not entirely match in denotation or connotation. For my part, I must confess the influence of Finnish (and Swedish/Nordic) lexicography here, though the relevance of that tradition to morphologically complex languages and computational morphology is not entirely to be dismissed.