Open eddieantonio opened 3 years ago
@dwhieb oh no, what is the name of the representative wordform of a lexeme?
e.g., for the lexeme with members alumna, alumnae, alumnī, what is the term for alumnus?
In lexicography, the representative wordform for a lexeme is called either a lemma or a head(word|phrase|morpheme). The two terms are synonyms in that context.
BUT, in computational linguistics, lemma essentially means 'linguistic stem'.
To avoid this ambiguity, @aarppe's and my proposal is that we should always call the representative wordform for a lexeme the head (or variations on that), and reserve the term lemma for the FST lemma specifically.
I also realize you'd already settled on the term head for something like this a while back (see here and here), so this proposal really just reaffirms that earlier decision, improves the definition a bit ("representative wordform for a lexeme" is perfect), and clarifies the use of lemma in this project.
I haven't had a chance to present this proposal to y'all and get your feedback yet, so this is still open to discussion. Obviously this would require everybody's approval.
If we adopt this terminology, we should adjust the code base to align with it. Doing so might require changing the names of the is_lemma and/or is_head properties to something else. One suggestion might be major entry (has definitions) and minor entry (does not have definitions), but that's just an idea.
with hope that I didn't just complicate things further,
always & forever, love sincerely,
Danny
Almost all of the above is in line with our, and my, previous conceptions of the nature and contents of an entry (specifically for individual word-forms), with the exception of stem (whether qualified as linguistic or FST, or used alone).
As far as I've observed, and how our FSTs and dictionary sources (for Plains Cree, Tsuut'ina, and other languages we work with) organize and present their content:
To confuse matters, early computational attempts at establishing a common label for a lexeme (and associated wordforms) implemented stemming rather than proper lemmatization, effectively stripping off the suffixes (and maybe prefixes), which might have resulted in a free-standing word-form (presumably the lemma), or not (potentially being the stem, but not necessarily so).
To confuse matters further, some lexicographical resources designate the stem as the head of an entry for a lexeme (e.g. in the Cree-to-English direction for the printed version of Arok's CW). Moreover, while all of the FSTs we have been creating produce a lemma (or basic/baseform), I think I've seen a few morphological analyzers where the stem is produced instead - computational linguists may not have been fully consistent here.
Even more unfortunately and confusingly, sometimes base or base form (at least in English lexicography) refers to the stem upon which not only inflectional but derivational morphemes are affixed. Thus, lemma in the CL sense would be the safer choice.
In order for the FST to be able to undertake affixation properly, we need to specify the full stem (with all derivational morphemes included), which we have called the fststem.
We have lacked a definition for stem in our glossary, but the definition for root is pretty much as described above and, when extended to morphologically more complex words (with multiple constituent non-inflectional morphemes), would apply to stem as well.
To concretize the above, hopefully faithfully to what we have discussed recently with @dwhieb:
a. Entry/head: atâhk --> lemma: atâhk --> stem: atâhkw- (--> root: atâhkw- --> morphemes: /atâhkw-/)
b. Entry/head: acâhkos --> lemma: acâhkos --> stem: acâhkos- (--> root: atâhkw- --> morphemes: /atâhkw-/ + /-is/)
c. Entry/head: nimîw --> lemma: nimîw --> stem: nimî-- (--> root: nimî-)
d. Entry/head: nimînâniwan --> lemma: nimîw --> stem: nimî-- (--> root: nimî-)
e. Entry/head: apiw --> lemma: apiw --> stem: api- (--> root: api-)
f. Entry/head: ay-apiw --> lemma: ay-apiw --> stem: ay-api- (--> root: api-) [reduplication]
g. Entry/head: nipâw --> lemma: nipâw --> stem: nipâ- (--> root: nipâ-)
h. Entry/head: mâci-nipâw --> lemma: mâci-nipâw --> stem: mâci-nipâ- (--> root: nipâ-) [preverb]
To me, we could consequently use is_lemma and is_head as follows (which I believe is in line with @eddieantonio's notes above):
For the above examples a-h, is_lemma=TRUE for a-c and e-h and is_lemma=FALSE for d, and is_head=TRUE for all cases a-h.
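To make the a-h assignment concrete, here is a minimal Python sketch. It is not the morphodict implementation: the Entry class and its field names are hypothetical, and it assumes that is_lemma simply means "this head is its own lemma", which is one reading of the proposal above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical record for one of the examples a-h above."""

    label: str            # example letter, a-h
    head: str             # wordform shown as the entry's head
    lemma: str            # representative wordform of the lexeme
    is_head: bool = True  # all eight examples are dictionary entries

    @property
    def is_lemma(self) -> bool:
        # Assumption: a wordform counts as a lemma when it is the
        # representative wordform of its own lexeme.
        return self.head == self.lemma


ENTRIES = [
    Entry("a", "atâhk", "atâhk"),
    Entry("b", "acâhkos", "acâhkos"),
    Entry("c", "nimîw", "nimîw"),
    Entry("d", "nimînâniwan", "nimîw"),
    Entry("e", "apiw", "apiw"),
    Entry("f", "ay-apiw", "ay-apiw"),
    Entry("g", "nipâw", "nipâw"),
    Entry("h", "mâci-nipâw", "mâci-nipâw"),
]

non_lemmas = [entry.label for entry in ENTRIES if not entry.is_lemma]
print(non_lemmas)  # ['d']
```

Only d comes out with is_lemma false, because nimînâniwan is a head in its own right but belongs to the lexeme represented by nimîw.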
In sum, then, for the purposes of the morphodict database and LEXC generation for our FSTs:
a. lemma would be a complete, free-standing word-form (basic [word]form) that is selected to represent a lexeme (and all its inflected wordforms included in its paradigm). This would correspond to the Finnish perusmuoto and Swedish grundform.
b. fststem would be the string that represents the final stage in word-formation, upon which inflectional affixation is applied. This would correspond to the Finnish (sana)vartalo and Swedish (ord)stam.
Currently, the fststem is determined in the language-specific dictionary database, and the only thing our intelligent dictionaries should do is to show the fststem, when asked. One should note that not all dictionary entries necessarily have an fststem, and for some languages stems, and hence fststems, might be lacking entirely.
We will eventually want to incorporate and import also some combination of the roots, constituent morphemes, and "enhanced" stems with the morphophonologically special characters included, but that is for later.
In the above I've tried to present a synthesis of the usage I've observed, which has tried to incorporate our many older and recent discussions about the matter, including exposure to Algonquian (Plains Cree) and Dene (e.g. Tsuut'ina) linguistic usage as we have done for years now, as far as I'm aware and have hopefully understood correctly. Our misfortune here is that linguists (ourselves and myself included) haven't been entirely consistent on these terms, either within individual languages or cross-linguistically. As you may note from the above, there appear to be a multitude of terms for a multitude of related concepts, neither the terms nor the concepts seem to have been well defined or consistently used, and many seeming synonyms do not entirely match in denotation or connotation. On my part, I must confess the influence of Finnish (and Swedish/Nordic) lexicography here, though the relevance of that tradition to morphologically complex languages and computational morphology is not entirely to be dismissed.
The Wordform model represents a wordform in the source language.
A wordform should have the following fields or properties:
- is_head: this wordform has explicit definitions. is_head == True iff the wordform has one or more definitions.
- word_class or specific_word_class, which could be a foreign key that references a SpecificWordClass model.
A wordform should no longer have the following fields:
- as_is
- pos
TBD for the following field:
- is_lemma: a lemma is the representative wordform for a lexeme. The codebase as of 2021-05-19 does not use the is_lemma field to mean this; however, each wordform's lemma field is a foreign key that references the representative Wordform; lemma (representative) wordforms reference themselves (self-reference). Django magic allows you to do my_wordform.inflections to get all stored inflections of the lexeme!
@dwhieb please confirm that my linguistic terminology is accurate!
@andrewdotn, does this make sense? We can pair on it if you'd like!
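As a starting point for that conversation, here is a rough plain-Python sketch of the relationships described above. It is not the actual Django model: the constructor signature is hypothetical, the plain inflections list only stands in for what Django's reverse relation would provide, the is_lemma property shows just one possible meaning for the field that is still TBD, and the definition strings are placeholders.

```python
class Wordform:
    """Plain-Python stand-in for the proposed Wordform model (not Django)."""

    def __init__(self, text, lemma=None, definitions=()):
        self.text = text
        self.definitions = list(definitions)
        # Stand-in for the reverse side of the lemma foreign key.
        self.inflections = []
        # Representative wordforms reference themselves.
        self.lemma = lemma if lemma is not None else self
        # Register with the representative wordform, so the lemma is
        # also listed among its own inflections (as with a self-reference).
        self.lemma.inflections.append(self)

    @property
    def is_head(self):
        # is_head == True iff the wordform has one or more definitions.
        return len(self.definitions) > 0

    @property
    def is_lemma(self):
        # One possible meaning (still TBD): the wordform is its own lemma.
        return self.lemma is self


# Example d from earlier in the thread; definitions are placeholders.
nimiw = Wordform("nimîw", definitions=["(placeholder definition)"])
nimihk = Wordform("nimînâniwan", lemma=nimiw,
                  definitions=["(placeholder definition)"])

print([w.text for w in nimiw.inflections])  # ['nimîw', 'nimînâniwan']
print(nimihk.is_head, nimihk.is_lemma)      # True False
```

In this sketch is_head is derived from the definitions rather than stored, which keeps it from drifting out of sync; whether the real model should store or compute it is a separate design decision.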