UAlbertaALTLab / morphodict

The Language Independent Intelligent Dictionary
https://morphodict.readthedocs.io/
Apache License 2.0

update glossary with information from ALTLab dictionary database #755

Open dwhieb opened 3 years ago

dwhieb commented 3 years ago

Update the glossary and style guide with information from the ALTLab dictionary database. Add quotes / citations / references in support of terms where possible. Make these documents more visible by including a link to them from the README.

dwhieb commented 3 years ago

Notes on the term lemma

Continuous text consists of 'wordforms' (like permitted or permits), but the headwords in a dictionary are generally 'lemmas' like permit, and this is the object that we usually want to study. A lemmatization program takes as its input the various word-forms and maps them on to the lemma they belong to. (Atkins & Rundell 2008: 88)

The canonical form, sometimes called the lemma, is the form chosen to represent a paradigm; most headwords, with the exception of cross-references and names, are canonical forms. (Landau 2001: 98)

The lemma functions as a representative of a linguistic sign; in a dictionary it represents the lexical item described in the individual dictionary entry. (Svensén 2009: 93)

A common synonym of 'lemma' is 'headword'. In this book, the term 'lemma' has been preferred mainly because 'headword' will be somewhat problematic when the lemma consists of a multi-word lexical item. (Svensén 2009: 93 fn. 1)

References

aarppe commented 3 years ago

The glossary was created based on an extensive scrutiny of the electronic versions of the Cree dictionaries and glossaries we have at our disposal. Importantly, all these sources have as entries (with a dedicated definition or translation) not just 1) word-forms which are lemmas, but also 2) word-forms which are not lemmas, 3) phrases consisting of multiple word-forms which may not be all lemmas, and 4) morphemes.

Based on this, we decided to call an individual item in an electronic dictionary database (a) an entry, and the label/title/key of an entry (b) a head (without word, to deal with its nebulous nature). I haven't been entirely satisfied with these terms: e.g. one could use 1) entry for both the dictionary record (a) and its name/label (b), or 2) record for the entirety (a) that the label entry (b) denotes, or 3) entry for the record (a) and key or label for the denotation (b).

Anyhow, a year ago we formulated the following overview of the different types of entries and the types of attributes they could take, and these were, or should have been, described in our glossary. I've revised it so that I'm using entry for the entire record, and key for the label identifying and signifying the entry, or record. The primary organizing principle and starting point here are the different types of forms which are paired with meanings in the dictionary sources we have. In particular, as far as I know, lemma is primarily reserved for individual word-forms, and applies neither to multi-word phrases (though I have seen some references to phrasal constructions such as verb + preposition pairings) nor to sub-word items such as morphemes. On the other hand, the term lexeme has been suggested for any type of form which is listed in a dictionary, but that would not seem to apply to a non-lemma inflected word-form or to a morpheme.

[edit: key -> head; removed references to speech, as they're stored elsewhere]

dictionary
1->n entry

entry
1->1 head
1->1 type: {wordform, lemma, phrase, morpheme}
1->n sense
   1->n definition/translation
   1->n source
[1->n recording]
1->1 key
...

entry/type = wordform
1->1 analysis/inflectional
   1->1 entry/lemma

entry/type = lemma
1->1 {wordclass, inflectional-category}
1->1 {stem, fststem}
1->1 analysis/inflectional
   1->1 entry/lemma
1->n analysis/derivational
1->1 analysis/inflectional
   1->1 entry/lemma
   1->n entry/morpheme

entry/type = phrase
1->n analysis/inflectional
   1->1 {entry/wordform, entry/lemma}

entry/type = morpheme
   1->n entry/morpheme
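
One way to read the overview above is as a set of record types. The following is a minimal Python sketch under that reading, not morphodict's actual models. The class and field names (`Entry`, `EntryType`, `Sense`, `lemma_head`, `morpheme_heads`) are made up for illustration, and only the attributes shared by all entry types plus the wordform-to-lemma and lemma-to-morpheme links are modelled.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class EntryType(Enum):
    WORDFORM = "wordform"
    LEMMA = "lemma"
    PHRASE = "phrase"
    MORPHEME = "morpheme"


@dataclass
class Sense:
    # Each sense has 1->n definitions/translations and 1->n sources.
    definitions: List[str]
    sources: List[str]


@dataclass
class Entry:
    head: str                 # 1->1 head: the label of the entry
    entry_type: EntryType     # 1->1 type
    senses: List[Sense] = field(default_factory=list)  # 1->n sense
    # Hypothetical link for type = wordform: head of the lemma entry
    # that this word-form is an inflected form of.
    lemma_head: Optional[str] = None
    # Hypothetical link for type = lemma: heads of the morpheme entries
    # named by a derivational analysis.
    morpheme_heads: List[str] = field(default_factory=list)


@dataclass
class Dictionary:
    entries: List[Entry] = field(default_factory=list)  # 1->n entry

    def by_type(self, entry_type: EntryType) -> List[Entry]:
        """Return all entries of the given type, in insertion order."""
        return [e for e in self.entries if e.entry_type is entry_type]


# Usage with the English examples quoted earlier in the thread.
dictionary = Dictionary(
    [
        Entry("permit", EntryType.LEMMA),
        Entry("permits", EntryType.WORDFORM, lemma_head="permit"),
    ]
)
```

Linking by head text, as done here, breaks down as soon as two entries share the same head, which is the homograph problem raised below.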

I could imagine that the above could be organized primarily based on lemmas, so that inflected word-forms, phrases, and morphemes could all be grouped under lemmas, but that might come with its own complications.

The various lexicographical sources seem biased towards lemmas being individual word-forms (despite the occasional lip service to multi-word items, e.g. equating headword with lemma), and moreover appear constrained by the printed form and formatting of a dictionary. It would seem to me that a revision/proposal that more accurately represents how an electronic dictionary/lexical database is organized, and what sorts of items/entries it could contain, might be worth pursuing. Consider e.g. the possibility of enumerating for any lemma all possible inflected word-forms as a paradigm, and providing generated or manual definitions for all these word-forms (which could be extended to apply to derivation(s) as well).

andrewdotn commented 3 years ago

My only comment here from a software perspective is that it would be incredibly useful for the linguists to provide a unique identifier for every dictionary entry. This can be used in URLs, database keys, and generated models to provide stable mappings to entries.

Danny mentioned this and I think we were going to call it ‘key’? It is a slightly different concept from the head(word) because, although it almost always matches the wordform, in the presence of homographs, we need different keys to refer to the different entries with the same text, e.g.,

Right now the code tries to do this, and it mostly works, but it slows down every search, and the generated values can become invalid when the dictionary or the FST is updated.

These would only be visible in (1) the source dictionary that the linguist edits, (2) URLs for homographs, (3) the database itself, and (4) other linguistic data that needs to refer to specific entries.
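
A minimal sketch of the idea, assuming keys are assigned once in the source dictionary and then stored, rather than being regenerated at search time. The `head@n` key format and the `assign_keys` helper are hypothetical and chosen only for illustration; nothing in this thread fixes a format.

```python
from typing import Dict, List, Optional


def assign_keys(entries: List[Dict[str, Optional[str]]]) -> None:
    """Give every entry that lacks a key a new unique one, in place.

    Existing keys are never changed, so URLs and cross-references that
    already use them stay valid when entries are added later.
    """
    used = {e["key"] for e in entries if e.get("key")}
    for entry in entries:
        if entry.get("key"):
            continue  # keep keys that were assigned earlier
        n = 1
        while f"{entry['head']}@{n}" in used:
            n += 1
        entry["key"] = f"{entry['head']}@{n}"
        used.add(entry["key"])


# Two homographs (same head text) plus one unrelated entry. The first
# homograph already has a stored key from an earlier run.
entries = [
    {"head": "bank", "key": "bank@1"},
    {"head": "bank"},
    {"head": "permit"},
]
assign_keys(entries)
```

Because the stored key, not the head, identifies the entry, the two `bank` entries stay distinguishable even though their text is identical.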

aarppe commented 3 years ago

As references for our established terminology usage anchored in computational linguistic tradition, the glossary in Richard Sproat's Computation and Morphology (1992) is most helpful, in particular for stem. For the meaning of lemma(tization) and lexeme, the examples suggest that the "label" for a lexeme that is derived with lemmatization is the representative form (aka FST-lemma) that has been discussed above.

STEM A (possibly polymorphemic) morphological unit to which an affix attaches. In overburdened, overburden is the stem to which -ed attaches, and burden is the stem to which over- attaches. In some authors' usage (see e.g. Bauer 1983, pp. 20-21), the term base is used as stem is used here, and stem refers only to the base for an inflectional affix; on this definition, overburden would be a stem, but burden would not be (since over- is not an inflectional affix).

LEMMATIZATION The process of computing a normalized form - e.g., a dictionary entry - for a word of text. For example, cats would lemmatize to cat.

LEXEME A "dictionary word." The set of lexemes of a language is given by the set of all word forms of the language after "normalization" for inflectional morphology. Both overburdened and overburdens are forms of the same lexeme, namely overburden. The words overburden and burden constitute different lexemes, though one is clearly derived from the other.

Another relevant computational morphology source is Beesley & Karttunen (2003), though they do not mention lemma at all in the index but refer throughout to baseform (multiple occurrences).

By convention, Xerox lexical transducers have lexical (upper-side) strings that consist of BASEFORMS and multicharacter-symbols and tags. The baseform itself is the headword used conventionally when looking up the surface words in a standard printed (or perhaps online) dictionary. (p. 285)

However, the above passage continues by referring to lemma but quite confusingly, apparently corresponding to lexeme as defined by Sproat:

... For example, a Spanish verb lemma consist of perhaps 300 different surface forms, but only the infinitive form, the conventional baseform, appears as headword in dictionaries. (p. 285)

Next, an online glossary on language technological terminology by Kimmo Koskenniemi (the author of Two-Level Morphology which was the precursor of Xerox-style finite-state models), has the following definitions, with indications of English translation equivalents.

EN:lemma [FI:lemma] "Lemma is the header used for word-forms which belong together. Lemma is often the baseform of the headword for this set of inflected wordforms. The relatedness may be looser than that of inflected word-forms of the same baseform." Lemma on yhteen kuuluvien sananmuotojen otsikkona käytetty sana. Lemma on usein sen hakusanan perusmuoto, jonka taivutusmuodosta on kyse. Yhteenkuuluvuus voi olla väljempääkin kuin se, että sananmuodot ovat saman lekseemin taivutusmuotoja. (F. Karlsson 1998: s. 188)

EN:stem (of a word) [FI:vartalo, (sanan vartalo)] "With affixing a stem, one can get either new stems or wordforms (in some inflected form). A stem may consist of one or more morphemes. ..." Vartalosta saadaan affiksoimalla joko uusia vartaloita tai sananmuotoja (jossakin taivutusmuodossaan). Vartalo voi koostua yhdestä tai useammasta morfeemista. (R. Sproat 1992: Glossary, p. 249.)

Finally, one of the better-known textbooks on computational linguistics by Jurafsky and Martin (latest draft from December 2020 for next edition) has the following definitions (p. 3):

Another part of text normalization is lemmatization, the task of determining that two words have the same root, despite their surface differences. For example, the words sang, sung, and sings are forms of the verb sing. The word sing is the common lemma of these words, and a lemmatizer maps from all of these to sing. Lemmatization is essential for processing morphologically complex languages like Arabic. Stemming refers to a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word.
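
The contrast between the two operations in this passage can be made concrete with a toy Python sketch. It is an illustration only: the `LEMMAS` lookup table stands in for a real morphological analyzer (such as an FST), and the suffix list in `stem` is an arbitrary assumption, not a published stemming algorithm.

```python
from typing import Dict

# Toy lookup table: every known word-form maps to its lemma.
LEMMAS: Dict[str, str] = {
    "sang": "sing",
    "sung": "sing",
    "sings": "sing",
    "sing": "sing",
}

# Suffixes tried in order by the toy stemmer.
SUFFIXES = ("ing", "ed", "s")


def lemmatize(wordform: str) -> str:
    """Return the lemma of a known word-form, else the word-form itself."""
    return LEMMAS.get(wordform, wordform)


def stem(wordform: str) -> str:
    """Strip the first matching suffix, using no dictionary knowledge.

    A suffix is only stripped if at least three characters remain.
    """
    for suffix in SUFFIXES:
        if wordform.endswith(suffix) and len(wordform) > len(suffix) + 2:
            return wordform[: -len(suffix)]
    return wordform


print(lemmatize("sang"), stem("sang"))    # sing sang
print(lemmatize("sings"), stem("sings"))  # sing sing
print(stem("permitted"))                  # permitt
```

The lemmatizer relates sang to sing because it knows the paradigm, while the stemmer only succeeds where the inflection happens to be a strippable suffix, and it can produce non-words such as permitt.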

Later on, lemma is defined more explicitly, seemingly as both the lexeme and the representative/citation wordform used for a lexeme (p. 12):

A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense. The word-form is the full inflected or derived form of the word.

I believe that these references can be interpreted to yield our definitions for (FST-)lemma and (FST-)lemma, or alternatively, that our definitions are reconcilable with these references.

dwhieb commented 3 years ago

Thanks for taking the time to go through those references! This is helpful!