UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Status of morphosyntactic features and values #985

Open sylvainkahane opened 10 months ago

sylvainkahane commented 10 months ago

This is a follow-up to previous discussions we had about deciding whether a feature must be instantiated or not (the last one concerned the Voice feature in English, see #290).

First, some features are associated with inflectional morphemes, while others are lexical features. Examples of lexical features are Gender, Number and Person on pronouns in English, while examples of inflectional features are Number and Person on the verb agreeing with its subject:

she      PRON, Gender[lex]=Fem, Number[lex]=Sing, Person[lex]=3
wants    VERB, VerbForm[infl]=Fin, Tense[infl]=Pres, Number[infl]=Sing, Person[infl]=3

Another example is the Gender agreement of adjectives and articles with the noun in French. Gender is a lexical feature of NOUNs (Gender[lex]), while it is an inflectional feature of ADJs and DETs (Gender[infl]):

la  DET, Gloss=the:SG.FEM, Definite[lex]=Def, Gender[infl]=Fem, Number[infl]=Sing, PronType[lex]=Art
table   NOUN, Gloss=table:SG, Gender[lex]=Fem, Number[infl]=Sing
blanche ADJ, Gloss=white:SG:FEM, Gender[infl]=Fem, Number[infl]=Sing

Note that Definiteness is a lexical feature, while Number is an inflectional feature on NOUNs, ADJs and articles (it is a lexical feature on most other DETs). PronType is an inherently lexical feature.

But there is a third use of morphosyntactic features: the denominative use. For instance, English has two participles, the so-called present and past participles. The English treebanks use the features Tense=Pres and Tense=Past to distinguish them. This is quite problematic because the values of these participles are more aspectual than temporal:

driven  VERB, VerbForm[infl]=Part, Tense[denom]=Past, Aspect[infl]=Imp
driving VERB, VerbForm[infl]=Part, Tense[denom]=Pres, Aspect[infl]=Prog

Second, in some cases, an inflectional feature is not instantiated on a given lexeme. For instance, French has many ADJs that do not show variation in Gender, such as rouge ‘red’, facile ‘easy’, etc. Nevertheless, the value can generally be deduced from the context. For the French treebanks, we have thus instantiated the Gender feature each time its value could be deduced from the context. This could be indicated on the value:

table   NOUN, Gloss=table:SG, Gender[lex]=Fem, Number[infl]=Sing
rouge   ADJ, Gloss=red:SG, Gender[infl]=Fem[ctxt], Number[infl]=Sing

English treebanks contain a lot of contextual values (due to the very poor inflectional morphology of English). For instance, every -ing verbal form can be VerbForm=Part or VerbForm=Ger. This can only be deduced from the context: no English verb has different forms for the present participle and the gerund. “Only-contextual” features could be distinguished:

driving VERB, VerbForm[infl]=Part[ctxt], Tense[denom]=Pres, Aspect[infl]=Prog[only-ctxt]
driving VERB, VerbForm[infl]=Ger[only-ctxt]

Note that VerbForm=Part is just [ctxt] because the value is marked for past participles of some verbs (those distinguishing past participles and preterit). For past participles of transitive verbs, we have an opposition between imperfect forms (she has driven the car) and passive forms (the car was driven):

driven  VERB, VerbForm[infl]=Part, Tense[denom]=Past, Aspect[infl]=Imp[ctxt]
driven  VERB, VerbForm[infl]=Part, Tense[denom]=Past, Voice[infl]=Pass[only-ctxt]

The bare form of the verb is also ambiguous and can only be disambiguated contextually:

drive   VERB, VerbForm[infl]=Inf[only-ctxt]
drive   VERB, VerbForm[infl]=Fin[ctxt], Tense[infl]=Pres[ctxt], Number[infl]=Plur[ctxt], Person[infl]=1[ctxt]

Note that the value VerbForm=Inf is only contextual [only-ctxt], but not the value VerbForm=Fin, since finiteness is marked for the 3SG present form, as well as for the past form of some verbs.

drives  VERB, VerbForm[infl]=Fin, Tense[infl]=Pres, Number[infl]=Sing, Person[infl]=3
drove   VERB, VerbForm[infl]=Fin, Tense[infl]=Past, Number[infl]=Sing[ctxt], Person[infl]=3[ctxt]

It means that we can distinguish features for which some values can be marked (VerbForm, Tense, etc.) and features for which all values are contextual (Voice).
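To make the notation used in the examples above concrete, here is a minimal Python sketch of a parser for lines of the form `form UPOS, Feature[status]=Value[ctxt], ...`. The notation is the one proposed in this thread and is not part of CoNLL-U; the names `Feat`, `parse_line` and the regular expression are hypothetical and only for illustration.

```python
import re
from typing import NamedTuple, Optional


class Feat(NamedTuple):
    name: str
    status: Optional[str]   # 'infl', 'lex', 'denom', or None if not given
    value: str
    context: Optional[str]  # 'ctxt', 'only-ctxt', or None if the value is marked


# Hypothetical grammar for one feature in the thread's notation.
_FEAT_RE = re.compile(
    r"^(?P<name>\w+)"
    r"(?:\[(?P<status>infl|lex|denom)\])?"
    r"=(?P<value>[^\[\]]+)"
    r"(?:\[(?P<context>ctxt|only-ctxt)\])?$"
)


def parse_line(line: str):
    """Split 'form UPOS, Feat[status]=Val[ctxt], ...' into (form, upos, feats)."""
    head, *rest = [part.strip() for part in line.split(",")]
    form, upos = head.split()
    feats = []
    for item in rest:
        m = _FEAT_RE.match(item)
        if m is None:
            raise ValueError(f"cannot parse feature: {item!r}")
        feats.append(Feat(**m.groupdict()))
    return form, upos, feats


form, upos, feats = parse_line(
    "drove   VERB, VerbForm[infl]=Fin, Tense[infl]=Past, "
    "Number[infl]=Sing[ctxt], Person[infl]=3[ctxt]"
)
# Features whose value is only recoverable from context on this form.
contextual = [f.name for f in feats if f.context is not None]
```

With such a structure, separating marked values from contextual ones is a simple filter on the `context` field.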

Of course it would be too costly to add the status of features and values to each occurrence, but it would be useful for people exploiting a treebank to know the status of features and values. We could ask in the guidelines associated with the validator whether a feature is inflectional [infl], lexical [lex] or denominative [denom], and maybe also whether the feature has values which are only contextual [only-ctxt]. Such information would be very useful for linguists exploiting the treebanks. Currently, if we study noun-adjective Gender agreement in French, it is difficult to know from the treebank alone when this agreement is actually marked. The same holds for verb-subject Person agreement in English. And if we study Tense in English (without any knowledge of the language), we get strange results due to the Tense feature on participles.
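As an illustration of the agreement study mentioned above, here is a small Python sketch that keeps only the noun-adjective pairs whose adjective Gender is morphologically marked. It assumes the value carries the `[ctxt]` or `[only-ctxt]` suffix proposed in this thread; the function name and the data layout are hypothetical.

```python
def marked_agreement(pairs):
    """Keep noun-adjective pairs whose adjective Gender is morphologically marked.

    Each pair is (noun_form, adj_form, adj_gender), where adj_gender uses the
    thread's notation: 'Fem' (marked on the form) or 'Fem[ctxt]' /
    'Fem[only-ctxt]' (deduced from context).
    """
    return [(noun, adj) for noun, adj, gender in pairs
            if not gender.endswith("ctxt]")]


pairs = [
    ("table", "blanche", "Fem"),      # Gender visible on the adjective
    ("table", "rouge", "Fem[ctxt]"),  # Gender deduced from the noun
]
print(marked_agreement(pairs))  # [('table', 'blanche')]
```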

nschneid commented 10 months ago

This is a very interesting idea that would start moving UD from a theory of what the features and values of a language are, toward why they occur/how they are signaled (the form of a lexeme, agreement, context, etc.).

What gives me some pause is the interaction between the categories of [infl], [lex], [denom], etc. across different features (and possibly parts of speech). Taking the English VERBs example, Tense is sometimes [infl] and sometimes [denom] depending on the VerbForm. What sorts of dependencies would there be between features? Would there need to be a rule system (or state machine) to compute which are in which category?

amir-zeldes commented 10 months ago

I agree it's an interesting view on the inventory of each language, though I would point out it depends on a variety of annotation choices. For example, the Gender of "she" is only lexical if we lemmatize "she" separately. We could have decided that all pronouns have a single lemma, and "she" is the inflected feminine form of that lemma - in English that might seem bizarre, but in other languages the boundary between inflection and derivation (which I take to fall under 'lexical' in this typology) can be murky. In English, for example, we could argue about whether the morphological comparative is lexical or inflectional, given that not all English adjectives can add -er.

In any case, I think this is more a topic for an analytic paper than an annotation concern, since the classification into [infl], [lex] etc. would presumably happen at the language level, not at the level of individual treebank tokens, right? Of course we could also use these terms in justifying certain guidelines.

sylvainkahane commented 10 months ago

Thanks for your feedback. @nschneid You're right that the feature Tense is used in English both as an inflectional and a denominative feature. Two solutions are possible (which can both be implemented):

  1. Put the information on each occurrence:

walked VERB, VerbForm[infl]=Fin[ctxt], Tense[infl]=Past, Number[infl]=Sing[ctxt], Person[infl]=3[ctxt]
walked VERB, VerbForm[infl]=Part[ctxt], Tense[denom]=Past, Aspect[infl]=Imp

  2. Have a general description:

    upos=VERB, VerbForm=Fin => Tense[infl]
    upos=VERB, VerbForm=Part => Tense[denom]
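The second option amounts to a small rule table, which could be sketched in Python as follows. This is only an illustration under assumptions: the `RULES` layout and the `feature_status` helper are hypothetical, and only the two rules given above come from this thread.

```python
# Hypothetical rule table: (conditions on the token, feature name, status).
RULES = [
    ({"upos": "VERB", "VerbForm": "Fin"}, "Tense", "infl"),
    ({"upos": "VERB", "VerbForm": "Part"}, "Tense", "denom"),
]


def feature_status(upos, feats, feature, rules=RULES):
    """Return the status of `feature` for a token, or None if no rule applies.

    `feats` is a plain dict of UD features, e.g. {"VerbForm": "Fin"}.
    The first rule whose conditions all hold wins.
    """
    token = {"upos": upos, **feats}
    for conditions, name, status in rules:
        if name == feature and all(token.get(k) == v for k, v in conditions.items()):
            return status
    return None


# The two readings of "walked":
print(feature_status("VERB", {"VerbForm": "Fin", "Tense": "Past"}, "Tense"))   # infl
print(feature_status("VERB", {"VerbForm": "Part", "Tense": "Past"}, "Tense"))  # denom
```

A flat, ordered list of rules is enough here; a richer rule system would only be needed if the status of one feature depended on the status of another.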

@amir-zeldes You are right that there are cases where it is unclear whether a feature is lexical or inflectional. But there are always cases in corpus annotation where we must make a choice and the choice is not straightforward. It is clear that the case of pronouns is one of the most problematic. In some sense we already decide whether we consider the features to be lexical or inflectional when we choose the lemma form. For instance, how should we interpret the fact that her has Case=Acc, lemma=she when it is an object pronoun and Case=Gen, lemma=her when it is a possessive determiner in GUM (https://universal.grew.fr/?custom=654652797cea8)? I suppose it means that you consider Case[infl]=Acc for PRON and Case[lex]=Gen for DET. It could be made explicit.

nschneid commented 10 months ago

> I suppose it means that you consider Case[infl]=Acc for PRON and Case[lex]=Gen for DET.

These are all PRON in UD, not DET. You're right that the lemma and case of her depend on whether it is possessive or not. Here is the full paradigm: https://universaldependencies.org/en/pos/PRON.html

A current difference between GUM and EWT is that GUM applies Number=Sing or Number=Plur to you depending on context. EWT leaves number unspecified.

sylvainkahane commented 10 months ago

I continue with the example of pronouns in English. My purpose is not to discuss the choices made in these treebanks, but just to take this as an example about the status of features. (By the way, we had exactly the same problems for the annotation of pronouns in French and I don't think that our annotation is optimal.) We have:

we      lemma=we, Case=Nom
us      lemma=we, Case=Acc
our     lemma=our, Case=Gen, Poss=Yes
ours    lemma=our, Poss=Yes

By definition, the lemma is the conventional name chosen for a lexeme. If I analyze this annotation, it means that we and us are the two forms of the lexeme we, and Case is then an inflectional feature. On the other hand, our and ours are the two forms of the lexeme our. Only one of them has a Case feature, which I suppose must be interpreted as a lexical feature, as must the feature Poss=Yes. I think that a better consideration of the status of the feature Case could have led to a different analysis, maybe considering the four forms as inflected forms of the same lexeme.

sylvainkahane commented 10 months ago

I would like to emphasize that the main problem concerning the status of features is the use of some features as denominations (such as Tense for participles). If we keep only one thing from this thread, it must be the clear distinction between denominative features and linguistic features.

nschneid commented 10 months ago

> I think that a better consideration of the status of the feature Case could have led to a different analysis, maybe considering the four forms as inflected forms of the same lexeme.

We had a long discussion when creating this table. I agree it may not be perfect from a theoretical perspective. One practical factor was preexisting lemmatization standards: as I recall, the established practice was to lemmatize possessives separately from non-possessives. Another factor was that there wasn't an obvious feature available to distinguish independent vs. dependent possessives (some sources call them both "genitives", but we decided to call them both Poss=Yes and distinguish them by using Case=Gen only for the dependent ones).

amir-zeldes commented 10 months ago

Lemmatization is yet another, somewhat separate topic, which also has to do with lexicographic standards in a language, etymology, and more. In most Indo-European languages, the de-adjectival possessives (Lat. meus "my") are distinguished from the true pronominal genitives (Lat. mei "of me", genitive < lemma ego). But @nschneid is right in saying that this is perhaps more of a standardization question, and indeed, many corpora much larger than the UD ones are lemmatized, and breaking with their tradition would carry a high price in terms of interoperability of linked open data resources. Personally I'm happy to keep things stable and would sooner change "my" to not be Case=Gen than change the lemma (and historically, it is in fact not Case=Gen, though "its/his/her" is).

Stormur commented 9 months ago

> I would like to emphasize that the main problem concerning the status of features is the use of some features as denominations (such as Tense for participles). If we keep only one thing from this thread, it must be the clear distinction between denominative features and linguistic features.

Features must not be used denominatively because this is not what they are meant for. There can be other spaces for that, for example MISC. Assigning Tense=Pres to (English) participles because they are called "present participles" is just a logical short-circuit (or if I have to be terser, plain wrong).


With regard to the distinction between lexical and inflectional features, I have also mused about marking this distinction, but in the end I think that this "tag polysemy" can probably be maintained: the important thing is that the feature corresponds to some morphological property, e.g. in English Number=Plur ~ -s etc. Then lexicality or inflectionality depends on the distribution across forms: for example, we will see that NOUNs are much less variable in Gender than ADJs, but also that there might be some correlation with given affix-series.

If a feature can be determined only purely contextually, then I advocate for not annotating it: it simply is not there morphologically.

Then the case of English -ing forms appears to me as one of random coincidence: the phonological material is the same, but we can actually distinguish the nominal (annotating is nice) from the adjectival (the annotating person) forms. This issue admittedly becomes a little tricky in that it approaches contextual annotation. Another case is Latin cum: the ADP and the SCONJ have actually different etymological histories, and they do behave in a clearly distinguishable way.