Open sylvainkahane opened 1 year ago
This is a very interesting idea that would start moving UD from a theory of what the features and values of a language are, toward why they occur/how they are signaled (the form of a lexeme, agreement, context, etc.).
What gives me some pause is the interaction between the categories of [infl], [lex], [denom], etc. across different features (and possibly parts of speech). Taking the English VERBs example, Tense is sometimes [infl] and sometimes [denom] depending on the VerbForm. What sorts of dependencies would there be between features? Would there need to be a rule system (or state machine) to compute which are in which category?
I agree it's an interesting view on the inventory of each language, though I would point out it depends on a variety of annotation choices. For example, the Gender of "she" is only lexical if we lemmatize "she" separately. We could have decided that all pronouns have a single lemma, and "she" is the inflected feminine form of that lemma - in English that might seem bizarre, but in other languages the boundary between inflection and derivation (which I take to fall under 'lexical' in this typology) can be murky. In English, for example, we could argue about whether the morphological comparative is lexical or inflectional, given that not all English adjectives can add -er.
In any case, I think this is more a topic for an analytic paper than an annotation concern, since the classification into [infl], [lex] etc. would presumably happen at the language level, not at the level of individual treebank tokens, right? Of course we could also use these terms in justifying certain guidelines.
Thanks for your feedback.
@nschneid You're right that the feature Tense is used in English both as an inflectional and a denominative feature. Two solutions are possible (which can both be implemented):
walked VERB, VerbForm[infl]=Fin[ctxt], Tense[infl]=Past, Number[infl]=Sing[ctxt], Person[infl]=3[ctxt]
walked VERB, VerbForm[infl]=Part[ctxt], Tense[denom]=Past, Aspect[infl]=Imp

upos=VERB, VerbForm=Fin => Tense[infl]
upos=VERB, VerbForm=Part => Tense[denom]
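The second solution, declaring rules that derive a feature's status from the other annotations, could look roughly like the sketch below. It is only an illustration: the `RULES` table and the `feature_status` helper are hypothetical names, not part of any UD tool, and only the two English rules given above are encoded.

```python
from typing import Optional

# Hypothetical rule table: (conditions on upos and plain UD features,
# target feature, status of that feature). Only the two rules above are encoded.
RULES = [
    ({"upos": "VERB", "VerbForm": "Fin"}, "Tense", "infl"),
    ({"upos": "VERB", "VerbForm": "Part"}, "Tense", "denom"),
]


def feature_status(upos: str, feats: dict, feature: str) -> Optional[str]:
    """Return the status of `feature` for this token, or None if no rule applies."""
    token = {"upos": upos, **feats}
    for conditions, target, status in RULES:
        if target == feature and all(token.get(k) == v for k, v in conditions.items()):
            return status
    return None


finite = feature_status("VERB", {"VerbForm": "Fin", "Tense": "Past"}, "Tense")
participle = feature_status("VERB", {"VerbForm": "Part", "Tense": "Past"}, "Tense")
print(finite, participle)  # infl denom
```

With a table like this, the status never has to be written on individual tokens; it is looked up from the existing annotation.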
@amir-zeldes You are right that there are cases where it is unclear whether a feature is lexical or inflectional. But there are always cases in corpus annotation where we must make a choice and the choice is not straightforward. It is clear that the case of pronouns is one of the most problematic. In some sense we already decide whether we consider the features to be lexical or inflectional when we choose the lemma form. For instance, how should we interpret the fact that her is annotated as Case=Acc, lemma=she when it is an object pronoun and Case=Gen, lemma=her when it is a possessive determiner in GUM (https://universal.grew.fr/?custom=654652797cea8)? I suppose it means that you consider Case[infl]=Acc for PRON and Case[lex]=Gen for DET. It could be made explicit.
> I suppose it means that you consider Case[infl]=Acc for PRON and Case[lex]=Gen for DET.
These are all PRON in UD, not DET. You're right that the lemma and case of her depend on whether it is possessive or not. Here is the full paradigm: https://universaldependencies.org/en/pos/PRON.html
A current difference between GUM and EWT is that GUM applies Number=Sing or Number=Plur to you depending on context. EWT leaves number unspecified.
I continue with the example of pronouns in English. My purpose is not to discuss the choices made in these treebanks, but just to take this as an example about the status of features. (By the way, we had exactly the same problems for the annotation of pronouns in French and I don't think that our annotation is optimal.) We have:
we lemma=we, Case=Nom
us lemma=we, Case=Acc
our lemma=our, Case=Gen, Poss=Yes
ours lemma=our, Poss=Yes
By definition, the lemma is the conventional name chosen for a lexeme. If I analyze this annotation, it means that we and us are the two forms of the lexeme we, and Case is then an inflectional feature. On the other hand, our and ours are the two forms of the lexeme our. Only one of them has a feature Case, which I suppose must be interpreted as a lexical feature, as well as the feature Poss=Yes. I think that a better consideration of the status of the feature Case could have led to a different analysis, maybe considering the four forms as inflected forms of the same lexeme.
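The reasoning above (read the lemma as the name of the lexeme, then look at how a feature behaves across the forms of each lexeme) can be sketched as a small check. This is only a heuristic illustration using the four forms quoted above; `values_per_lemma` is a hypothetical helper, not an existing UD utility.

```python
from collections import defaultdict

# The four forms and annotations quoted above: (form, lemma, features).
FORMS = [
    ("we", "we", {"Case": "Nom"}),
    ("us", "we", {"Case": "Acc"}),
    ("our", "our", {"Case": "Gen", "Poss": "Yes"}),
    ("ours", "our", {"Poss": "Yes"}),
]


def values_per_lemma(forms, feature):
    """Map each lemma to the set of values `feature` takes across its forms.

    None in a set records a form on which the feature is absent.
    """
    seen = defaultdict(set)
    for _form, lemma, feats in forms:
        seen[lemma].add(feats.get(feature))
    return dict(seen)


case_values = values_per_lemma(FORMS, "Case")
poss_values = values_per_lemma(FORMS, "Poss")
print(case_values)
print(poss_values)
```

Case alternates between Nom and Acc within the lexeme we, which fits the inflectional reading, whereas within our it is present on one form only and Poss is constant, which fits the lexical reading.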
I would like to emphasize that the main problem concerning the status of features is the use of some features as denominations (such as Tense for participles). If we keep only one thing from this thread, it must be the clear distinction between denominative features and linguistic features.
> I think that a better consideration of the status of the feature Case could have led to a different analysis, maybe considering the four forms as inflected forms of the same lexeme.
We had a long discussion when creating this table. I agree it may not be perfect from a theoretical perspective. One practical factor was preexisting lemmatization standards: as I recall, the established practice was to lemmatize possessives separately from non-possessives. Another factor was that there wasn't an obvious feature available to distinguish independent vs. dependent possessives (some sources call them both "genitives", but we decided to call them both Poss=Yes and distinguish them by using Case=Gen only for the dependent ones).
Lemmatization is yet another, somewhat separate topic, which also has to do with lexicographic standards in a language, etymology, and more. In most Indo-European languages, the de-adjectival possessives (Lat. meus "my") are distinguished from the true pronominal genitives (Lat. mei "of me", genitive < lemma ego). But @nschneid is right in saying that this is perhaps more of a standardization question, and indeed, many corpora much larger than the UD ones are lemmatized, and breaking with their tradition would come at a high price in terms of interoperability of linked open data resources. Personally I'm happy to keep things stable and would sooner change "my" to not be Case=Gen than change the lemma (and historically, it is in fact not Case=Gen, though "its/his/her" is).
> I would like to emphasize that the main problem concerning the status of features is the use of some features as denominations (such as Tense for participles). If we keep only one thing from this thread, it must be the clear distinction between denominative features and linguistic features.
Features must not be used denominatively because this is not what they are meant for. There can be other spaces for that, for example MISC. Assigning Tense=Pres to (English) participles because they are called "present participles" is just a logical short-circuit (or, if I have to be terser, plain wrong).
With regard to the distinction between lexical and inflectional features, I have also mused about marking this distinction, but in the end I think that this "tag polysemy" can probably be maintained: the important thing is that the feature corresponds to some morphological property, e.g. in English Number=Plur ~ -s etc. Then lexicality or inflectionality depends on the distribution across forms: for example, we will see that NOUNs are much less variable in Gender than ADJs, but also that there might be some correlation with given affix series.
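As a rough illustration of reading lexicality off the distribution across forms, one could measure, per part of speech, the share of lemmas attested with more than one Gender value. The sketch below uses a handful of toy French observations invented for the example (not treebank data), and `variable_lemma_share` is a hypothetical helper.

```python
from collections import defaultdict

# Toy (upos, lemma, Gender) observations, invented for illustration only.
TOKENS = [
    ("NOUN", "maison", "Fem"),
    ("NOUN", "maison", "Fem"),
    ("NOUN", "jardin", "Masc"),
    ("ADJ", "petit", "Masc"),
    ("ADJ", "petit", "Fem"),
    ("ADJ", "grand", "Fem"),
    ("ADJ", "grand", "Masc"),
]


def variable_lemma_share(tokens):
    """For each UPOS, the share of lemmas attested with more than one value."""
    values = defaultdict(set)
    for upos, lemma, value in tokens:
        values[(upos, lemma)].add(value)
    total = defaultdict(int)
    variable = defaultdict(int)
    for (upos, _lemma), vals in values.items():
        total[upos] += 1
        variable[upos] += len(vals) > 1  # True counts as 1, False as 0
    return {upos: variable[upos] / total[upos] for upos in total}


shares = variable_lemma_share(TOKENS)
print(shares)  # {'NOUN': 0.0, 'ADJ': 1.0}
```

On real data the shares would of course fall somewhere between these extremes, and invariable adjectives or nouns with two genders would blur the picture.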
If a feature can be determined only purely contextually, then I advocate for not annotating it: it simply is not there morphologically.
Then the case of English -ing forms appears to me as one of random coincidence: the phonological material is the same, but we can actually distinguish the nominal (annotating is nice) from the adjectival (the annotating person) forms. This issue admittedly becomes a little tricky in that it approaches contextual annotation. Another case is Latin cum: the ADP and the SCONJ actually have different etymological histories, and they do behave in clearly distinguishable ways.
This is a follow-up to previous discussions we had about deciding whether a feature must be instantiated or not (the last one concerning the Voice feature in English, see #290).

First, some features are associated with inflectional morphemes, while others are lexical features. Examples of lexical features are Gender, Number and Person on pronouns in English, while inflectional features are Number and Person on the verb agreeing with its subject:

Another example is the Gender agreement of the adjective and articles with the noun in French.
Gender is a lexical feature of NOUNs (Gender[lex]), while Gender is an inflectional feature on ADJs or DETs (Gender[infl]):

Note that Definiteness is a lexical feature, while Number is an inflectional feature on NOUNs, ADJs and articles (it is a lexical feature on most other DETs). PronType is an inherently lexical feature.
But there is a third use of morphosyntactic features: the denominative use. For instance, English has two participles, the so-called present and past participles. The English treebanks use the features Tense=Pres and Tense=Past to distinguish the two participles. This is quite problematic because the values of these participles are more aspectual than temporal:

Second, in some cases, an inflectional feature is not instantiated on a given lexeme. For instance, French has many ADJs that do not show variation in Gender, such as rouge ‘red’, facile ‘easy’, etc. Nevertheless, the value can generally be deduced from the context. For the French treebanks, we have thus instantiated the Gender feature each time its value could be deduced from the context. This could be indicated on the value:

English treebanks contain a lot of contextual values (due to the very poor inflectional morphology of English). For instance, every -ing verbal form can be VerbForm=Part or VerbForm=Ger. This can only be deduced from the context: no English verb has different forms for the present participle and the gerund. "Only-contextual" features could be distinguished:

Note that VerbForm=Part is just [ctxt] because the value is marked for past participles of some verbs (those distinguishing the past participle from the preterite). For past participles of transitive verbs, we have an opposition between imperfect forms (she has driven the car) and passive forms (the car was driven):

The bare form of the verb is also ambiguous and can only be disambiguated contextually:
Note that the value VerbForm=Inf is only contextual [only-ctxt], but not the value VerbForm=Fin, since finiteness is marked for the 3SG present form, as well as for the past form of some verbs.

This means that we can distinguish features for which some values can be marked (VerbForm, Tense, etc.) and features for which all values are contextual (Voice).

Of course it would be too costly to add the status of features and values to each occurrence, but it would be useful for people exploiting a treebank to know the status of features and values. We could ask in the guidelines associated with the validator whether a feature is inflectional [infl], lexical [lex] or denominative [denom], and maybe also whether the feature has values which are only contextual [only-ctxt]. Such information would be very useful for linguists exploiting the treebanks. If we currently study noun-adjective Gender agreement in French, it would be difficult with only the treebank to know when this agreement is really effective. The same holds for verb-subject Person agreement in English. And if we study Tense in English (without any knowledge of the language), we would get strange results due to the Tense feature on participles.
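To make the last point concrete, here is a minimal sketch of how a user could exploit such status information when studying Tense. It assumes the bracket notation used in this thread (e.g. Tense[denom]=Past, VerbForm[infl]=Fin[ctxt]); the regular expression and the helpers `parse_feats` and `linguistic_value` are hypothetical and not part of the CoNLL-U format or of any UD tool.

```python
import re

# Assumed shape of one item, e.g. "VerbForm[infl]=Fin[ctxt]" or "Tense[denom]=Past".
ITEM = re.compile(
    r"(?P<name>\w+)(?:\[(?P<status>[\w-]+)\])?=(?P<value>\w+)(?:\[(?P<vstatus>[\w-]+)\])?"
)


def parse_feats(text):
    """Parse a comma-separated feature string into {name: (status, value, value_status)}."""
    feats = {}
    for item in text.split(","):
        match = ITEM.fullmatch(item.strip())
        if match is None:
            raise ValueError(f"unrecognized feature item: {item!r}")
        feats[match["name"]] = (match["status"], match["value"], match["vstatus"])
    return feats


def linguistic_value(feats, name):
    """Value of `name`, or None when it is absent or only denominative."""
    status, value, _value_status = feats.get(name, (None, None, None))
    return None if status == "denom" else value


finite = parse_feats("VerbForm[infl]=Fin[ctxt], Tense[infl]=Past")
participle = parse_feats("VerbForm[infl]=Part[ctxt], Tense[denom]=Past")
print(linguistic_value(finite, "Tense"), linguistic_value(participle, "Tense"))  # Past None
```

A study of Tense that goes through a filter like `linguistic_value` would count the finite walked as past but skip the participle, whose Tense value is only a name.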