UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

syntax- vs morphology-based feature assignment #791

Closed TalhaBedir closed 1 year ago

TalhaBedir commented 3 years ago

In Turkish, the accusative suffix -(y)I may be left out to give a non-specific, indefinite reading:

(1) Pizza-yı ye-di-m.
    pizza-ACC eat-PAST-1s
   'I ate the pizza.'

(2) Pizza ye-di-m.
    pizza eat-PAST-1s
   'I ate a pizza.'

In the Turkish case paradigm, the suffixless noun is nominative. This means one would morphologically mark pizza 'pizza' in (2) nominative, not accusative. This is the case in the treebanks I looked up. Syntactically, however, it is accusative beyond any doubt.

Another similar decision can be made with plural subjects when their predicates morphologically seem singular. This is very common in Turkish, especially when the subject is non-human.

(3) Çiçek-ler güzel kok-uyor.
    flower-PL good smell-IMPERF
   'Flowers smell good.'

A plural predicate is kokuyorlar, with the plural suffix, instead of kokuyor. In this case, so-called syntax-based feature assignment requires Number=Plur, but morphology-based feature assignment requires Number=Sing.

In ellipsis and suspended affixation, the problem is even more severe:

(4) Ali ve Ayşe-ye anlat-tı-m.
    Ali and Ayşe-DAT tell-PAST-1s
   'I told this to Ali and Ayşe.'

The dative suffix -(y)A obviously scopes over the whole coordinated NP Ali ve Ayşe, rather than just Ayşe. In our current annotation, which falls under morphology-based feature assignment, Ali is assigned Case=Nom. This is wrong in my opinion: Ali should either be caseless or carry the same case as Ayşe.

Some of these cases probably require an issue on their own but I thought they all share a simple theoretical decision, therefore I included a bunch (not all) of them here.

sylvainkahane commented 3 years ago

Interesting problem. In all the examples you give, the feature is unmarked (zero signifier). One solution would be to add an unmarked value rather than Sing or Nom for these cases. (But to maintain Nom or Sing in other cases.) Do you have arguments to reject this solution?

TalhaBedir commented 3 years ago

@sylvainkahane That could be one of the better solutions, since it is what I call syntax-based. I prefer syntax-based to the other.

Could you clarify "unmarked", though? If we take the Case=Value feature for example, do you mean not writing any case at all, or writing something like Case=Unmarked, Case=Elided, Case=_ or maybe Case=∅?

Stormur commented 3 years ago

Incidentally, I was also thinking about this kind of issue recently. Something similar was probably discussed in #780.

In my opinion, the best thing would be to mark things as they are: i.e. the bare nominal in Turkish (and in other languages that function similarly in this regard, such as Mongolian) should be marked neither for Number nor for Case, because it is neither of them. It is simply unspecified with respect to these grammatical categories. Of course, the suffix -lAr would trigger Number=Plur; but the possible "syntactic number/case" might as well emerge just from relations like nsubj or obj, or from features on the predicate, and so on. And if it is not marked anywhere in the sentence, this means that it simply is not expressed and should not be annotated; otherwise we would be overinterpreting.

> Syntactically, however, it is accusative beyond any doubt.

This is the crucial point: the fact of being "accusative" and of fulfilling the core relation that we call "object" in the clause are distinct facts. Morphological features register what can be observed with regard to the form of the word: if a case marker is absent and, as is the case here, we cannot talk of a zero-suffix in a paradigm, there simply is no case feature.

Now, the question arises whether an "absent" Number, Case, etc. feature is better annotated with a "negative" value rather than completely omitted. Another, maybe more conservative strategy might be to list all possible values (for example Case=Acc,Nom), which means "either... or...", but does not select one specifically... but the very fact that a grammatical category is unspecified might make it difficult to find the right "list".

Stormur commented 3 years ago

Another related issue, hoping not to stray too far, is whether the -(y)I suffix in Turkish actually is a real accusative suffix, or rather a mark of definiteness. In this interpretation (towards which I am leaning), the Turkish case system does not have a systematic case marking like nominative/accusative for nsubj/obj in Latin, but it may (optionally?) select definiteness for an obj. And in the end, the misleading identification of an "accusative" would have happened under the influence of grammars of Western languages.

If I am not mistaken, something reminiscent of this happens in Finnish, where there is no "accusative", but the partitive is used both with subjects and objects according to specific rules.

dan-zeman commented 3 years ago

The features in UD are generally described as a part of morphological annotation and I find it natural to favor morphological criteria over syntactic. However, this is not a strict requirement, and examples could be found where a feature is partially or completely driven by other criteria, such as syntax or semantics.

What you call “syntactic accusative” is already recognizable by the obj relation, and it may be useful/interesting to be able to find out that in Turkish, a direct object is sometimes realized as a morphological nominative rather than accusative. I don't like the option of leaving case unspecified; in languages that have morphological cases, I expect every form of a noun to have a case label.
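As a rough illustration of the kind of search described here, the following Python sketch looks for tokens that are attached as obj but carry Case=Nom. The toy CoNLL-U fragment, its feature values, and the helper names `parse_feats` and `nominative_objects` are assumptions made up for this example; they are not taken from any Turkish treebank.

```python
# Toy CoNLL-U fragment for "Pizza yedim." (illustrative annotation only).
CONLLU = (
    "1\tPizza\tpizza\tNOUN\t_\tCase=Nom|Number=Sing\t2\tobj\t_\t_\n"
    "2\tyedim\tye\tVERB\t_\tNumber=Sing|Person=1|Tense=Past\t0\troot\t_\t_\n"
)


def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def nominative_objects(conllu):
    """Return the forms of tokens attached as obj whose FEATS contain Case=Nom."""
    hits = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        form, feats, deprel = cols[1], parse_feats(cols[5]), cols[7]
        # Compare only the universal part of the relation (before any ':').
        if deprel.split(":")[0] == "obj" and feats.get("Case") == "Nom":
            hits.append(form)
    return hits


print(nominative_objects(CONLLU))  # prints ['Pizza']
```

Because the syntactic role lives in DEPREL and the morphological case in FEATS, the two can be combined in a query like this without duplicating either piece of information.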

However, it is of course possible that two particular positions in the paradigm have the same surface form. So I can imagine that one would say that pizza is either Case=Nom or Case=Acc|Definite=Ind (these two would be disambiguated by context), while pizzayı is always Case=Acc|Definite=Def.

In any case, if the current approach in Turkish is modified, please make sure that

  1. A consensus is reached among all the teams responsible for individual Turkish treebanks
  2. The new approach is consistently applied in all Turkish treebanks in UD
  3. The approach is properly documented either on the Turkish index page or at relevant places linked from there.

Stormur commented 3 years ago

Actually, in the end I agree with using Case=Nom, too. I was probably (again) misled by the name, but if we say that "nominative" in a sense represents the "basic" form of a nominal, then it is justified.

Still, I am left wondering if there is room for a distinction between marked nominatives (as in Latin, Georgian,...) and unmarked ones (Turkish, Mongolian, English?,...). The latter would then be the unspecified case (it has been called casus indefinitus in literature sometimes). Another related question is: in languages with no paradigmatic variation of cases (such as Italian), do we still want to have Case=Nom, so as to have a parallel with other languages? From this perspective, it is not so weird to assert that, yes, Italian only uses nominative = basic forms in all syntactic contexts.

For the marking of objs, then, I would favour only a morphological, form-based binary labelling as either Case=Nom (bare form) or Case=Acc|Definite=Def (-(y)I), as proposed. To insert a "syntactically motivated" bare form with Case=Acc would go too much into pragmatic/semantic annotation and risk being too arbitrary.

dan-zeman commented 3 years ago

> in languages with no paradigmatic variation of cases (such as Italian), do we still want to have Case=Nom, so as to have a parallel with other languages?

No. If there is no variation, there is no need to have the feature. But Italian might use Case=Nom vs. Case=Acc or Case=Dat for pronouns, no? (Just guessing based on what I know about Spanish.)

Also, some languages that have case variation will have Case=Abs as the unmarked form.

Stormur commented 3 years ago

Ah, yes, of course, I'm always forgetting pronouns and was thinking only of nouns/adjectives. For PRONs, this is surely needed in Italian, and it would be those three. As in Spanish and others, pronouns are the only part of the language that has retained cases (and in a sense, probably also the neuter gender, even if subsumed under the feminine).

If I am not mistaken, Case=Abs is for ergative phenomena, right? But if we can think that some languages, as Turkish, can use "nominative" in an obj function, has it ever been discussed if Abs (or Erg) and Nom could be unified under one label, which then would just see a different syntactical distribution?


I understand the logic of not needing the feature, but am also thinking, from an operational point of view, of a multilingual search in which one would have to specify something like "no Case, or Case=Abs|Nom". The fact of often having to specify also negative information, with the risk of forgetting something, makes me wonder about the usefulness of a maybe redundant, but more homogeneous labeling.
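To make the operational concern concrete, here is a minimal Python sketch of such a multilingual query, "no Case, or Case=Abs|Nom". The function names and the choice of which values count as "basic" are assumptions for illustration, not an agreed UD convention.

```python
def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def is_basic_form(feats, basic_cases=("Nom", "Abs")):
    """True if the token has no Case feature, or if at least one of its
    Case values is listed in basic_cases."""
    case = parse_feats(feats).get("Case")
    if case is None:
        # The "negative" half of the query: no Case feature at all.
        return True
    # A feature may carry several comma-separated values, e.g. Case=Acc,Nom.
    return any(value in basic_cases for value in case.split(","))


for feats in ["_", "Number=Sing", "Case=Nom|Number=Sing", "Case=Abs", "Case=Acc|Definite=Def"]:
    print(feats, is_basic_form(feats))
```

The query has to spell out both the absence of the feature and every value treated as unmarked, which is exactly the bookkeeping that a more homogeneous labelling would avoid.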

dan-zeman commented 3 years ago

> If I am not mistaken, Case=Abs is for ergative phenomena, right? But if we can think that some languages, as Turkish, can use "nominative" in an obj function, has it ever been discussed if Abs (or Erg) and Nom could be unified under one label, which then would just see a different syntactical distribution?

It hasn't, as far as I know. I suppose this is one of the points where UD stays close to traditional terminology in the hope that it will be better understood by the general crowd. While obviously there are other points where it departs from the tradition quite substantially :-}

sylvainkahane commented 3 years ago

I also agree to use Case=Nom for the zero case. But I just want to make a distinction between zero morphemes and unmarking. Case=Nom would mean that we consider that there is a case marker, that is, that the absence of any other case marker is meaningful. It must not be confused with the situation of languages where the cases tend to disappear (as in spoken Korean today and as in Latin during the switch to Romance languages). In this latter situation, I think we should avoid a Case feature or have a special value, like Case=Absent.

rueter commented 1 year ago

In Uralic language studies there are different schools of thought regarding the use of Case=Nom vs Case=Acc for annotating the direct object when no identifying morphology is present. Research in the former Soviet Union seems to intermingle morphology with syntax -- this applies to krl, olo, kpv, koi. Shouldn't we be trying to limit ourselves to column 6 = morphological features and column 8 = dependency types? Or is there a reason to repeat dependency information in the features and feature information in the deps? I am guessing that Turkish may also have postpositions with a nominative form as their complement. How are they dealt with? I would call them Case=Nom, whereas the dep pointing to the adp would be case.

dan-zeman commented 1 year ago

> Research in the former Soviet Union seems to intermingle morphology with syntax -- this applies to krl, olo, kpv, koi.

What does it mean? Would they distinguish Case=Nom and Case=Acc for one word form depending on how it is used? And does it mean that the language never has different forms for "nominatives" and "accusatives", or is it just case syncretism for some lexemes while the cases are distinguished for other lexemes?

rueter commented 1 year ago

Yes, they distinguish Case=Nom and Case=Acc for one word form depending on how it is used, sorry. In Karelian and Livvi the system works much the same as in Finnish and Estonian, i.e., in imperative predication the complete (not partitive) direct object noun (not personal pronoun) appears in the nominative form, elsewhere this same function in the singular is indicated by a genitive form, e.g. kala Case=Nom but kalan Case=Gen 'fish'. In the plural, kalat Case=Nom is used in both imperative and non-imperative. (The Finnish take on the situation) Grammars written in Karelia, however, say that both the bare form in kala and genitive in kalan are also Case=Acc. Likewise the plural form kalat is both Case=Nom and Case=Acc. Accusative is used when the syntactic function is direct object.

@nikopartanen Komi-Zyrian and Komi-Permyak follow a slightly different scheme in which the direct object noun might be marked by one of three possessive suffix +accusative formatives, -ӧс (1 Sg Acc), -тӧ (2 Sg Acc), -сӧ (3 Sg Acc), OR by zero, which equals the nominative, in both singular and plural. The form is typically used as the object marker, even when no possession is involved. The strategy entails hierarchies of animacy and identifiability, such that humans are rarely without a formative, whereas inanimates rarely take a formative unless it indicates possession. Once again the bare form (a.k.a. nominative when nsubj) for 'fish', чери, is called Case=Acc when it is a direct object, and the 1sg possessive suffix accusative form чериӧс is also called Case=Acc. Two readings are given for each form, separated by commas:

чери => Case=Nom, Case=Acc
чериӧс => Case=Acc|Number[pos]=1|Person[pos]=Sing, Case=Acc

dan-zeman commented 1 year ago

> (The Finnish take on the situation) Grammars written in Karelia, however ...

I think the "Finnish approach" is the one that should be used in UD, and if our Karelian and Livvi treebanks use the "Karelian approach", it would be good to fix them.
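One possible way to look for the "Karelian approach" in a treebank is to list word forms that are annotated Case=Nom in some places and Case=Acc in others. The Python sketch below assumes tokens have already been read into (form, FEATS, DEPREL) triples; the toy data and the function names are invented for illustration and do not reflect the actual content of the Karelian or Livvi treebanks.

```python
from collections import defaultdict

# Toy (form, FEATS, DEPREL) triples; illustrative annotations only.
TOKENS = [
    ("kala", "Case=Nom|Number=Sing", "nsubj"),
    ("kala", "Case=Acc|Number=Sing", "obj"),
    ("kalan", "Case=Gen|Number=Sing", "obj"),
]


def case_of(feats):
    """Return the value of the Case feature in a FEATS string, or None."""
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        if name == "Case":
            return value
    return None


def forms_with_nom_and_acc(tokens):
    """Return the lowercased forms annotated both as Case=Nom and as Case=Acc."""
    cases_by_form = defaultdict(set)
    for form, feats, _deprel in tokens:
        case = case_of(feats)
        if case is not None:
            cases_by_form[form.lower()].add(case)
    return sorted(
        form for form, cases in cases_by_form.items() if {"Nom", "Acc"} <= cases
    )


print(forms_with_nom_and_acc(TOKENS))  # prints ['kala']
```

The result is only a list of candidates for manual review: a form can also show up here because of genuine case syncretism, which this simple check cannot tell apart from syntax-driven labelling.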

> Komi-Zyrian and Komi-Permyak follow a slightly different scheme

Just to clarify: When you say "possessive suffix +accusative formatives", do you mean that 1. there are two morphemes, the first one is a possessive suffix not shown in the example, and the second one is the accusative formative -ӧс/-тӧ/-сӧ, OR do you mean that 2. there is just one morpheme -ӧс/-тӧ/-сӧ, which can be interpreted either as a possessive suffix or an accusative formative?

rueter commented 1 year ago

Hi, neither 1. nor 2. In Komi-Zyrian and Komi-Permyak by the possessive suffix +accusative formatives I meant:

Чериӧс пуи
fish I-cooked
obj(пуи, Чериӧс)

'I cooked the/that fish' OR 'I cooked my fish'

Черитӧ пуи
fish I-cooked
obj(пуи, Черитӧ)

'I cooked that fish [we were talking about]' OR 'I cooked your fish'

Черисӧ пуи
fish I-cooked
obj(пуи, Черисӧ)

'I cooked the/that fish [distinguishing it from other cookable items, perhaps]' OR 'I cooked his/her/its fish'

Чери пуи
fish I-cooked
obj(пуи, Чери)

'I was cooking fish [generic]' OR 'I cooked fish [generic]'

@nikopartanen please, say something if this analysis is wrong.

In the kpv and koi UD projects, we have chosen the UD-like approach where only distinct morphology is marked. In other words, we have used the following readings:

The zero with an object dependency is labeled Case=Nom

dan-zeman commented 1 year ago

> Number[pos]=1

I am assuming you mean Number[psor] (https://universaldependencies.org/ext-feat-index.html#numberpsor) rather than Number[pos], and same for Person.

The approach of adding both Case=Acc and the two possessive features for the -ӧс/-тӧ/-сӧ forms, regardless of whether the speaker actually meant to express possession, looks good to me. Likewise, I would annotate Чери as Case=Nom, regardless of whether it was used as subject, object, or something else.
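As a small sketch of what these annotations could look like in the FEATS column, the Python snippet below serializes the two readings. It assumes that the suffix -ӧс is first person singular (so Number[psor]=Sing and Person[psor]=1) and that features are written in case-insensitive alphabetical order, as is usual in CoNLL-U; the helper name `format_feats` is hypothetical.

```python
def format_feats(feats):
    """Serialize a feature dict as a FEATS string, sorted case-insensitively
    by feature name; an empty dict becomes '_'."""
    if not feats:
        return "_"
    names = sorted(feats, key=str.lower)
    return "|".join(f"{name}={feats[name]}" for name in names)


# Assumed reading of чериӧс: accusative plus first person singular possessor.
cherios = {"Person[psor]": "1", "Case": "Acc", "Number[psor]": "Sing"}
# Bare чери: Case=Nom regardless of its syntactic function.
cheri = {"Case": "Nom"}

print(format_feats(cherios))  # prints Case=Acc|Number[psor]=Sing|Person[psor]=1
print(format_feats(cheri))    # prints Case=Nom
```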

rueter commented 1 year ago

Yes, thanks @dan-zeman, that should be [psor], for both Number[psor] and Person[psor]. So I would think that Turkish might also use Case=Nom for instances of zero morphology marking on obj-dep tokens.

dan-zeman commented 1 year ago

> So I would think that Turkish might also use Case=Nom for instances of zero morphology marking on obj-dep tokens.

Yes, that would be my preference. In my opinion, omitting the Case feature would suggest that Turkish nouns (or maybe a subset thereof) do not inflect for case. But if they do, then I think that every form should be annotated with the Case it corresponds to.

Stormur commented 1 year ago

I was considering again this issue after these last inputs, and indeed I find myself leaning towards a solution like the one proposed by @sylvainkahane , for all the reasons already discussed at length:

> It must not be confused with the situation of languages where the cases tend to disappear (as in spoken Korean today and as in Latin during the switch to Romance languages). In this latter situation, I think we should avoid a Case feature or have a special value, like Case=Absent.

This point is critical, and I think that it follows straight from one of the UD principles, to "annotate only what is there". Simply put, some systems have unspecified forms, with some patterns that make certain categories like Number and Case explicit under some circumstances, while others (like Latin's standard nouns) admit no unspecified forms. And sometimes the same language might have subsystems (as seen for Italian's pronouns, the marginal class of Latin indeclinable nouns, etc.).

Maybe another label for the value could be BaseForm or Unspecified. But I fear that something like Number=Sing for an uninflected e.g. Turkish, or also English, noun might be very, very misleading in the end.


For example in English (and feel free to correct me if I am talking nonsense), we might argue that singular number is no longer expressed morphologically and is left to semantics: I cannot guess from forms like ball or rice alone that one accepts a plural inflection (balls) and the other does not if I do not know anything about the objects. But this is different from Italian where I know that palla 'ball', unerringly, is singular because the ending -a contrasting -e in palle 'balls' needs to be there, and for other word classes the same is expressed by other endings (e.g. lup-o 'wolf'). And in Italian, for this exact reason, it is probably much more viable than in English to inflect riso 'rice' in the plural risi (which would be usually interpreted as 'sorts of rice', or 'dishes of rice', and so on).

dan-zeman commented 1 year ago

> label for the value could be BaseForm or Unspecified

No, no, no :-) Since the very beginning of UD, we have been clear about not explicitly using "Unspecified" as a value of a feature; instead, the feature is omitted.

Stormur commented 1 year ago

> > label for the value could be BaseForm or Unspecified
>
> No, no, no :-) Since the very beginning of UD, we have been clear about not explicitly using "Unspecified" as a value of a feature; instead, the feature is omitted.

Thanks Dan for once again sobering me.

So, under this light, my preference would remain to not tag for Case under these circumstances. We are actually doing this already for indeclinables (InflClass=Ind) in Latin, for example, which are a kind of special subclass of nominals. Even if, thinking about it, InflClass=Ind is an unspecified value, so better get rid of it?

dan-zeman commented 1 year ago

> InflClass=Ind is an unspecified value, so better get rid of it?

Deciding whether one of the values should be considered unspecified is a more delicate issue, and often requires looking at the language-specific context. (As you say, should a form be caseless or should it be Case=Nom; should we have both values of Polarity, or only Neg? etc.) But once the value string is something like Unspecified, None or Null, it is quite obvious that it should not be there.

sylvainkahane commented 1 year ago

A non-linguistic remark: For the maintenance of a treebank, it is very useful to know that a feature traditionally associated with a given POS is not missing by inadvertence but has been deliberately omitted. It seems that the better way to do that is to add a feature. @dan-zeman Do we have a particular policy concerning such features?

dan-zeman commented 1 year ago

> A non-linguistic remark: For the maintenance of a treebank, it is very useful to know that a feature traditionally associated with a given POS is not missing by inadvertence but has been deliberately omitted. It seems that the better way to do that is to add a feature. @dan-zeman Do we have a particular policy concerning such features?

As I said above, the policy is not to add such features :-)

If desired/necessary for treebank maintenance, a note could be added in MISC. But I personally do not see a difference in principle from the situation where a feature value is inadvertently replaced with a wrong one (e.g., Case=Nom instead of Case=Acc). Annotation errors can still occur, and a search for one of the values will expose both the correct cases and the wrong ones.
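For treebank maintainers who do want such a note, a minimal Python sketch of adding it to the MISC column might look as follows. The attribute name `CaseOmitted` is purely hypothetical and is not an established UD convention; the function name `add_misc_note` is likewise made up for this example.

```python
def add_misc_note(misc, key, value):
    """Add key=value to a CoNLL-U MISC string, replacing any existing entry
    with the same key and keeping all other entries in their original order."""
    entries = [] if misc == "_" else misc.split("|")
    entries = [e for e in entries if e.split("=", 1)[0] != key]
    entries.append(f"{key}={value}")
    return "|".join(entries)


# "CaseOmitted" is a hypothetical, treebank-internal attribute name.
print(add_misc_note("_", "CaseOmitted", "Yes"))              # prints CaseOmitted=Yes
print(add_misc_note("SpaceAfter=No", "CaseOmitted", "Yes"))  # prints SpaceAfter=No|CaseOmitted=Yes
```

Keeping the note in MISC leaves the FEATS column free of placeholder values, in line with the policy of omitting unspecified features.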