UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

v2 conventions for copula #329

Closed nschneid closed 7 years ago

nschneid commented 8 years ago

The standing proposal for v2 concludes that the copula should serve as the head in special constructions like existentials (already in v1) and non-equational predications like "The book is on the table".

I would like to offer a counterproposal that addresses some arguable weaknesses of that policy:

  1. Annotators (and conversion rules) might have difficulty checking for these special uses of the copula.
  2. Some languages have null copula constructions. For example, in Hebrew, present tense copular sentences either have no overt verb, or substitute a third person pronoun to act as a copula. E.g., 'The boy is a student' would be hayeled student or hayeled hu student (lit. The-boy he student).
  3. Some languages have existential constructions which do NOT involve a copula, so the policy would not readily address such cases.

Instead, I think we should consider simply distinguishing copular subjects from non-copular subjects, and always keep the copula itself as a functional dependent. Then, the annotator/converter would just have to check whether the main verb is a copula (in English, a form of BE or possibly BECOME) and if so mark its subject as ncopsubj or ccopsubj. (The Hebrew treebank currently calls this nsubj:cop.) "The cup is a mug" and "The cup is on the table" would thus have the same parse, modulo the extra case relation for the preposition. It is up to downstream users to determine whether certain functions of the copula should receive different semantics.

I think this policy would ease the burden on annotators and avoid complications in languages where the copula can be covert or does not have the same range of functions as in English.
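
A minimal sketch of the converter check described above, assuming the copula is already attached as a `cop` dependent of the predicate. The tuple layout `(id, form, head, deprel)` and the function name are hypothetical, chosen only for illustration. `nsubj:cop` is the label the Hebrew treebank uses; `csubj:cop` is an assumed clausal counterpart. A null-copula sentence has no `cop` token to find, so this particular check would not catch it.

```python
# Hypothetical token layout: (id, form, head, deprel); head 0 marks the root.
def relabel_copular_subjects(tokens):
    """Return a copy of tokens where subjects of copular predicates get ':cop'."""
    # Any head that has a 'cop' dependent is treated as a copular predicate.
    copular_heads = {head for _id, _form, head, deprel in tokens if deprel == "cop"}
    relabeled = []
    for tok_id, form, head, deprel in tokens:
        if head in copular_heads and deprel in ("nsubj", "csubj"):
            deprel = deprel + ":cop"  # 'csubj:cop' is an assumed label
        relabeled.append((tok_id, form, head, deprel))
    return relabeled


# "The cup is on the table": 'table' is the head, 'is' is a functional dependent.
sentence = [
    (1, "The", 2, "det"),
    (2, "cup", 6, "nsubj"),
    (3, "is", 6, "cop"),
    (4, "on", 6, "case"),
    (5, "the", 6, "det"),
    (6, "table", 0, "root"),
]
relabeled = relabel_copular_subjects(sentence)
```

Only the subject's label changes here; every attachment stays the same, which is the point of keeping the copula as a dependent.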

amir-zeldes commented 8 years ago

Seconded. I think it's more robust to have the PP be the head of the predication, since in many languages the copula is optional. What's guaranteed to be there is the locative expression itself. In Coptic, for example, locative predication does not take a copula at all:

ti mmau - lit. 'I there'

ti hm-p-eei - lit. 'I in the house'

Making all languages mark locative predication on the locative phrase itself makes these languages much more comparable to the English situation: the only difference is that in English, you get the copula as an extra auxiliary, and even that is not 100% guaranteed, e.g. in examples like:

Ali G. in da house

If the copula were the head, the above is an exception. If 'house' is the head, this is not problematic at all.
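
To make the comparison concrete, here is a rough sketch, assuming a simplified single-token subject ("Ali" rather than "Ali G.") and a hypothetical `(id, form, head, deprel)` tuple layout. With 'house' as the head, dropping the `cop` dependent leaves the same set of relations in both variants.

```python
# Hypothetical token layout: (id, form, head, deprel); head 0 marks the root.
def skeleton(tokens):
    """Return the sorted (form, deprel, head form) triples, ignoring 'cop' tokens."""
    forms = {tok_id: form for tok_id, form, _head, _deprel in tokens}
    forms[0] = "ROOT"
    return sorted(
        (form, deprel, forms[head])
        for _id, form, head, deprel in tokens
        if deprel != "cop"
    )


# Simplified variants of the example, with and without an overt copula.
with_copula = [
    (1, "Ali", 5, "nsubj"),
    (2, "is", 5, "cop"),
    (3, "in", 5, "case"),
    (4, "da", 5, "det"),
    (5, "house", 0, "root"),
]
without_copula = [
    (1, "Ali", 4, "nsubj"),
    (2, "in", 4, "case"),
    (3, "da", 4, "det"),
    (4, "house", 0, "root"),
]
same_structure = skeleton(with_copula) == skeleton(without_copula)
```

If the copula were the head instead, the copula-less variant would need a different root, and the two skeletons would no longer match.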

However I'm more for nsubj:cop than adding another major label, I think it can comfortably be subsumed under subject (we are talking about syntactic subjects anyway, so whether or not it has an agent-like theta-role is less relevant in my opinion)

dan-zeman commented 8 years ago

Would you then extend it to all adverbials and PPs, not only locative? E.g. temporal, as in "the show is today"?

What would you do with multiple adverbials such as

I am here today (and there tomorrow).

I am here in Prague.

dan-zeman commented 8 years ago

The more I think of it the more I wonder whether the entire copula business is worth it. It feels like we are striving to achieve cross-linguistic parallelism on one side, but the price is that on the other side, we lose parallelism between similar constructions within one language.

nschneid commented 8 years ago

> Would you then extend it to all adverbials and PPs, not only locative? E.g. temporal, as in "the show is today"?

Yes, I think it's fair to say today is the main predicate and should therefore be the head (though by not using nmod we wouldn't get the refinement of nmod:tmod).

> What would you do with multiple adverbials such as
>
> I am here today (and there tomorrow). I am here in Prague.

I think one of the adverbials would have to be designated as the head of the other. I would definitely go with 'here' in the first sentence and probably also in the second sentence. Note that omitting 'here' in the first sentence gives a very different meaning: #I am today. In the second sentence, either can be omitted, but it seems to me that 'in Prague' elaborates on the meaning of 'here'.

amir-zeldes commented 8 years ago

I agree - I can think of a number of other ways to annotate these, including coordination or apposition, but I think @nschneid's proposal is the most reliable. Arguably, 'here' and 'in Prague' are a kind of oblique apposition (taken literally, they both fulfill exactly the same semantic role). But the UD appos definition is explicitly limited to nominals, and I can imagine there will be more disagreement in corner cases, so I'm fine with 'in Prague' modifying 'here' too.

As for the root, yes, I think we're already not distinguishing nominal and adjectival predication, so is there a good reason to treat adverbial predication differently? I agree that adding :tmod to the root label would be kind of cool, but there are lots of cases where something being the root obscures some other function (I'm thinking of vocative fragments like "Jane!" not being able to be labeled vocative, for example).

manning commented 8 years ago

I support what @nschneid and @amir-zeldes are proposing and their arguments. At Stanford, we have honestly spent many, many hours analyzing and reanalyzing the copula and trialling different alternatives. (As usual) one can find a disadvantage for every choice, but overall, I'm quite convinced that the UD v1 analysis of having 'house' as the root of "He is in the house" is a marked improvement over what we had before, and I'd vote for keeping it.

I think it is also important not to throw out the baby with the bathwater. We really would lose a lot for the common user if we got rid of having cop altogether. This is sort of a random anecdotal instance, but, nevertheless, as an example, look here: [Difference between spacy and Stanford Parser in results](https://github.com/spacy-io/spaCy/issues/259).

Little note:

yoavg commented 8 years ago

I support the proposal by @nschneid and @amir-zeldes .

@manning, re Hebrew: the common analysis is that the present tense form of the verb היה (was) is homophonous with a pronoun. I think analyzing it as a resumptive pronoun is a bit weird. While some generative linguists do follow the resumptive pronoun analysis, this issue is still open afaik. But I am certain the vast majority of users of the Hebrew universal treebank will expect the copular analysis.

gcelano commented 8 years ago

While I understand why it makes sense to have a label for the verb "to be", I am wondering whether "copula" is the most appropriate here. I would have no problem redefining "copula" in the UD context, but admittedly this name is a bit confusing.

amir-zeldes commented 8 years ago

I'd like to back up what @yoavg is saying. Coming from a Hebrew linguistics education, my knee-jerk reaction would definitely be to expect it to be called a copula rather than a resumptive pronoun, and it alternates with a morphological verb in other tenses, as Yoav mentioned.

As for @gcelano 's point, I think if we call it 'copula' this is actually a standard use, as opposed to 'copula verb', which designates a verb serving as a copula. The Wikipedia article for copula, for what it's worth, starts with just such a statement - copulas often are, but don't have to be verbs:

https://en.wikipedia.org/wiki/Copula_(linguistics)

In the linguistic tradition of quite a few languages we have 'things derived from a pronoun stem' which are termed copula (Hausa and Coptic come to mind), and in some languages, there is an idiosyncratic item which is not quite like other verbs, but not a pronoun either, such as Japanese. From a dependency perspective, these are all doing a very similar job, so I think the label cop really contributes to syntactic comparability here. A lot of the differences between these languages' copulas are more morphological in my opinion.

gcelano commented 8 years ago

The term "copula" is used with different meanings in the literature. Some authors also use "copula" for locative be ("the book is on the table") or even - rather hazardously - for existential be ("there is a book"). More recently, however, Huddleston and Pullum (CGEL) reserve the name only for the verb "be" when it is accompanied by a predicative complement (i.e., only an NP or AP, not a PP). This latter definition - which I prefer - is in line with the definition of copula commonly used in Latin and Greek grammars (and those of other languages), which stems from the philosophical debate about being.

Using one label for all occurrences of "be" would make annotation a lot easier: admittedly, some uses of be + PP as non copular are of dubious interpretation ("it is of gold"), and without clear guidelines annotation inconsistencies are likely to arise within and across languages. On the other hand, using "copula" for all uses of "be" seems not to do justice to the importance of the distinction between copula stricto sensu ("this is green") and non-copular be, as found, prototypically, in existential sentences and with locative expressions.

If we adopt "copula" in its most general acceptation, I think this should be clearly specified in the documentation in order to avoid misinterpretations.

amir-zeldes commented 8 years ago

I agree with Giuseppe that 'strong' uses of the copula, such as existential predication, should not be marked as cop - that is functionally a very different thing, and many languages (including Hebrew) distinguish copula constructions from strong existentials.

However, I would be against the requirement that copulas should be verbs. Huddleston and Pullum's CGEL is, as its name suggests, a grammar of English: I don't think that we should take its definition of copula as a cross-linguistic recommendation. There is a lot of typological work on copulas, copula cycles etc. which very much assumes that non-verbs can be copulas, and that copula verbs can evolve from non-verbal copulas. The dissertation by Katz (1996), for example, shows a great variety and degree of fluidity between pronouns and copulas: http://hdl.handle.net/1911/16974.

In an often cited definition, Hengeveld (1992:32) describes copulas using the auxiliary predicate marking function:

A copula enables a non verbal predicate to act as a main predicate in those languages and under those circumstances in which the non-verbal predicate could not fulfill this function on its own

And in a monograph dedicated to copulas, Pustet (2003: 5) adds a semantic emptiness criterion to Hengeveld's definition but keeps very much the same line of excluding POS or morphology from the definition:

A copula is a linguistic element which co-occurs with certain lexemes in certain languages when they function as predicate nucleus. A copula does not add any semantic content to the predicate phrase it is contained in.

I think the reason that these definitions don't interact with part of speech status is that they are interested in language comparison, so the focus is on what makes sense for a lot of languages (over 150 languages in Pustet!) rather than a specific one. And I think the syntactic functional category of the elements discussed is basically the same as that of verbal copulas in English. Based on what I understand the UD project to be doing (and I could be quite wrong about this!), I feel strongly that we should not annotate verbal and non-verbal copulas differently. For those interested in the morphological question of different copula types, the POS tags can be used to designate something as a verb or not a verb IMO, or we could even have more fine-grained feature distinctions if needed.
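
As a rough illustration of leaving the verbal/non-verbal distinction to the POS layer, the sketch below counts `cop` dependents by tag. The `(form, upos, deprel)` layout and the specific tags (`AUX` for English "is", `PRON` for Hebrew "hu") are assumptions made for the example, not claims about how any existing treebank tags these items.

```python
from collections import Counter


# Hypothetical token layout: (form, upos, deprel).
def copula_pos_counts(tokens):
    """Count tokens attached as 'cop', grouped by their POS tag."""
    return Counter(upos for _form, upos, deprel in tokens if deprel == "cop")


# Illustrative tokens only; the tags are assumed for the sake of the example.
tokens = [
    ("is", "AUX", "cop"),       # verbal copula
    ("hu", "PRON", "cop"),      # pronominal copula
    ("student", "NOUN", "root"),
]
counts = copula_pos_counts(tokens)
```

Both items share the same dependency label, and anyone interested in the morphological type can still recover it from the tag.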

spyysalo commented 7 years ago

Closing as there is no recent activity and the v2 guidelines are now being published. Please consider opening a new issue with reference to the new guidelines and this discussion if there are open questions relating to this issue.