UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

treatment of non-predication copula #737

Closed chiarcos closed 3 years ago

chiarcos commented 3 years ago

The UD v2 analysis of copular clauses marks the predicate as head and attaches the copula by cop. This is partially motivated by zero copula clauses in languages such as Russian. However, while the zero copula (for this language) seems to capture a prominent sub-group of copular clauses (predicational clauses), non-predicational uses seem to require an explicit copula (such as the demonstrative eto in "Zevs -- eto Jupiter" [Zevs -- Jupiter] or the emphatic copula est' in definitional contexts). The existence of such uses may have been overlooked, as the earlier discussion was largely based on presentational and existential uses of the copula (https://universaldependencies.org/v2/copula.html#guidelines-for-udv2). This may have been intentional, as these are categories of nonverbal predication. However, copula clauses also include identificational uses (That woman is the Mayor of Cambridge. / That is Joe Smith.), equative/referential uses (The Morning Star is the Evening Star.) and specificational uses (What I don't like about John is his tie.).
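For readers less familiar with the current treatment, here is a minimal sketch of what "predicate as head, copula attached by cop" looks like in data. The token tuples are a simplified stand-in for CoNLL-U rows, and the function name `copular_clauses` is purely illustrative, not part of any UD tool.

```python
# Each token: (id, form, upos, head, deprel); head 0 marks the root.
# Simplified stand-in for CoNLL-U rows for "Ivan is the best dancer".
SENTENCE = [
    (1, "Ivan", "PROPN", 5, "nsubj"),
    (2, "is", "AUX", 5, "cop"),
    (3, "the", "DET", 5, "det"),
    (4, "best", "ADJ", 5, "amod"),
    (5, "dancer", "NOUN", 0, "root"),
]


def copular_clauses(tokens):
    """Return (subject, copula, predicate) form triples for every cop relation."""
    by_id = {tok[0]: tok for tok in tokens}
    triples = []
    for tid, form, upos, head, deprel in tokens:
        if deprel != "cop":
            continue
        predicate = by_id[head]  # the nonverbal predicate heads the clause
        subjects = [t[1] for t in tokens
                    if t[3] == head and t[4].startswith("nsubj")]
        subject = subjects[0] if subjects else None
        triples.append((subject, form, predicate[1]))
    return triples


print(copular_clauses(SENTENCE))
```

Note that the tree itself does not say whether "dancer" is used predicationally or identificationally; that is the gap this issue is about.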

This pattern occurs in other languages as well, where verbs that are formally identical with the copula are annotated as regular verbs if they contribute semantic information that goes beyond TAME, e.g., adverbial modifiers (https://universaldependencies.org/ga/dep/cop.html).

Suggestion:

Implications:

To be clarified:

sylvainkahane commented 3 years ago

UD makes a strong distinction between grammatical categories (AUX, ADP, SCONJ, etc.) and lexical categories. This division is problematic for many reasons: there is no clear boundary between grammar and lexicon, and the classical distributional criteria for identifying the head of a syntactic unit generally point to functional heads. Now, UD started like this and its initial choice can also have some advantages. We won't change UD now.

This is why we proposed another annotation scheme, SUD, which can be converted into UD. In SUD, auxiliaries and copulas of Indo-European languages are treated as heads because they are the distributional head of the clause. The complement of the copula is its comp:pred and this relation is reversed into cop in UD. In the same way, the lexical verb is the comp:aux of the auxiliary and this relation is reversed into aux in UD. Of course, it doesn't solve the question asked by @chiarcos, but it gives you a more homogeneous treatment of the structures. The differences are hidden in the labels (AUX vs VERB, comp:pred vs comp:obl).
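To make the "reversal" concrete, here is a toy sketch of turning a copula-headed structure into a predicate-headed one. It is not the actual SUD-to-UD conversion (which is rule-based and covers far more cases); the function name `reverse_comp_pred`, the dictionary layout, and the single label mapping subj to nsubj are assumptions made for illustration, and all other labels are left untouched.

```python
def reverse_comp_pred(tokens):
    """Toy reversal of comp:pred into cop.

    tokens: dict id -> {"form", "head", "deprel"}; returns a new dict.
    Handles one copular clause at a time; nested cases are out of scope.
    """
    out = {i: dict(t) for i, t in tokens.items()}
    for pid, pred in tokens.items():
        if pred["deprel"] != "comp:pred":
            continue
        cid = pred["head"]              # the copula heads the clause in SUD
        cop = tokens[cid]
        # The predicate takes over the copula's attachment ...
        out[pid]["head"] = cop["head"]
        out[pid]["deprel"] = cop["deprel"]
        # ... and the copula becomes its cop dependent.
        out[cid]["head"] = pid
        out[cid]["deprel"] = "cop"
        # Remaining dependents of the copula move to the predicate.
        for oid, other in tokens.items():
            if oid not in (pid, cid) and other["head"] == cid:
                out[oid]["head"] = pid
                if other["deprel"] == "subj":
                    out[oid]["deprel"] = "nsubj"
    return out


# SUD-style input for "Ivan is the best dancer" (copula as root).
SUD = {
    1: {"form": "Ivan", "head": 2, "deprel": "subj"},
    2: {"form": "is", "head": 0, "deprel": "root"},
    3: {"form": "the", "head": 5, "deprel": "det"},
    4: {"form": "best", "head": 5, "deprel": "mod"},
    5: {"form": "dancer", "head": 2, "deprel": "comp:pred"},
}

UD_LIKE = reverse_comp_pred(SUD)
for i in sorted(UD_LIKE):
    print(i, UD_LIKE[i]["form"], UD_LIKE[i]["head"], UD_LIKE[i]["deprel"])
```

The reversal is mechanical in this direction, which is why the copula-as-head information can be kept in SUD and still yield standard UD trees.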

All information about SUD is available at https://surfacesyntacticud.github.io. If you want to start your treebank in SUD and convert it to UD afterwards, please contact us. All UD treebanks converted to SUD are also available on the SUD website and can be queried at http://match.grew.fr/?corpus=SUD_English-GUM@2.6.

amir-zeldes commented 3 years ago

I'm not sure this distinction is something annotators could easily make, especially in pronominal cases. Whether "that's that" or "this is it" is identificational seems difficult to establish definitively, and I wouldn't be optimistic about converting existing TBs to express this distinction automatically (especially for languages without articles). And a side-effect is that you'd also lose the isomorphism between cases with and without copula (where the predicate noun has to dominate the subject), which I've always liked about UD.

From a lexico-centric perspective, I'd often rather know what the predicate noun is, regardless of whether it's indefinite or not, which can be checked on the noun and its dependents. Personally this seems to me more like a referential semantic distinction for a coreference annotation scheme, rather than part of the syntax tree. @sylvainkahane I think, if I understand you right, in SUD all of these copulas would be the head, right? So there wouldn't be a distinction there either?

chiarcos commented 3 years ago

Hi Amir. I see your point. But in non-predicational copula constructions this exact problem does exist, too, because if the relation holds between two non-predicates, either could be predicate (=head) and this needs to be decided ad hoc (e.g., always take the second as predicate -- but that's not what we see in all corpora) or contextually. At least for some languages, the latter approach is pursued and this is actually where considerable variation is found in UD, so we see cases with either first or second argument being predicate. For Maltese, Slavomír Čéplö just confirmed to me (offlist) that this choice involves a thorough consideration of the information structure involved (i.e., aboutness in a similar sense as in information-structural topics).

What I am suggesting is actually much more elementary and much more automatable. Given a list of grammatical devices that express referentiality (in the sense of Gundel et al. 1993) for a particular language, identify copular clauses with two referential(-marked) arguments as non-predicative; otherwise take the non-referential argument as predicate (head; = current treatment). The point here is that the exact referential function (aka degree of givenness, according to Gundel et al. 1993) is part of the semantics of these constructions, but that additional pragmatic forces exist that allow speakers to deviate from this default interpretation (formalized in Gricean pragmatics in this case). As UD focuses on syntax and semantics, but not pragmatics, I suggest ignoring the pragmatic aspect in the UD annotation and staying with the default semantics of the respective grammatical devices.

This may sound a bit abstract, but it really only means applying filters to detect referential expressions, not doing any manual reannotation. Following Gundel et al. (1990, tab 13), filters for constructions marked light brown (for referring expressions that are "uniquely identifiable" or higher in their terminology, resp. "referential" or higher according to their subsequent publications) should work [using their transcription]:

[image: table from Gundel et al. (1990, tab 13), with the relevant constructions marked light brown]

What is not covered in this diagram are names (which are always referential), and of course, there may be more variants (e.g., in gender, person and agreement features).

So, my suggestion is that whenever we see a copula connecting two expressions that match those filters, we annotate the copula as head. My feeling is that this can actually be simplified to match more cross-linguistically applicable terms so that it can be formulated as a relation between UD features, UD pos and UD deps and directly added to the validator.
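A rough sketch of how such a filter could look when phrased over UD features, UPOS and deprels. The set of "referential" devices used here (proper nouns, personal and demonstrative pronouns, definite or demonstrative determiners) is only an assumed stand-in for a language-specific list; it does not reproduce the table from Gundel et al., and the helper names `tok`, `is_referential` and `non_predicational_candidates` are hypothetical.

```python
def tok(tid, form, upos, head, deprel, **feats):
    """Build a simplified token record (stand-in for a CoNLL-U row)."""
    return {"id": tid, "form": form, "upos": upos,
            "head": head, "deprel": deprel, "feats": feats}


def is_referential(token, tokens):
    """Assumed filter: names, personal/demonstrative pronouns, definite NPs."""
    if token["upos"] == "PROPN":
        return True
    if token["upos"] == "PRON" and token["feats"].get("PronType") in {"Prs", "Dem"}:
        return True
    for dep in tokens:
        if dep["head"] == token["id"] and dep["deprel"] == "det":
            feats = dep["feats"]
            if feats.get("Definite") == "Def" or feats.get("PronType") == "Dem":
                return True
    return False


def non_predicational_candidates(tokens):
    """Copular clauses whose subject and predicate both pass the filter."""
    by_id = {t["id"]: t for t in tokens}
    hits = []
    for t in tokens:
        if t["deprel"] != "cop":
            continue
        pred = by_id[t["head"]]
        subjects = [s for s in tokens
                    if s["head"] == pred["id"] and s["deprel"] == "nsubj"]
        if subjects and is_referential(pred, tokens) and is_referential(subjects[0], tokens):
            hits.append((subjects[0]["form"], t["form"], pred["form"]))
    return hits


# "That woman is the mayor" vs. "Kim is a teacher"
IDENT = [
    tok(1, "That", "DET", 2, "det", PronType="Dem"),
    tok(2, "woman", "NOUN", 5, "nsubj"),
    tok(3, "is", "AUX", 5, "cop"),
    tok(4, "the", "DET", 5, "det", Definite="Def"),
    tok(5, "mayor", "NOUN", 0, "root"),
]
PRED = [
    tok(1, "Kim", "PROPN", 4, "nsubj"),
    tok(2, "is", "AUX", 4, "cop"),
    tok(3, "a", "DET", 4, "det", Definite="Ind"),
    tok(4, "teacher", "NOUN", 0, "root"),
]

print(non_predicational_candidates(IDENT))
print(non_predicational_candidates(PRED))
```

A check of this kind only flags candidates; whether the flagged clauses should then get the copula as head is exactly what is under discussion here.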

Refs:

Gundel, J. K., Hedberg, N., & Zacharski, R. (1990). Givenness, Implicature, and the Form of Referring Expressions. Proceedings of the Sixteenth Annual Meeting of the Berkeley Linguistics Society, 442-453.

Gundel, J. K., Hedberg, N., & Zacharski, R. (1993). Cognitive Status and the form of referring expressions in discourse. Language, 69, 274-307.

chiarcos commented 3 years ago

@amir-zeldes Coming back to pronominal cases, these would always be non-predicational, then. And this makes sense for English, because the (pragmatic interpretation of) sentences like "that's that" or "this is it" would carry a special meaning beyond the TAME information required by UD cop.

We would indeed lose the isomorphism between cases with and without copula, but only if non-predicational copular clauses without overt copula exist to begin with. I'm not convinced they do (in any language). If I'm wrong about this, we should make it explicit that cop applies to non-predicates, i.e., replace the following lines in the definition of cop:

OLD: A cop (copula) is the relation of a function word used to link a subject to a nonverbal predicate.

NEW: A cop (copula) is the relation of a function word used to link a subject to a nonverbal predicate or to a nonverbal non-predicate to which the function word asserts a predicative interpretation.

bulbulistan commented 3 years ago

For Maltese, Slavomír Čéplö just confirmed to me (offlist) that this choice involves a thorough consideration of the information structure involved (i.e., aboutness in a similar sense as in information-structural topics).

To clarify: I was referring to information structure to a) explain my decisions and b) to argue against @chiarcos' assertion that personal names can never be semantic predicates (and, by extension, against the entire concept of equative sentences). I'm with @amir-zeldes, this should not be a part of UD. If there's folk who consider it useful, by all means, do annotate your treebanks that way and then convert it to UD.

chiarcos commented 3 years ago

Yes, sorry for taking that out of context. It boils down to how to define a predicate. Does UD define a predicate? If not, this discussion cannot be settled, but then, the term "non-verbal predicate" should not be used in the definition.

bulbulistan commented 3 years ago

Yes, sorry for taking that out of context. It boils down to how to define a predicate. Does UD define a predicate? If not, this discussion cannot be settled, but then, the term "non-verbal predicate" should not be used in the definition.

Indeed and I believe that the Higgins distinction is going way too deep for UD purposes. There is no reason why it could not be part of extended annotations, but as core UD, eh, I don't know.

chiarcos commented 3 years ago

BTW: My personal understanding of "predicate" comes from semantic type theory, i.e., logic. It is by no means theoretically motivated, though: it comes from the physical pain (only half a joke) of disentangling the scope of adnominal and adverbial dependents when using UD annotations for knowledge extraction, especially in combination with dependents not marked for having a nominal or verbal head (conj, in particular; in UD v.1 also nmod and neg) or in cases of dependents with elided heads. Having a better grasp on entities (by eliminating predicates grammatically marked to be referential) would help a lot. So, here, I'm coming not from a data provider but from a consumer perspective. The move from UD v.1 to UD v.2 was already a great step in this direction, and in fact, SUD would help here, too -- but that works only to the extent that data is systematically provided as SUD rather than UD.

bulbulistan commented 3 years ago

For the record, I'm 100% with you on conj.

amir-zeldes commented 3 years ago

Yes, conj is nefarious that way :) (and imagine you can also get conj of predicational and identificational copula - "Kim is the prime minister and totally unreliable"), but without enhanced dependencies there's only so much we can do...

@chiarcos I agree these are interesting constructions to distinguish, but I still think the difference is semantic, and not syntactic - the tree can look the same for both and other annotations can show the difference.

As for identificational predication without a copula, I think it's normal in many languages. In Hebrew the most normal way of introducing yourself is:

ani amir - lit. "I Amir"

which is identificational and has no copula. In Coptic, first and second person identification can have no copula, but third person must... And in languages without articles, it can often be hard to tell which semantic type we are looking at. If the only difference is a semantic one, then I would punt this distinction to a non-syntactic annotation layer.

chiarcos commented 3 years ago

On .10.2020, 22:16, Amir Zeldes notifications@github.com wrote:

@chiarcos I agree these are interesting constructions to distinguish, but I still think the difference is semantic, and not syntactic - the tree can look the same for both and other annotations can show the difference.

Honestly, any argument for a syntax-first over a semantics-first approach to annotation is relatively weak in this case. The difference between having or not having an overt copula is a syntactic one as well, so the syntax-first principle isn't exactly thoroughly applied here (nor is it supposed to be).

As for identificational predication without a copula, I think it's normal in many languages. In Hebrew the most normal way of introducing yourself is:

ani amir - lit. "I Amir"

which is identificational and has no copula.

But is there any reason to analyze that as a copular clause (in line with the underlying semantics) rather than as an apposition (in line with the surface syntax)?

In any case, the main problem is that predication is not defined. I personally see no reason not to adopt the type-theoretic definition, because that's pretty established in certain branches of linguistics, but then, non-predicative copula does exist and must be dealt with, either by defining it as out of scope for cop (which seems to be intended in the current text) or by defining it as within the scope of cop (as it exists in the data). Other definitions may have other implications, and possibly, non-predicative copula clauses don't exist under these definitions, but then a definition should make that explicit as well.


amir-zeldes commented 3 years ago

But is there any reason to analyze that as a copular clause (in line with the underlying semantics) rather than as an apposition (in line with the surface syntax)?

Oh, no, sorry, I should have included a translation - it's definitely not an apposition, it means "I am Amir". It's very typical in Semitic languages historically and comparatively. For example it's attested in the first commandment, "I am Jehova, your God", which also has no copula in the original Hebrew. The second part, "your God", is an apposition, but there is definitely predication between "I" and the rest.

Like I said, I'm interested in the predication vs. identification distinction, but I think there is no reliable, cross-linguistic and purely form-based criterion to distinguish them. There are also a lot of other murky uses of cop in other languages that are hard to pin down, such as VP + cop, which appears in African languages like Hausa and Coptic. You can basically say:

Kim went

Or:

Kim went COPULA

And it means something like "it's the case that Kim went" (details are somewhat more complex). I'm not sure if such uses are identificational, but I think there's a strong motivation to keep 'went' as the head either way, even though I'd struggle to see this as meaningfully predicational.

bulbulistan commented 3 years ago

It's very typical in Semitic languages historically and comparatively.

Same with Arabic, whether Classical/Modern Standard or modern varieties, same in Maltese. And it happens in interrogative sentences, too:

Min jien? Jiena Doktor Alex Matrenza
Who I? I Doktor Alex Matrenza

Hungarian does the same thing:

Ki maga? A nevem Sándor Petrovics.
Who you? DEF my name Sándor Petrovics

chiarcos commented 3 years ago

On .10.2020, 21:48, Amir Zeldes notifications@github.com wrote:

But is there any reason to analyze that as a copular clause (in line with the underlying semantics) rather than as an apposition (in line with the surface syntax)?

Oh, no, sorry, I should have included a translation - it's definitely not an apposition, it means "I am Amir". It's very typical in Semitic languages historically and comparatively. For example it's attested in the first commandment, "I am Jehova, your God", which also has no copula in the original Hebrew. The second part, "your God", is an apposition, but there is definitely predication between "I" and the rest.

Then, this might actually be a case of non-predicational zero copula -- unless "I" is predicative (in the sense of "being me"). If this can be confirmed, I would suggest clarifying the UD v.2 definition of cop, adding the note that the predicate can also be a referent, e.g., a named entity, adding Arabic, Maltese and Hungarian examples, and closing the issue. This seems to be the most natural solution in the context of UD v.2, but I think the clarification in the definition is necessary because this can be counter-intuitive.

If there ever is an initiative to develop UD v.3 guidelines, that issue
should be addressed, again.

Like I said, I'm interested in the predication vs. identification distinction, but I think there is no reliable, cross-linguistic and purely form-based criterion to distinguish them. There are also a lot of other murky uses of cop in other languages that are hard to pin down, such as VP + cop, which appears in African languages like Hausa and Coptic. You can basically say:

Kim went

Or:

Kim went COPULA

And it means something like "it's the case that Kim went" (details are somewhat more complex). I'm not sure if such uses are identificational, but I think there's a strong motivation to keep 'went' as the head either way, even though I'd struggle to see this as meaningfully predicational.

Copulas tend to be grammaticalized into focus particles, and this looks very much like such a use. But as the term "grammaticalization" implies, this actually involves an internal change of the grammatical function of the copula, so we're not necessarily looking at a copula here, but at a particle that happens to be identical in form with the copula. And of course, went would be head in that case.

Coming back to Sumerian (not the primary language I'm working on, but one that got considerable attention in a recent project of mine), the grammaticalization of the copula went pretty far: out of a use of the copula as an emphatic particle, the enclitic copula can basically be used in place of morphological markers of case (every case, as far as I can tell). Sentences may thus contain several copulas in different functions -- at the same time, predicative copula is optional. I didn't find a non-predicational zero copula, though, or at least none that could not also be analyzed as an apposition.

amir-zeldes commented 3 years ago

OK, if these examples are satisfactory I'm adding them to cop in 2d0376d along with listing identity as one of the interpretations of cop.

Grammaticalization into information structural markers is also very interesting, and I definitely think it's relevant for the African cases I've seen, but again it's something I'd rather see in FEATS than in the syntax tree itself. There's a lot of variability in what relations mean semantically across languages (think about permanent vs. temporary predication with cop in some languages, for example), but it's useful to have a relatively rough and comparable formalism where the majority of "A is B" things have the same analysis, and more fine-grained distinctions can optionally be added in other layers.

amir-zeldes commented 3 years ago

I also just noticed BTW, the Russian example on that page, 'Ivan is the best dancer', has a definite interpretation too, at least under the assumption of superlative uniqueness, so it can be construed as an identity predication as well. But of course in Russian this is harder to identify due to the lack of articles.

chiarcos commented 3 years ago

(This is not a suggestion to re-open the issue. It remains closed as discussed before, but for the record, and for future discussions in a possible UD revision, I add a complex example that shows what we lose.)

Sumerian, according to Jagersma (2010):

Annotating the functional head rather than the morphological head of copula clauses can produce complicated structures.

# Jagersma, Chap. 8 (116)                   
# ‘whoever she is not, whoever she is.’                 
# (Cyl A 4:23; L; 22)                   
1   a-ba-   a.ba=Ø  who=ABS 0   root
2   me-a    'i-me-Ø-'a=Ø    VP-be-3SG.S-NOM=ABS 1   cop
3   nu  nu  NEG 1   cop
4   a-ba-   a.ba=Ø  who=ABS 1   acl
5   me-a-né 'i-me-Ø-'a=ane=e    VP-be-3SG.S-NOM=her=DIR 4   cop

(This is not UD, but CDLI-CoNLL, because UD FEATS do not preserve the order of morphemes. But this is what triggers the syntactic structure.)

The internal structure is clearer from the phrase structure:

(        # empty head (headless clause) => head stays on 1
 (       # clause; syntactic head: NEG/cop => annotate 1
  (      # nominalized, head stays
   (     # clause; syntactic head: cop   => annotate 1
      who=ABS
      VP-be-3SG.S ) # cop, end of "who is"
     –NOM         ) # relative clause, end of "that who is"
     =ABS           # abs argument of "is not"
      NEG         ) # neg. cop, end of "that who is is not"
 (       # relative clause => depends on head of preceding clause
  (      # clause; syntactic head: cop   => annotate 4
      who=ABS
      VP-be-3SG.S ) # cop, end of "who is"
     -NOM         ) # end of relative clause "that who is"
     =her           # assume that this takes scope over both sentences [i.e., the first]
     =DIR         ) # DIR is the case of the empty head

Discrepancies between morphological (carrying verbal inflection) and functional (= UD) heads:

UD is necessarily limited in capturing scope, but by annotating the functional head, we lose more than necessary and cannot distinguish between arguments of the non-negative copula clause and the negated copula clause that contains it. It's also largely impossible to recover that information from UD.
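A small sketch to make the loss visible: grouping the cop dependents of the example by head shows that the affirmative copula and the negation both end up attached to the same token. The rows are copied from the annotation above (whitespace-separated); the column interpretation (id, form, segmentation, gloss, head, deprel) and the function name `cop_dependents` are assumptions for illustration.

```python
from collections import defaultdict

# Rows copied from the Sumerian example; comment lines omitted.
ROWS = """\
1 a-ba- a.ba=Ø who=ABS 0 root
2 me-a 'i-me-Ø-'a=Ø VP-be-3SG.S-NOM=ABS 1 cop
3 nu nu NEG 1 cop
4 a-ba- a.ba=Ø who=ABS 1 acl
5 me-a-né 'i-me-Ø-'a=ane=e VP-be-3SG.S-NOM=her=DIR 4 cop
"""


def cop_dependents(rows):
    """Map each head id to the (id, gloss) pairs attached to it by cop."""
    by_head = defaultdict(list)
    for line in rows.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        tid, form, segm, gloss, head, deprel = line.split()
        if deprel == "cop":
            by_head[int(head)].append((int(tid), gloss))
    return dict(by_head)


for head, deps in cop_dependents(ROWS).items():
    print(head, deps)
```

Token 1 carries two cop dependents (the nominalized copula and NEG), so the flat tree no longer shows that the negated clause contains the affirmative one.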

If the negative-polarity copula is considered non-predicative (because it actually asserts non-identity rather than a predication) and treated as head, we would be able to distinguish both clauses.