UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Is there a way to validate more words as copulas? #627

Closed jpiitula closed 4 years ago

jpiitula commented 5 years ago

Most of the remaining validation errors in Finnish-FTB are "is not a copula in language [fi]". The treebank treats occurrences of several words as "cop" but the validator only accepts "olla". Can I do something to keep at least the most frequent of these, or could someone? (The stray NOUN is related to the verb "tulla" and the analysis is probably motivated by that, but I'm not a grammarian myself.)

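| word | count | upos | gloss |
|---|---:|---|---|
| olla | 3561 | AUX | be |
| tulla | 146 | AUX | become |
| tehdä | 30 | AUX | make |
| ei | 22 | AUX | not |
| kasvaa | 5 | AUX | grow |
| saada | 4 | AUX | - |
| kehittyä | 3 | AUX | - |
| tulo | 2 | NOUN | - |
| täyttää | 2 | AUX | - |
| odottaa | 2 | AUX | - |
| kehkeytyä | 2 | AUX | - |
| viimeistellä | 1 | AUX | - |
| veikata | 1 | AUX | - |
| toivoa | 1 | AUX | - |
| sukeutua | 1 | AUX | - |
| muovata | 1 | AUX | - |
| muodostua | 1 | AUX | - |
| leipoa | 1 | AUX | - |
| kehittää | 1 | AUX | - |
| ennustaa | 1 | AUX | - |
| alkaa | 1 | AUX | - |
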
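
A tally like the one above can be produced directly from a CoNLL-U file. The following is only a minimal sketch of one way to do it, not the script that was actually used: the function name `count_cop_lemmas` and the two sample sentences are made up for illustration, and the second sentence is deliberately annotated in the style the validator rejects.

```python
from collections import Counter

def count_cop_lemmas(conllu_text):
    """Tally (lemma, upos) pairs for tokens whose DEPREL is cop."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        lemma, upos, deprel = cols[2], cols[3], cols[7]
        # Compare the universal part only, so a subtype such as cop:own
        # would also count, while nsubj:cop does not.
        if deprel.split(":")[0] == "cop":
            counts[(lemma, upos)] += 1
    return counts

# Invented sample data (hypothetical, not taken from Finnish-FTB).
SAMPLE = "\n".join([
    "# text = Hän on opettaja.",
    "1\tHän\thän\tPRON\t_\t_\t3\tnsubj:cop\t_\t_",
    "2\ton\tolla\tAUX\t_\t_\t3\tcop\t_\t_",
    "3\topettaja\topettaja\tNOUN\t_\t_\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
    "# text = Hän tuli iloiseksi.",
    "1\tHän\thän\tPRON\t_\t_\t3\tnsubj:cop\t_\t_",
    "2\ttuli\ttulla\tAUX\t_\t_\t3\tcop\t_\t_",
    "3\tiloiseksi\tiloinen\tADJ\t_\t_\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
])

if __name__ == "__main__":
    for (lemma, upos), n in count_cop_lemmas(SAMPLE).most_common():
        print(lemma, n, upos)
```
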
jnivre commented 5 years ago

UD is very restrictive about copulas, and normally just allows one verb in each language. This should be the completely neutral verb that just serves to link the nonverbal predicate to the subject and which corresponds to nothing in languages that don't use a copula in these constructions. A verb like "become" in English is not treated as a copula because it has more semantic content and corresponds to a real verb in most other languages.

dan-zeman commented 5 years ago

Having more than one copula per language is strongly discouraged, but the guidelines (see here) describe some cases in which more copulas are allowed, for example when their paradigms are defective and one copula is used in the present tense while another is used in the past tense.

Given the English translations in your list, they don't seem to be copulas (e.g. to become is not a copula), but translations can be misleading so I don't want to take them as the only criterion.

Verbs that do not qualify as UD copulas are treated as content verbs, and the nominal/adjectival predicate is attached as a secondary predicate via xcomp.
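
The reanalysis described above can be sketched as a small tree rewrite. This is an illustration under stated assumptions, not a tested conversion for any treebank: the token dictionaries, the function name `demote_cop`, and the `CLAUSE_LEVEL` set of relations that move up to the verb are all hypothetical choices, and the sketch assumes at most one `cop` dependent per predicate.

```python
# Relations assumed (for this sketch only) to belong to the clause rather
# than to the nominal/adjectival predicate, so they move up to the verb.
CLAUSE_LEVEL = {"nsubj", "csubj", "obl", "advmod", "advcl",
                "aux", "mark", "punct", "cc"}

def demote_cop(tokens, allowed_lemmas):
    """Turn disallowed 'cop' verbs into heads with an xcomp predicate.

    tokens: list of dicts with keys id, lemma, upos, head, deprel.
    The list is modified in place and returned.
    """
    by_id = {t["id"]: t for t in tokens}
    for verb in tokens:
        if verb["deprel"] != "cop" or verb["lemma"] in allowed_lemmas:
            continue
        pred = by_id[verb["head"]]
        # The verb takes over the predicate's attachment and becomes a content verb.
        verb["head"], verb["deprel"] = pred["head"], pred["deprel"]
        verb["upos"] = "VERB"
        # The former predicate becomes a secondary predicate of the verb.
        pred["head"], pred["deprel"] = verb["id"], "xcomp"
        # Clause-level dependents of the old predicate are reattached to the verb.
        for tok in tokens:
            if tok is verb or tok is pred or tok["head"] != pred["id"]:
                continue
            if tok["deprel"].split(":")[0] in CLAUSE_LEVEL:
                tok["head"] = verb["id"]
                if tok["deprel"].endswith(":cop"):
                    # e.g. nsubj:cop -> nsubj, since the clause is now verbal
                    tok["deprel"] = tok["deprel"][:-len(":cop")]
    return tokens

# Invented example: "Hän tuli iloiseksi ." with "tulla" annotated as cop.
sentence = [
    {"id": 1, "lemma": "hän", "upos": "PRON", "head": 3, "deprel": "nsubj:cop"},
    {"id": 2, "lemma": "tulla", "upos": "AUX", "head": 3, "deprel": "cop"},
    {"id": 3, "lemma": "iloinen", "upos": "ADJ", "head": 0, "deprel": "root"},
    {"id": 4, "lemma": ".", "upos": "PUNCT", "head": 3, "deprel": "punct"},
]

if __name__ == "__main__":
    for tok in demote_cop(sentence, allowed_lemmas={"olla"}):
        print(tok["id"], tok["lemma"], tok["upos"], tok["head"], tok["deprel"])
```

Dependents that are internal to the predicate (determiners, modifiers, case markers) stay where they are in this sketch. Which relations count as clause-level is a per-language decision, so the set above would need review before use on real data.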

jpiitula commented 5 years ago

Thank you both. The UD answer appears to be no.

This treebank is originally derived from a different analysis. I'll see if I can recognize the occurrences that are not too tricky to transform.

dan-zeman commented 5 years ago

The pull request https://github.com/UniversalDependencies/tools/pull/51 by @arademaker seems to have the same problem. I suspect that Portuguese ficar and viver are not copulas under UD guidelines.

bulbulistan commented 5 years ago

The actual wording of the guidelines is "The cop relation should only be used for pure copulas that add at most TAME categories to the meaning of the predicate, which means that most languages have at most one copula". So most, but not all. What are the exact criteria? And what is one to do with the individual level vs. stage level distinction (e.g. Spanish ser vs. estar)?

jnivre commented 5 years ago

The exact criteria are always language-specific, so each group must try to interpret the universal guidelines in a way that makes sense for the given language and does not create unmotivated differences across languages. In the Spanish treebanks, both "ser" and "estar" are treated as copulas. But, for example, inchoative verbs like "become" and "get" (and their equivalents in other languages) are not treated as copulas. There is a fairly detailed discussion of different constructions across a range of languages in the guidelines: https://universaldependencies.org/u/overview/simple-syntax.html#nonverbal-clauses
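
A per-language check of this kind could look roughly like the sketch below. It is not the actual validator code: the `ALLOWED_COPULAS` table and the function `check_copulas` are hypothetical, the table contains only the lemmas named in this thread (Finnish "olla", Spanish "ser" and "estar"), and the message text merely imitates the error quoted at the top of the issue.

```python
# Hypothetical allow-list, filled only with lemmas mentioned in this thread.
ALLOWED_COPULAS = {
    "fi": {"olla"},
    "es": {"ser", "estar"},
}

def check_copulas(tokens, lang):
    """Return one message per 'cop' token whose lemma is not allowed.

    tokens: iterable of (lemma, deprel) pairs.
    Languages missing from the table are not checked at all (an assumption
    of this sketch).
    """
    if lang not in ALLOWED_COPULAS:
        return []
    allowed = ALLOWED_COPULAS[lang]
    return [
        f"'{lemma}' is not a copula in language [{lang}]"
        for lemma, deprel in tokens
        if deprel.split(":")[0] == "cop" and lemma not in allowed
    ]

if __name__ == "__main__":
    finnish = [("olla", "cop"), ("tulla", "cop"), ("hän", "nsubj:cop")]
    for message in check_copulas(finnish, "fi"):
        print(message)
```

Keeping the exceptions in a small table makes each one an explicit, reviewable decision per language.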

dan-zeman commented 5 years ago

I would love to have exact criteria but I am afraid they do not exist (or better: those that exist are not exact enough). The intuition behind it is that equivalents of to be can be copulas and equivalents of to become, to stay, to seem etc. are not. But as I said above, translations are often misleading, so they alone cannot serve as criteria. And although the "TAME" remark was intended to characterize the verb to be, a creative opponent can claim that to become is the equivalent of to be in different aspect.

bulbulistan commented 5 years ago

@jnivre "each group must try to interpret the universal guidelines" And don't these words fill you with dread? :) I do not disagree with your position, but look at it from our point of view: the wording of the guidelines and the initial interpretation by e.g. @dan-zeman above ("More copulas per language are strongly discouraged") would suggest a much stricter view than the one provided just now. The guidelines could certainly use an update; I guess I'm volunteering.

jnivre commented 5 years ago

I didn't mean to be any less strict than Dan (or the guidelines). The rule is "one copula per language", which means that you must have very strong arguments for exceptions. But to rule out the possibility of exceptions by stipulation would be counterproductive. And the whole UD enterprise depends on groups that have expertise in specific languages making informed choices in the spirit of the universal guidelines. This in itself does not frighten me, but what is sometimes a problem is that groups (consciously or not) make decisions in the spirit of a language-specific descriptive tradition instead of trying to honor the cross-linguistic perspective. That is why discussions like this are crucially important to UD. :)

lauma commented 5 years ago

Could a language theoretically have no copula at all, if "be" in some language is morphologically very similar to other verbs and constructions with it don't seem to stand out in any way?

jnivre commented 5 years ago

There are many languages without copulas, but then they don't use a verb at all in sentences like "she (is) smart". And I don't see why morphological similarity should be a criterion here.

sylvainkahane commented 5 years ago

I understand @lauma's question as follows: is the notion of copula in UD a purely semantic notion or does a copula have some particular syntactic or morphological properties?

jnivre commented 5 years ago

It is a morphosyntactic notion, since it is defined as a strategy for linking a nonverbal predicate to its subject, but there are no specific criteria concerning morphological properties (for example, inflection) or syntactic properties (for example, word order). Indeed, in many languages, the verb serving as a copula (like English "be") also has other uses (as is also the case for other auxiliaries like the equivalent of "have", which is multifunctional in many languages).

sylvainkahane commented 5 years ago

Sorry Joakim, but I don't see any argument in your answer explaining why the notion of copula in UD is not purely semantic. For instance, in French, we have some verbs that behave exactly like être 'to be' from the syntactic point of view, such as devenir 'to become': same paradigm of complements, same agreement, same word order. I don't see any morphosyntactic reason to have a different analysis of être and devenir.

dan-zeman commented 5 years ago

I think the rule that we only accept the linking be verb and not other verbs actually is semantic. Since the original motivation was to make it parallel with languages that omit the copula, I guess that the be verb is the closest one to <NOTHING> when we want parallelism across languages.