UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Copula: VERB vs. AUX #275

Closed dan-zeman closed 7 years ago

dan-zeman commented 8 years ago

Since the first release of the UD guidelines in October 2014, copula verbs have been tagged VERB and not AUX. But the stance was not unanimous in the core UD group. Should we revise the decision for version 2 of the guidelines?

The discussion at that time was done by e-mail. I am going to post the relevant messages here so we have a base for further arguments.

dan-zeman commented 8 years ago

@jnivre (15.9.2014): AUX: Part of the definition of AUX says that it should be associated with a lexical verb. This would seem to imply that the copula is not an auxiliary verb (whereas the use of “be” to form the progressive form and passive voice is). Is everyone happy with this? I have also suggested that an exact definition of AUX (for languages that have auxiliaries at all) should be included in the language-specific documentation.

dan-zeman commented 8 years ago

@dan-zeman (17.9.2014): I agree that the copula is not the same as an auxiliary, and it is not a normal verb either. Since we do not have a dedicated tag for copulas, we have to put them somewhere, i.e. either VERB or AUX. The same holds for modal verbs. I have no strong preference here but I am slightly inclined to keep copulas among main verbs and modals among auxiliaries.

@jnivre (17.9.2014): Thanks, Dan. I share your view concerning the VERB/AUX distinction (exclude copula, include modals)

dan-zeman commented 8 years ago

@manning (26.9.2014): Hi Joakim and everyone,

While generally agreeing with what you’ve been writing up for principles, you put in:

Note that copula verbs, despite being dependents of their predicates, are treated as main verbs in this respect and take auxiliaries as dependents. The general rule is that an auxiliary should always attach to a verb (if there is one).


This isn’t what we’ve been doing. We’ve been more radically content-word-as-head than that, and would have

nsubj(sick, She) aux(sick, could) aux(sick, have) aux(sick, been)

I suspect that this is actually the better way to go, at least for English. This is especially the case for participles. There are the usual decades of linguistic literature arguing that participles sometimes function as adjectives, indicated by things like being gradable, negatable with ‘un’, etc.:

The book is interesting, The book was quite interesting, The book was uninteresting

But for others there is evidence of them being verbs, such as by being able to use ‘re-’ with them.

The book is decaying, The book is repositioning the status of women.

But, as it notes in the Penn Treebank tagging manual, in practice “The distinction between adjectives and gerunds/present participles is often very difficult to make.” An example that is done both ways is with “will be destabilizing”.

This way of doing things would mean that the dependency structure would change depending on which way you voted on the part of speech. While it’s a shame that the part of speech is often assigned inconsistently/arbitrarily in various situations, it seems to me a much bigger problem if that also changes the skeleton of the dependency tree.

So, unless there are compelling reasons to do otherwise, I’d argue for doing it the way we’ve been doing it…. Is there a good reason to do things this way?
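To make the two competing analyses of "She could have been sick" concrete, here is a minimal Python sketch. The triple layout (relation, head, dependent), the constant names and the helper functions are assumptions made for illustration. The `cop` arc in the first analysis is inferred from the quoted rule, not stated in the message. Arcs are written head first.

```python
# Arcs are (relation, head, dependent) triples. This layout is an assumption
# made for illustration, not an official UD data format.

# Draft rule quoted in the message: auxiliaries attach to the copula verb.
# The cop arc is inferred from "copula verbs, despite being dependents of
# their predicates"; it is not spelled out in the message.
VERB_HEADED = [
    ("nsubj", "sick", "She"),
    ("cop", "sick", "been"),
    ("aux", "been", "could"),
    ("aux", "been", "have"),
]

# Content-word-as-head analysis: every function word attaches to the predicate.
# Relation labels follow the message as written.
PREDICATE_HEADED = [
    ("nsubj", "sick", "She"),
    ("aux", "sick", "could"),
    ("aux", "sick", "have"),
    ("aux", "sick", "been"),
]


def heads(arcs):
    """Return the set of words that act as heads in an analysis."""
    return {head for _, head, _ in arcs}


def skeleton(arcs):
    """Return the unlabeled tree as a set of (head, dependent) pairs."""
    return {(head, dep) for _, head, dep in arcs}


# Arcs whose attachment differs between the two analyses.
moved = skeleton(VERB_HEADED) ^ skeleton(PREDICATE_HEADED)
print(sorted(heads(VERB_HEADED)))
print(sorted(heads(PREDICATE_HEADED)))
print(sorted(moved))
```

In the predicate-headed analysis the only head is the predicate itself, so retagging the predicate (adjective vs. participle) would leave the unlabeled skeleton untouched, which is the property argued for above.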

dan-zeman commented 8 years ago

@jnivre (27.9.2014): On second thought, I tend to agree with @manning. Always attaching to a verb seemed like a nice idea at the time, but always attaching to the predicate is probably better in the long run. I think the Finnish treebank attaches auxiliaries to copulas, but this is probably because they use a nested structure for auxiliaries in general.

dan-zeman commented 8 years ago

@fginter (29.9.2014): This is in line with the rest of the decisions, so for consistency reasons I'm okay(-ish) with it and will rehang the auxiliaries in TDT. But since this is not how we did it in TDT, I suppose it says we think the earlier analysis was more in agreement with our intuition. Especially since the auxiliaries can never be there without the main copula verb, they seemed to be clearly bound to it.

jnivre commented 8 years ago

It seems to me that the logical conclusion of this discussion should have been that copulas are AUX because they are treated structurally as auxiliaries (not taking auxiliaries themselves, and being siblings of auxiliaries). Unless there are strong arguments to the contrary, I would therefore advocate this change for v2.

@dan-zeman: Thanks for digging this discussion out of your email archive and posting it on github.

manning commented 8 years ago

@jnivre: I also see how making copulas AUX seems more consistent with the predicate-as-head structures we assign in cases like "She is smart".

nschneid commented 8 years ago

While I see the structural argument that copulas are like auxiliaries, I worry that expanding the definition of the AUX tag will confuse people used to the traditional definition of auxiliary as a function word that accompanies a main verb. (Is there a standard term that covers verbal function words? I think "support verb" is in the right ballpark but not quite right.)

If we want to use the POS tagset to express that copulas are not-quite-main-verbs, why not introduce a new tag (COP) and remove all confusion? Then again, if the distinction is already being marked in the dependency relation (aux vs. cop), why do we need the tag distinction at all—why not just call them all verbs?

dan-zeman commented 8 years ago

One thing that I like about copulas becoming AUX is that ambiguity between periphrastic passive on one side, and copula+participle on the other side, will only affect the deprel and not the POS tag. (Example: the contract was signed after the lunch ... auxpass(signed, was) vs. the contract is signed at the last page ... could be cop(signed, is), because the phrase describes the state of the contract rather than the act of signing.)

Thus I would not be happy with introducing a COP tag. Of course removing the AUX tag and keeping only VERB would solve this particular issue as well. I think I could live without AUX but I believe there were people who found this distinction important (the tag was not present in the original Google universal tag set but it was added shortly before the UD project started).
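The point about the passive/copula ambiguity can be illustrated with a small Python sketch. It annotates a form of "be" under each reading and reports which annotation layers differ, once with copulas tagged VERB and once with copulas tagged AUX. The function names and the (form, UPOS, deprel) triple are hypothetical, chosen only for this illustration.

```python
# Each annotation is a (form, UPOS, deprel) triple for a form of "be".
# The triple layout and function names are assumptions for illustration.

def annotate_be(form, reading, copula_tag):
    """Annotate a form of 'be' under the 'passive' or 'copula' reading."""
    if reading == "passive":
        # Passive auxiliary: tagged AUX under either policy.
        return (form, "AUX", "auxpass")
    if reading == "copula":
        # The copula's tag is the policy under discussion (VERB or AUX).
        return (form, copula_tag, "cop")
    raise ValueError(f"unknown reading: {reading}")


def differing_fields(copula_tag):
    """List the annotation layers that change between the two readings."""
    passive = annotate_be("was", "passive", copula_tag)
    copula = annotate_be("was", "copula", copula_tag)
    names = ("form", "upos", "deprel")
    return [name for name, a, b in zip(names, passive, copula) if a != b]


print(differing_fields("VERB"))  # copulas tagged VERB
print(differing_fields("AUX"))   # copulas tagged AUX
```

With copulas tagged VERB, an annotator's choice between the two readings changes both the tag and the relation. With copulas tagged AUX, only the relation changes.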

jnivre commented 8 years ago

I would definitely rather remove the AUX tag than add a COP tag. The AUX/VERB distinction is great for parsing, but only if the tagger gets it right. For Swedish there is a 5 percent absolute difference in parsing accuracy between using gold and predicted tags, which is due almost exclusively to the AUX/VERB distinction.

nschneid commented 8 years ago

> For Swedish there is a 5 percent absolute difference in parsing accuracy between using gold and predicted tags, which is due almost exclusively to the AUX/VERB distinction.

What happens if AUX and VERB are collapsed into one tag—is the parser able to learn the distinction?

jnivre commented 8 years ago

Sort of but not quite. When you go from gold to predicted tags with the AUX/VERB distinction, you lose 4-5 percentage points. If you then go to predicted tags without the distinction, you lose another 0.5-1 points (if I remember correctly). Then again, this is only evidence from a single language. It would be interesting to know what happens in other languages.

nschneid commented 8 years ago

Interesting. Is the POS tagger trained on the same data as the parser? If so, it's curious that it can capture the distinction a bit better.

nschneid commented 8 years ago

http://aclweb.org/anthology/W/W16/W16-1202.pdf, Table 6 might be relevant here: I think the first 4 rows are with gold POS, τ_o = original POS, τ_a = ambiguous (collapsed AUX into VERB). Collapsed tags drop performance by 1.1 points in Slovenian and 0.4 points in Czech.
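For readers who want to try the collapsed-tag condition on their own data, the following is a minimal Python sketch of merging AUX into VERB. How the cited paper actually built its τ_a condition is not described in this thread, so the function name and the example tag sequence are assumptions.

```python
def collapse_aux(upos_tags):
    """Map AUX to VERB and leave every other tag unchanged."""
    return ["VERB" if tag == "AUX" else tag for tag in upos_tags]


# Hypothetical tags for "She could have been sick", with the copula
# tagged VERB as in the first release of the guidelines.
original = ["PRON", "AUX", "AUX", "VERB", "ADJ"]
collapsed = collapse_aux(original)
print(collapsed)
```

After collapsing, a parser has to recover the auxiliary/main-verb distinction from word forms and context alone.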

jnivre commented 8 years ago

Yes, they are with gold tags.

For Swedish POS tagging, the situation is complex. Most Swedish taggers are trained on the Stockholm-Umeå Corpus, which is 10 times larger than the treebank (1M vs. 100K tokens), but this corpus does not make the AUX/VERB distinction. So we have to train a second tagger on the treebank itself, but we only trust the second tagger for the AUX/VERB distinction (and only apply it if the first tagger has tagged a word as VERB). To further improve tagging accuracy for this distinction, we apply a few hand-crafted heuristics to the output of both taggers, which rely on the fact that Swedish syntax pretty much always requires the main verb to go after the auxiliary or auxiliaries. With this combined system, we achieve over 90% accuracy on the AUX/VERB distinction, but we still observe a 4-5 percent drop compared to gold tags, showing that the few cases it gets wrong lead to really bad parses (essentially because the root dependency is wrong and so many other dependencies depend on this). Finally, you have to remember that these are results for a greedy transition-based parser. Other parsers might behave differently.
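The two-tagger setup described here can be sketched in Python as follows. This is only an illustration under stated assumptions: the actual hand-crafted heuristics are not given in the thread, so `apply_order_heuristic` is a hypothetical stand-in that treats the last token in a run of adjacent verbal tokens as the main verb. All function names are invented for this sketch.

```python
VERBAL = ("AUX", "VERB")


def combine_tags(coarse_tags, treebank_tags):
    """Trust the treebank tagger only where the coarse tagger says VERB."""
    combined = []
    for coarse, fine in zip(coarse_tags, treebank_tags):
        if coarse == "VERB" and fine in VERBAL:
            combined.append(fine)    # AUX/VERB decision from the second tagger
        else:
            combined.append(coarse)  # everything else from the first tagger
    return combined


def apply_order_heuristic(tags):
    """Hypothetical heuristic: in a run of two or more adjacent verbal
    tokens, the last one is the main verb and the earlier ones are
    auxiliaries. Real heuristics would need to handle intervening words."""
    result = list(tags)
    i = 0
    while i < len(result):
        if result[i] not in VERBAL:
            i += 1
            continue
        j = i
        while j + 1 < len(result) and result[j + 1] in VERBAL:
            j += 1
        if j > i:  # runs of length one are left as tagged
            for k in range(i, j):
                result[k] = "AUX"
            result[j] = "VERB"
        i = j + 1
    return result


coarse = ["PRON", "VERB", "VERB", "VERB"]   # first tagger: no AUX tag at all
treebank = ["PRON", "VERB", "AUX", "AUX"]   # second tagger, partly wrong
final = apply_order_heuristic(combine_tags(coarse, treebank))
print(final)
```

The design keeps the larger corpus in charge of every decision except the one it cannot make, and the word-order heuristic then repairs verb chains that the smaller tagger got wrong.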

By the way, I think we are digressing from the original issue, so if you want to continue discussing tagging and parsing I suggest we go offline. :)

amir-zeldes commented 8 years ago

Regarding the tag issue for copulas, I'd like to point out that in many languages they are not verbs at all:

The list goes on... I think the cop deprel captures what all of these do quite well, but categorizing them all as a POS tag VERB will probably be less than ideal for many of the cases.

jnivre commented 8 years ago

Good point. They should obviously not be tagged as either VERB or AUX if they are not verbs at all. But this doesn't resolve the issue of what to do with languages where they are verbs.

fginter commented 8 years ago

If you go offline don't drop @fginter :)

nschneid commented 7 years ago

Is there a more natural name we could use if AUX is broadened to include copular verbs? I don't necessarily object to giving them the same tag, I just worry that the name AUX will confuse people if it includes the copula, which is traditionally considered a main verb (at least in English grammar). In my mind, "auxiliary" means a verbal word that accompanies the main verb.

"Function verb" is the best term I can think of, but I bet somebody else can do better.

spyysalo commented 7 years ago

Closing as there is no recent activity and the v2 guidelines are now being published. Please consider opening a new issue with reference to the new guidelines and this discussion if there are open questions relating to this issue.