Subtypes for individual enhancements in enhanced dependencies?

dan-zeman commented 2 years ago

The enhanced guidelines currently specify only three situations in which the enhanced graph contains additional relation (sub)types that are not allowed in the basic tree:

1. The ref relation for relative pronouns in relative clauses (this is a main type, not a subtype of another relation).
2. The “case” enhancements: lemma of adposition or conjunction is added to certain dependency types.
3. nsubj:xsubj, csubj:xsubj, nsubj:pass:xsubj, csubj:pass:xsubj, … – used for the external subject relation in control/raising constructions.

The third set of subtypes (:xsubj) is useful to recognize enhanced relations that have been added because of one particular enhancement type. However, the guidelines do not provide similar labels to recognize relations added because of the other enhancement types (the modified nominal in the relative clause, the shared parent or dependent in coordination). Should such subtypes be added to the guidelines?

Even though the guidelines technically do not allow it, some treebanks already have such subtypes. The Dutch treebanks (@gossebouma) have :relsubj and :relobj for nominals modified by a relative clause in which they are coreferential with the subject or object, respectively. The Ukrainian treebank (@msklvsk) seems to use :rel for any relation going from a relative clause back to the nominal; in addition, it also has :sp, ~~supposedly for shared parents in coordination.~~ As the validator no longer allows listing arbitrary enhanced relations, these extensions became errors and the treebanks slipped to legacy status. Nevertheless, it seems a pity to remove (and lose) these labels; wouldn't it be better to unify the Dutch and the Ukrainian approaches and make the result part of the enhanced guidelines?

Note that there may be relations that result from applying two enhancements in combination. For example, in "This is the boy who wanted to be selected", two enhancement types interact: the external subject of a control verb and the relative clause. Hence there would be an enhanced relation from "selected" to "boy", and its type would be nsubj:pass:xsubj:rel (the guidelines would have to specify the order of the subtype segments).
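To make the combined label concrete, here is a minimal Python sketch of how the DEPS value of "boy" might look and how it could be split into head/label pairs. The DEPS string is hypothetical: it assumes the proposed :rel subtype were adopted and that "boy" is token 4, "wanted" token 6 and "selected" token 9. The helper name `parse_deps` is made up for illustration.

```python
def parse_deps(deps):
    """Split a CoNLL-U DEPS value into (head, label) pairs.

    Heads are kept as strings because empty nodes use IDs such as "8.1".
    """
    if deps == "_":
        return []
    pairs = []
    for item in deps.split("|"):
        # Only the first colon separates the head from the relation label;
        # any further colons belong to the label's subtypes.
        head, label = item.split(":", 1)
        pairs.append((head, label))
    return pairs


# Hypothetical DEPS value for "boy" under the proposal discussed here.
deps_boy = "0:root|6:nsubj:rel|9:nsubj:pass:xsubj:rel"
edges = parse_deps(deps_boy)
```

Each label can then be split on ":" again to inspect its individual subtype segments.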

nschneid commented 2 years ago

It might help to consider what the value of such refinements is. Is it to indicate something about the linguistic construction in question that could not be fully represented at the basic level? Is it merely a technical strategy to support enhancement code or annotation workflows? Will such subtypes be a help or hindrance to people writing treebank queries? Assuming people will be training EUD parsers, what would be the implications of saying that a predicted EUD edge is an error because it has the wrong extended subtype (say, nsubj:pass:xsubj instead of nsubj:pass:xsubj:rel)?

In general I agree with the sentiment that if information has been annotated, especially by hand, it would be a shame to lose it. But maybe subtypes are not the best place to put all of this info.

mr-martian commented 2 years ago

The orders that occur to me for subtypes are alphabetical (so nsubj:pass:rel:xsubj) or else by clause size; in your example, :xsubj:rel would be inner-to-outer and :rel:xsubj outer-to-inner, though that would break for things other than nested clauses.

Anyway, I'm in favor of this, though I wonder somewhat why we need subject vs object distinguished in both the main relation and the subtype.

msklvsk commented 2 years ago

We use the :sp suffix for secondary predication: https://universaldependencies.org/uk/dep/nsubj-sp.html. The :rels ~~are just to~~ distinguish enhanced relations from the core ones ~~so they can be blue in brat :)~~. I can just shave them off in our build script for now. The xes must be :xsubj, that’s a bug, used them as a shorter version in brat but forgot to restore to :xsubj on export.

dan-zeman commented 2 years ago

> consider what the value of such refinements is

In my view, it is sometimes useful not only to know that a relation is an addition in the enhanced graph (which can be figured out simply by comparing basic and enhanced graph) but also which of the six enhancement types is the reason to add the relation. This can be guessed with heuristics (and I have a script that tries to do so) but the heuristics are not always reliable and they are not trivial, so one can hardly use them in treebank queries. I see several benefits:

We could generate statistics of enhancements of individual types.
We could query the treebanks and look for enhancements of a particular type.
The validator could operate more efficiently on the enhanced graphs. For example, it could flag relations that are added for no obvious reason, so perhaps they are a mistake.
In the IWPT EUD parsing shared tasks, we wanted to be able to distinguish individual enhancements in the evaluation.
Occasionally there are linguistic implications, too. For example, an nsubj relation from the relative clause to the modified nominal is different from the nsubj relations inherited from the basic tree, as the nominal does not necessarily meet the syntactic requirements of subjecthood (e.g., nominative case).
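As an illustration of the first two benefits, the following Python sketch counts added enhanced relations by their enhancement markers. It is only a sketch under assumptions: the marker set {xsubj, rel} and the example rows are hypothetical, tokens are given as (ID, HEAD, DEPREL, DEPS) tuples of strings, and anything added without a marker (for instance ref or a case-enhanced label) is lumped together as "unmarked".

```python
from collections import Counter

# Assumed set of enhancement markers; only :xsubj exists in the guidelines today.
ENHANCEMENT_MARKERS = {"xsubj", "rel"}


def enhancement_stats(tokens):
    """Count enhanced edges that differ from the basic edge, keyed by marker."""
    counts = Counter()
    for tok_id, head, deprel, deps in tokens:
        if deps == "_":
            continue
        for item in deps.split("|"):
            ehead, elabel = item.split(":", 1)
            if ehead == head and elabel == deprel:
                continue  # identical to the basic relation, not an addition
            markers = [s for s in elabel.split(":")[1:] if s in ENHANCEMENT_MARKERS]
            counts["+".join(markers) if markers else "unmarked"] += 1
    return counts


# Hypothetical rows from "This is the boy who wanted to be selected".
rows = [
    ("4", "0", "root", "0:root|6:nsubj:rel|9:nsubj:pass:xsubj:rel"),
    ("5", "6", "nsubj", "4:ref"),
    ("9", "6", "xcomp", "6:xcomp"),
]
stats = enhancement_stats(rows)
```

Without the markers, the "unmarked" bucket is where heuristics would have to take over.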

> what would be the implications of saying that a predicted EUD edge is an error because it has the wrong extended subtype

That depends on how the people evaluating the parser want to define the task. They can ignore the extended subtypes if they want to, like we ignored any subtypes in the CoNLL 2017/2018 shared tasks. Dropping existing annotation is always easier than guessing non-existing annotation.
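A rough Python sketch of what "dropping" could look like in an evaluation script; the function names and the marker list are assumptions for illustration, not part of any existing evaluation tool.

```python
def strip_markers(label, markers=("xsubj", "rel")):
    """Remove the given enhancement markers from a label, keeping other subtypes."""
    parts = label.split(":")
    kept = [parts[0]] + [p for p in parts[1:] if p not in markers]
    return ":".join(kept)


def strip_all_subtypes(label):
    """Keep only the main relation type and drop every subtype."""
    return label.split(":", 1)[0]


gold = "nsubj:pass:xsubj:rel"
predicted = "nsubj:pass:xsubj"

# Ignoring only the hypothetical :rel marker makes the two labels match.
same_without_rel = strip_markers(gold, ("rel",)) == strip_markers(predicted, ("rel",))
```

An evaluator that does not care about the extended subtypes would compare the stripped labels; one that does would compare the labels as they are.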

> maybe subtypes are not the best place

Given that it affects individual relations, I can't think of a better place than DEPS, where individual relations are described.

> The orders that occur to me for subtypes are alphabetically (so nsubj:pass:rel:xsubj)

The subtypes defined in the basic representation should go first, so if there is nsubj:pass, it should not be interrupted by anything from the enhanced layer, regardless of alphabetical order (incidentally, this holds in this example, but it should also hold for nsubj:zzz if that were defined as a language-specific subtype of nsubj). But the debate about this point only makes sense if there is consensus that such subtypes should be allowed/encouraged/required by the guidelines.
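A small Python sketch of that constraint: the basic label is kept intact and the enhancement markers are appended after it. The relative order of the markers themselves (here xsubj before rel, matching the nsubj:pass:xsubj:rel example) is purely an assumption, since it has not been decided; `compose_label` and `ENHANCED_ORDER` are hypothetical names.

```python
# Assumed order of enhancement markers; the guidelines would have to fix this.
ENHANCED_ORDER = ("xsubj", "rel")


def compose_label(basic_label, markers):
    """Append enhancement markers to a basic label without interrupting it."""
    unknown = set(markers) - set(ENHANCED_ORDER)
    if unknown:
        raise ValueError(f"unknown enhancement markers: {sorted(unknown)}")
    # The basic label (with its own subtypes) always comes first.
    ordered = [m for m in ENHANCED_ORDER if m in markers]
    return ":".join([basic_label] + ordered)


label = compose_label("nsubj:pass", {"rel", "xsubj"})
```

With this scheme a language-specific label such as nsubj:zzz would yield nsubj:zzz:rel, never nsubj:rel:zzz.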

> why we need subject vs object distinguished in both the main relation and the subtype

We don't. It doesn't matter for xsubj, which is a traditional extension already proposed in the Schuster & Manning (2016) paper, as the extra relations in control/raising constructions are always subjects. But I would prefer rel over relsubj and relobj: first, there is no need to add subj/obj, which is already expressed by the main relation, and second, there are more possible :rel relations than just subject and object. It could also be (at least) iobj, obl, nmod.

UniversalDependencies / docs

Subtypes for individual enhancements in enhanced dependencies? #873