Open dan-zeman opened 2 years ago
It might help to consider what the value of such refinements is. Is it to indicate something about the linguistic construction in question that could not be fully represented at the basic level? Is it merely a technical strategy to support enhancement code or annotation workflows? Will such subtypes be a help or hindrance to people writing treebank queries? Assuming people will be training EUD parsers, what would be the implications of saying that a predicted EUD edge is an error because it has the wrong extended subtype (say, nsubj:pass:xsubj instead of nsubj:pass:xsubj:rel)?
In general I agree with the sentiment that if information has been annotated, especially by hand, it would be a shame to lose it. But maybe subtypes are not the best place to put all of this info.
The orders that occur to me for subtypes are alphabetical (so nsubj:pass:rel:xsubj) or else by clause size, with your example :xsubj:rel being inner-outer and :rel:xsubj outer-inner, though that would break for things other than nested clauses.
Anyway, I'm in favor of this, though I wonder somewhat why we need subject vs object distinguished in both the main relation and the subtype.
We use the :sp suffix for secondary predication: https://universaldependencies.org/uk/dep/nsubj-sp.html.
The :rels are just to distinguish enhanced relations from the core ones so they can be blue in brat :). I can just shave them off in our build script for now.
The x's should be :xsubj; that's a bug. I used them as a shorter version in brat but forgot to restore them to :xsubj on export.
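A minimal sketch of what such a clean-up step could look like, assuming the enhanced labels are available as plain strings. The function name and the handling of the :x shorthand are illustrative assumptions, not the actual build script:

```python
def normalize_label(label: str) -> str:
    """Expand the ':x' shorthand to ':xsubj' and drop ':rel' segments."""
    base, *subtypes = label.split(":")
    cleaned = []
    for segment in subtypes:
        if segment == "x":
            # Assumed brat shorthand for the external-subject subtype.
            cleaned.append("xsubj")
        elif segment == "rel":
            # Marker used only to colour enhanced edges; dropped on export.
            continue
        else:
            cleaned.append(segment)
    return ":".join([base, *cleaned])


print(normalize_label("nsubj:pass:x:rel"))
```

Segments that are neither x nor rel, such as pass or sp, pass through unchanged.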
consider what the value of such refinements is
In my view, it is sometimes useful not only to know that a relation is an addition in the enhanced graph (which can be figured out simply by comparing basic and enhanced graph) but also which of the six enhancement types is the reason to add the relation. This can be guessed with heuristics (and I have a script that tries to do so) but the heuristics are not always reliable and they are not trivial, so one can hardly use them in treebank queries. I see several benefits:
nsubj relation from the relative clause to the modified nominal is different from the nsubj relations inherited from the basic tree, as the nominal does not necessarily meet the syntactic requirements of subjecthood (e.g., nominative case).
what would be the implications of saying that a predicted EUD edge is an error because it has the wrong extended subtype
That depends on how the people evaluating the parser want to define the task. They can ignore the extended subtypes if they want to, like we ignored any subtypes in the CoNLL 2017/2018 shared tasks. Dropping existing annotation is always easier than guessing non-existing annotation.
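To illustrate how an evaluation could choose its own granularity, here is a small sketch. The set of enhanced segments is an assumption collected from this thread, and the function names are hypothetical; this is not how any existing evaluation script works:

```python
# Enhanced-layer segments mentioned in this thread (assumed, not official).
ENHANCED_SEGMENTS = {"xsubj", "rel", "relsubj", "relobj", "sp"}


def strip_enhanced(label: str) -> str:
    """Drop only the enhanced-layer segments, keep basic subtypes."""
    base, *subtypes = label.split(":")
    kept = [s for s in subtypes if s not in ENHANCED_SEGMENTS]
    return ":".join([base, *kept])


def strip_all(label: str) -> str:
    """Drop every subtype and keep only the main relation type."""
    return label.split(":")[0]


def labels_match(gold: str, predicted: str, strip=strip_enhanced) -> bool:
    """Compare two labels after applying the chosen stripping function."""
    return strip(gold) == strip(predicted)


print(labels_match("nsubj:pass:xsubj:rel", "nsubj:pass:xsubj"))
```

Passing strip=strip_all gives the coarser comparison in which all subtypes are ignored; passing strip=lambda s: s would require an exact match.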
maybe subtypes are not the best place
Given that it affects individual relations, I can't think of a better place than DEPS, where individual relations are described.
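As a rough sketch of the comparison between the basic and the enhanced graph mentioned earlier, the following reads one CoNLL-U token line and returns the DEPS entries that differ from the basic HEAD/DEPREL pair. The function name and the sample line, which uses the labels proposed in this thread, are hypothetical:

```python
def enhanced_only_edges(conllu_line: str):
    """Return (head, label) pairs from DEPS that are not the basic edge."""
    cols = conllu_line.rstrip("\n").split("\t")
    head, deprel, deps = cols[6], cols[7], cols[8]
    if deps == "_":
        return []
    # Each DEPS item is "head:label"; the label may itself contain colons.
    edges = [tuple(item.split(":", 1)) for item in deps.split("|")]
    # Exact comparison: a basic edge whose enhanced label was augmented
    # (e.g. with case information) would also be reported here.
    return [(h, rel) for h, rel in edges if (h, rel) != (head, deprel)]


# Hypothetical line for "boy" in "This is the boy who wanted to be selected".
line = "4\tboy\tboy\tNOUN\t_\t_\t0\troot\t0:root|6:nsubj:rel|9:nsubj:pass:xsubj:rel\t_"
print(enhanced_only_edges(line))
```

This only tells us that an edge is new, not why it was added, which is the information the proposed subtypes would carry.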
The orders that occur to me for subtypes are alphabetically (so nsubj:pass:rel:xsubj)
The subtypes defined in the basic representation should go first, so if there is nsubj:pass, it should not be interrupted by anything from the enhanced layer, regardless of alphabetical order (incidentally, this holds in the example, but it should also hold for nsubj:zzz if that were defined as a language-specific subtype of nsubj). But the debate about this point only makes sense if there is consensus that such subtypes should be allowed/encouraged/required by the guidelines.
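The "basic subtypes first" rule could be sketched as follows. The fixed order of the enhanced segments (xsubj before rel) is purely an assumption for illustration, since the order has not been settled in this discussion:

```python
# Hypothetical fixed order for enhanced-layer segments.
ENHANCED_ORDER = ["xsubj", "rel"]


def order_subtypes(label: str) -> str:
    """Put basic subtypes first (original order), then enhanced segments."""
    base, *subtypes = label.split(":")
    basic = [s for s in subtypes if s not in ENHANCED_ORDER]
    enhanced = sorted(
        (s for s in subtypes if s in ENHANCED_ORDER),
        key=ENHANCED_ORDER.index,
    )
    return ":".join([base, *basic, *enhanced])


print(order_subtypes("nsubj:rel:pass:xsubj"))
```

Because basic subtypes are recognized by exclusion, a language-specific subtype such as nsubj:zzz stays directly after the main type without having to be listed anywhere.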
why we need subject vs object distinguished in both the main relation and the subtype
We don't. It doesn't matter for xsubj, which is a traditional extension proposed already in the Schuster & Manning (2016) paper, as the extra relations in control/raising constructions are always subjects. But I would prefer rel over relsubj and relobj: first, there is no need to add subj/obj, which is already there, and second, there are more possible :rel relations than just subject and object. It could also be (at least) iobj, obl, nmod.
The enhanced guidelines currently specify only three situations in which the enhanced graph contains additional relation (sub)types that are not allowed in the basic tree:
ref relation for relative pronouns in relative clauses (this is a main type, not a subtype of another relation).
nsubj:xsubj, csubj:xsubj, nsubj:pass:xsubj, csubj:pass:xsubj, … are used for the external subject relation in control/raising constructions.
The third set of subtypes (:xsubj) is useful for recognizing enhanced relations that have been added because of one particular enhancement type. However, the guidelines do not provide similar labels to recognize relations added because of the other enhancement types (the modified nominal in the relative clause, the shared parent or dependent in coordination). Should such subtypes be added to the guidelines?
Even though the guidelines technically do not allow it, some treebanks already have such subtypes. The Dutch treebanks (@gossebouma) have :relsubj and :relobj for nominals modified by a relative clause in which they are coreferential with the subject or object, respectively. The Ukrainian treebank (@msklvsk) seems to use :rel for any relation going from a relative clause back to the nominal; in addition, it also has :sp, supposedly for shared parents in coordination.
Since the validator no longer allows arbitrary enhanced relations to be listed, these extensions became errors and the treebanks slipped to legacy status. Nevertheless, it seems a pity to remove (and lose) these labels; wouldn't it be better to unify the Dutch and the Ukrainian approaches and make the result part of the enhanced guidelines?
Note that there may be relations that result from applying two enhancements in combination. For example, in "This is the boy who wanted to be selected", two enhancement types interact: external subject of a control verb and relative clause. Hence there would be an enhanced relation from selected to boy and its type would be nsubj:pass:xsubj:rel (the guidelines would have to specify the order of the subtype segments).