fginter closed this issue 9 years ago
Yes, we have over 4000 trees in Czech where the root has more than one child (i.e. several words are attached directly to the root, with the "root" relation). Although I cannot guarantee that no weird stuff is hidden there, most of them are elliptical sentences with a deleted main verb. The PDT analysis attaches the orphaned verb dependents directly to the (artificial) root node in such cases. I do not think we have a better solution in UD, do we?
Example: Co na to MF = "What [does] MF [say] to it"
I have added a section on ellipsis in the Czech documentation and mentioned this example of multi-top-node explicitly there.
http://universaldependencies.github.io/docs/cs/overview/specific-syntax.html#ellipsis
I may change it in future if we find a better solution. I also added a section on ellipsis in the universal documentation of specific syntax but I omitted this issue there because there apparently is not a consensus (yet – but I wonder how people solve similar situations in other languages).
http://universaldependencies.github.io/docs/u/overview/specific-syntax.html#ellipsis
Quite a few of the nulls in the TDT corpus were also fragments lacking a main verb. When converting the treebank to create the UD Finnish corpus, we created single-rooted trees by selecting one of the dependents of the null main verb (using POS- and dependency-based heuristics), making it the root, and hanging the rest on it before removing the null.
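The conversion described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Turku conversion code: the priority list `HYPOTHETICAL_PRIORITY`, the function name `remove_null`, the dict-based token representation, and the example annotation of "Co na to MF" are all made up for the sketch, and the real heuristics also looked at POS tags.

```python
# Hypothetical ranking of relations; the real POS- and dependency-based
# heuristics used for UD Finnish are not given in this thread.
HYPOTHETICAL_PRIORITY = ["nsubj", "dobj", "nmod", "advmod"]


def remove_null(tokens, null_id, priority=HYPOTHETICAL_PRIORITY):
    """Promote one dependent of the null node and drop the null node.

    tokens: list of dicts with "id", "form", "head", "deprel".
    Token ids are left unchanged (no renumbering) to keep the sketch short.
    """
    null = next(t for t in tokens if t["id"] == null_id)
    dependents = [t for t in tokens if t["head"] == null_id]
    if not dependents:
        raise ValueError("null node has no dependents to promote")

    def rank(token):
        rel = token["deprel"]
        return priority.index(rel) if rel in priority else len(priority)

    # min() returns the first best-ranked dependent in sentence order.
    promoted = min(dependents, key=rank)

    result = []
    for token in tokens:
        if token["id"] == null_id:
            continue  # remove the null node itself
        token = dict(token)
        if token["id"] == promoted["id"]:
            # The promoted word takes over the null node's attachment.
            token["head"], token["deprel"] = null["head"], null["deprel"]
        elif token["head"] == null_id:
            # The remaining orphans hang on the promoted word.
            token["head"] = promoted["id"]
        result.append(token)
    return result


# Illustrative annotation only, with an explicit null verb as token 5.
sentence = [
    {"id": 1, "form": "Co", "head": 5, "deprel": "dobj"},
    {"id": 2, "form": "na", "head": 3, "deprel": "case"},
    {"id": 3, "form": "to", "head": 5, "deprel": "nmod"},
    {"id": 4, "form": "MF", "head": 5, "deprel": "nsubj"},
    {"id": 5, "form": "NULL", "head": 0, "deprel": "root"},
]
converted = remove_null(sentence, null_id=5)
```

Note that the promoted word loses its original relation here, so the step is not reversible without extra bookkeeping.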
I think the important question here is whether or not it is a format violation to have multiple roots. If yes, any way of resolving these is by definition an improvement over invalid data.
In https://github.com/UniversalDependencies/tools/issues/3, @manning commented
Yes, I think that single-rooted should be the default. My belief is that that is how UD is meant to work....
I seem to recall earlier discussions with @jnivre supporting this idea. Joakim, should valid UD trees be able to have multiple roots?
As a matter of fact, I used to think they should but I have changed my mind and think we should enforce the single-root constraint across the board. The main assumption in UD is that the root word is a predicate (or a nominal in the case of bare nominals). If the real predicate (or nominal) has been elided, the "closest in rank" has to be promoted to take its place. This is a principle that we apply in many other situations, so I don't see why it shouldn't be used to eliminate multiple roots as well.
Thank you for the clarification! If there's agreement on this constraint, the documentation should probably be updated to state it clearly (at least http://universaldependencies.github.io/docs/format.html doesn't appear to mention the constraint at the moment).
We might also consider separately contacting the authors of the UD corpora that violate this constraint (from above: cs, fr, de, hu, es, sv), as the validator was recently changed to enforce this by default.
@jnivre: But we do not apply this principle in "Peter won silver and Jane gold." Now suppose that "And Jane gold." occurs as a separate "sentence". If the rules force us to promote one of the arguments to the head position, what do we do? Subject? Object?
I am starting to think we should consider the NULL nodes for some future version of the UD standard. And leave it for heuristics to get rid of them if people prefer not having them, much as we do with the two-level tokenization.
On the other hand, we could just as well technically promote one of the orphans (maybe always the first one) and then define a language-specific relation between the promoted orphan and the other ones (root:root? :-) or maybe discourse:root or something, so that backing off to the universal relations does not yield root). Then it will be a perfectly reversible transformation, and people can rely on the root having just one child node. (It will even start to make sense to call the root-labeled node "the root". So far I did not feel OK with that term, as the artificial node indexed 0 is the actual root.)
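A minimal sketch of this reversible transformation, under assumptions: the function names `promote_first_orphan` and `restore_orphans` are made up, tokens are reduced to `(id, head, deprel)` triples, and `discourse:root` is just one of the candidate labels floated above, not an agreed relation. Reversibility holds only if every original top-level node is labeled `root` and the special label is used nowhere else.

```python
SPECIAL_LABEL = "discourse:root"  # hypothetical language-specific relation


def promote_first_orphan(tokens, label=SPECIAL_LABEL):
    """Keep the first top-level node as root; attach the others to it."""
    top_ids = [tid for tid, head, _ in tokens if head == 0]
    if len(top_ids) <= 1:
        return list(tokens)  # already single-rooted
    first = top_ids[0]
    return [
        (tid, first, label) if head == 0 and tid != first else (tid, head, rel)
        for tid, head, rel in tokens
    ]


def restore_orphans(tokens, label=SPECIAL_LABEL):
    """Undo promote_first_orphan: re-attach specially labeled nodes to node 0."""
    return [
        (tid, 0, "root") if rel == label else (tid, head, rel)
        for tid, head, rel in tokens
    ]


# Illustrative multi-root analysis of "Co na to MF" (ids 1-4).
multi_root = [(1, 0, "root"), (2, 3, "case"), (3, 0, "root"), (4, 0, "root")]
single_root = promote_first_orphan(multi_root)
round_trip = restore_orphans(single_root)
```

Because the label subtypes `discourse` rather than `root`, stripping the subtype leaves exactly one `root` relation per tree.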
This is definitely worth thinking about (as are NULL nodes), but it will have to be for version 2 of the guidelines. For now, we have to live with the relations we have and either enforce the single-root constraint or not.
If it is in the language-specific documentation, I believe I can do it even now (the special relation). Until yesterday the documentation of the Czech relations was not even marked as "complete draft".
On the other hand, the single-root constraint has not been part of the frozen 1.0 universal standard, so it is questionable whether we can actually enforce it now.
I am fine either way, it would be just a slight change in my conversion tool, and a paragraph in the Czech documentation. But obviously for version 2 I would like to coordinate with others so that we use the same solution if possible.
You are right. It can be done in the language-specific documentation but I would advise against doing a quick solution here. Then I would probably prefer not to enforce the single-root constraint for this release on the grounds that it wasn't explicitly part of the v1 guidelines (although I think many people implicitly assumed that it was).
FWIW, we (Turku) assumed this constraint held when creating the v1 UD Finnish.
Looking at the stats of the current treebanks, the vast majority passes or almost-passes the single-root requirement. So I would tend towards standardizing it, even if it may not happen in time for the May15 release. I would also expect many parsers to make this assumption and break if it is violated by the data, so this would probably also be a friendly thing for the tool developers.
@fginter: thanks for injecting data into the discussion! Could you maybe quantify "the vast majority passes or almost-passes"? (Which treebanks don't?)
As of right now, among the languages which have any data, 7 pass fully and a further 3 have only a small number of deviations (fewer than 100 trees in total). So that is 10 passing or nearly passing. Two others fail, with a much larger number of trees. The rest of the repos contain no verifiable data.
Dan makes some good arguments in this thread! I've historically been pro-single root and anti-empty elements. I've felt that this was just right for a simple analytic framework for regular human beings. Maybe after spending enough time in linguistics seeing sentence analyses with more empty elements than overt words, you seek solid ground in the opposite extreme.
The "remnant" relation was an attempt to handle predicate ellipsis without empty elements (which most frameworks assume in one way or another). It works fairly well for the more common case of ellipsis in conjoined sentences. But Dan is right to note that it doesn't give an easy answer to separate elliptical sentences. And indeed, the choices readily available would make their analysis non-parallel to the treatment of ellipsis in conjoined sentences. Clearly this still involves more thought.
I do think we should do a version 2 at some point (maybe after 1.2?). While the cost of revising the standard is considerable, I think it would be head-in-the-sand to believe that we did everything so well in version 1 that there is nothing that could be done better in a revised version. And since making treebanks more consistent in their choices of analyses will also require ongoing work by all contributors, this revision would pretty much be part of that process, only more "compulsory" -- since the set of allowed dependencies would likely change.
Otherwise, for the moment, I'll move this item to milestone version 1.2....
I agree on both counts. The treatment of ellipsis and the way it interacts with single-rootedness and perhaps even empty categories is worth revisiting. And the work on making treebanks more consistent could lead naturally to an improved second version of the guidelines. There is a fine line between adding more specific guidelines for areas that weren't covered by version 1 and where different teams have therefore made different choices and simply changing parts of version 1 in the interest of making the annotations more consistent or more adequate.
Although the Uppsala meeting has not found a widely acceptable solution to ellipsis, the original issue of this thread has been clearly decided: multi-root trees are disallowed even if it means that top-level ellipsis must be annotated using promotion. AFAIK this rule is now followed in all treebanks of UD 1.2, so I'm closing the issue.
Just a note. As per the discussion in https://github.com/UniversalDependencies/tools/issues/3 the default now is to validate that every tree is single-rooted. On the current github data: Pass: en, fi, ga, it. Fail: cs, fr, de, hu, es, sv.
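For readers who want to run a comparable check on their own data, here is a rough sketch of a single-root test over CoNLL-U text. It is not the code of the official validator; the function names and the sample rows are made up for illustration, and the only assumption about the format is that HEAD is the seventh tab-separated column and that lines with non-integer IDs (such as multiword token ranges) are skipped.

```python
def count_roots(sentence_block):
    """Count the words attached to the artificial node 0 in one sentence."""
    roots = 0
    for line in sentence_block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or sentence-level comment
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # e.g. a multiword token range such as "1-2"
        if cols[6] == "0":  # HEAD column
            roots += 1
    return roots


def multi_root_indices(conllu_text):
    """Return 0-based indices of sentences that are not single-rooted."""
    blocks = [b for b in conllu_text.strip().split("\n\n") if b.strip()]
    return [i for i, block in enumerate(blocks) if count_roots(block) != 1]


def row(token_id, form, head, deprel):
    """Build a 10-column row with unused fields set to '_' (sample data only)."""
    return "\t".join(
        [str(token_id), form, "_", "_", "_", "_", str(head), deprel, "_", "_"]
    )


# Two illustrative analyses of "Co na to MF": multi-rooted, then promoted.
multi = "\n".join(
    [row(1, "Co", 0, "root"), row(2, "na", 3, "case"),
     row(3, "to", 0, "root"), row(4, "MF", 0, "root")]
)
single = "\n".join(
    [row(1, "Co", 4, "dobj"), row(2, "na", 3, "case"),
     row(3, "to", 4, "nmod"), row(4, "MF", 0, "root")]
)
sample = multi + "\n\n" + single + "\n"
failing = multi_root_indices(sample)
```

Counting the length of `multi_root_indices` per treebank file gives the kind of per-language pass/fail summary quoted above.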