UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Can there be empty multiword tokens? #932

Closed AngledLuffa closed 1 year ago

AngledLuffa commented 1 year ago

Reading over some of the documentation for empty words, such as

https://universaldependencies.org/format.html

https://universaldependencies.org/u/overview/enhanced-syntax.html#ellipsis

I have a couple questions about corner cases with empty words, especially with MWT. Is it possible for there to be

Also, what is the index of an empty word at the start of a sentence? 0.1?

nschneid commented 1 year ago

@dan-zeman could say for sure what the validator enforces, but

  1. My understanding is that the MWT device is merely intended to explain the surface string for words that are merged by grammatical rule (such as a contraction) without a space. As such, an empty node could occur between two parts of an MWT, but shouldn't begin or end it—when it says the MWT is indexed with an integer range like 1-2 or 3-5, I take it that excludes 3-5.1. I don't think we need an MWT for a hypothetical contraction where one of the words is missing (in your example, for instance, it could be "it is" just as well as "it's"; a more difficult case would be "wo" understood as truncated "won't", but I would still say it's superfluous to call that an MWT).

  2. I would assume that an empty node at the start of the sentence is 0.1, yes.

dan-zeman commented 1 year ago

Multiword tokens are things that exist on the surface (and do not directly correspond to nodes). Empty nodes are things that do not exist on the surface. Therefore, empty nodes cannot be part of multiword tokens.
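To make the distinction concrete, here is a minimal sketch (not the validator's actual code) that classifies a CoNLL-U ID value as a regular word, a multiword token range, or an empty node. The pattern names and the `classify_id` helper are hypothetical and chosen only for illustration. The point is that a range is made of two integers, so something like 3-5.1 is not a valid multiword token ID.

```python
import re

# Hypothetical patterns for the three kinds of ID values discussed in this thread.
WORD_ID = re.compile(r"[1-9][0-9]*")              # e.g. 5
MWT_ID = re.compile(r"[1-9][0-9]*-[1-9][0-9]*")   # e.g. 5-6 (integer range only)
EMPTY_ID = re.compile(r"[0-9]+\.[1-9][0-9]*")     # e.g. 5.1, or 0.1 at sentence start


def classify_id(id_value):
    """Return 'word', 'mwt', 'empty', or 'invalid' for a CoNLL-U ID string."""
    if WORD_ID.fullmatch(id_value):
        return "word"
    if MWT_ID.fullmatch(id_value):
        return "mwt"
    if EMPTY_ID.fullmatch(id_value):
        return "empty"
    return "invalid"
```

Under these assumptions, `classify_id("3-5")` gives `"mwt"`, `classify_id("0.1")` gives `"empty"`, and `classify_id("3-5.1")` gives `"invalid"`.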

That does not necessarily imply that the linear position of an empty node (which is currently not regulated by the EUD guidelines) cannot fall between two words that are members of a multiword token. I just tested that the validator will not complain about this:

4   regnode _   NOUN    _   _   0   root    0:root  _
5-6 MWT _   _   _   _   _   _   _   _
5   x   _   NOUN    _   _   4   nmod    4:nmod  _
5.1 empty   _   NOUN    _   _   _   _   4:nmod  _
6   y   _   NOUN    _   _   4   nmod    4:nmod  _
7   regnode _   NOUN    _   _   4   nmod    4:nmod  _

It also does not complain about either of the following two, but I think it should actually allow only the first form and ban the second one.

4   regnode _   NOUN    _   _   0   root    0:root  _
4.1 empty   _   NOUN    _   _   _   _   4:nmod  _
5-6 MWT _   _   _   _   _   _   _   _
5   x   _   NOUN    _   _   4   nmod    4:nmod  _
6   y   _   NOUN    _   _   4   nmod    4:nmod  _
7   regnode _   NOUN    _   _   4   nmod    4:nmod  _

4   regnode _   NOUN    _   _   0   root    0:root  _
5-6 MWT _   _   _   _   _   _   _   _
4.1 empty   _   NOUN    _   _   _   _   4:nmod  _
5   x   _   NOUN    _   _   4   nmod    4:nmod  _
6   y   _   NOUN    _   _   4   nmod    4:nmod  _
7   regnode _   NOUN    _   _   4   nmod    4:nmod  _

An empty node at the start of the sentence is 0.1, yes.

AngledLuffa commented 1 year ago

Excellent, thanks for all of the clarifications. Can I propose adding these examples, or real ones using real words, to the documentation? I'll make a PR for it myself if someone would point me to the relevant code path(s).

nschneid commented 1 year ago

Just click "edit page" at the top!

dan-zeman commented 1 year ago

> It also does not complain about either of the following two, but I think it should actually allow only the first form and ban the second one.

I have modified the validator to disallow the sequence 4 5-6 4.1 5 6 7.
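For readers who want to check their own files, the ordering rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the validator's implementation: the helper names `sort_key` and `ids_in_order` are hypothetical, and the rule encoded is simply that an empty node i.j comes after word i and before any multiword token range that starts at i+1, while a range line comes immediately before its first word.

```python
def sort_key(id_value):
    """Map a CoNLL-U ID string to a tuple that reflects the expected line order."""
    if "-" in id_value:
        # Multiword token range a-b: placed just before word a.
        start, _end = id_value.split("-")
        return (int(start), 0, 0)
    if "." in id_value:
        # Empty node i.j: placed after word i, ordered by j.
        major, minor = id_value.split(".")
        return (int(major), 2, int(minor))
    # Regular word i.
    return (int(id_value), 1, 0)


def ids_in_order(ids):
    """Return True if the ID column of a sentence is in the expected order."""
    keys = [sort_key(i) for i in ids]
    return all(a < b for a, b in zip(keys, keys[1:]))
```

With this sketch, the sequences 4 4.1 5-6 5 6 7 and 4 5-6 5 5.1 6 7 are accepted, while 4 5-6 4.1 5 6 7 is rejected.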

dan-zeman commented 1 year ago

Documented.