AngledLuffa closed this issue 1 year ago
@dan-zeman could say for sure what the validator enforces, but my understanding is that the MWT device is merely intended to explain the surface string for words that are merged by grammatical rule (such as a contraction) without a space. As such, an empty node could occur between two parts of an MWT, but shouldn't begin or end it: when it says the MWT is indexed with an integer range like 1-2 or 3-5, I take it that excludes 3-5.1. I don't think we need an MWT for a hypothetical contraction where one of the words is missing (in your example, for instance, it could be "it is" just as well as "it's"; a more difficult case would be "wo" understood as truncated "won't", but I would still say it's superfluous to call that an MWT).
I would assume that an empty node at the start of the sentence is 0.1, yes.
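The three ID shapes mentioned above (plain word, MWT range, empty node) can be told apart with a few regular expressions. This is only a minimal sketch: the pattern names and the `classify_id` helper are hypothetical and are not taken from the validator's source. It assumes word indices start at 1, MWT ranges join two integers, and empty nodes use a decimal index whose integer part may be 0.

```python
import re

# Hypothetical patterns for the ID shapes discussed in this thread;
# they are an assumption, not the UD validator's actual code.
WORD_ID = re.compile(r"[1-9][0-9]*")              # e.g. 5
MWT_ID = re.compile(r"[1-9][0-9]*-[1-9][0-9]*")   # e.g. 5-6
EMPTY_ID = re.compile(r"[0-9]+\.[1-9][0-9]*")     # e.g. 0.1 or 5.1


def classify_id(token_id: str) -> str:
    """Return 'word', 'mwt', 'empty', or 'invalid' for a CoNLL-U ID string."""
    if WORD_ID.fullmatch(token_id):
        return "word"
    if MWT_ID.fullmatch(token_id):
        return "mwt"
    if EMPTY_ID.fullmatch(token_id):
        return "empty"
    return "invalid"
```

Under these assumptions, a range such as 3-5.1 matches none of the patterns, while 0.1 is accepted as an empty node before the first word.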
Multiword tokens are things that exist on the surface (and do not directly correspond to nodes). Empty nodes are things that do not exist on the surface. Therefore, empty nodes cannot be part of multiword tokens.
That does not necessarily imply that the linear position of an empty node (which is currently not regulated by the EUD guidelines) cannot fall between two words that are members of a multiword token. I just tested that the validator will not complain about this:
4 regnode _ NOUN _ _ 0 root 0:root _
5-6 MWT _ _ _ _ _ _ _ _
5 x _ NOUN _ _ 4 nmod 4:nmod _
5.1 empty _ NOUN _ _ _ _ 4:nmod _
6 y _ NOUN _ _ 4 nmod 4:nmod _
7 regnode _ NOUN _ _ 4 nmod 4:nmod _
It also does not complain about either of the following two, but I think it should actually allow only the first form and ban the second one.
4 regnode _ NOUN _ _ 0 root 0:root _
4.1 empty _ NOUN _ _ _ _ 4:nmod _
5-6 MWT _ _ _ _ _ _ _ _
5 x _ NOUN _ _ 4 nmod 4:nmod _
6 y _ NOUN _ _ 4 nmod 4:nmod _
7 regnode _ NOUN _ _ 4 nmod 4:nmod _

4 regnode _ NOUN _ _ 0 root 0:root _
5-6 MWT _ _ _ _ _ _ _ _
4.1 empty _ NOUN _ _ _ _ 4:nmod _
5 x _ NOUN _ _ 4 nmod 4:nmod _
6 y _ NOUN _ _ 4 nmod 4:nmod _
7 regnode _ NOUN _ _ 4 nmod 4:nmod _
An empty node at the start of the sentence is 0.1, yes.
Excellent, thanks for all of the clarifications. Can I propose adding these examples, or real ones using real words, to the documentation? I'll make a PR for it myself if someone would point me to the relevant code path(s).
Just click "edit page" at the top!
> It also does not complain about either of the following two, but I think it should actually allow only the first form and ban the second one.
I have modified the validator to disallow the sequence 4 5-6 4.1 5 6 7.
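For illustration, the ordering rule described here could be checked roughly as follows. This is a sketch under stated assumptions, not the validator's implementation: the function name `misplaced_empty_nodes` is hypothetical, and it only looks at the sequence of ID strings in a sentence, flagging an empty node that appears after an MWT range line but belongs before the first word of that range.

```python
def misplaced_empty_nodes(ids):
    """Return IDs of empty nodes placed between an MWT range line and its first word.

    `ids` is the list of ID column values of one sentence, in file order.
    Hypothetical helper; not the UD validator's actual logic.
    """
    problems = []
    pending_start = None  # first word index of an MWT whose words have not begun yet
    for token_id in ids:
        if "-" in token_id:
            # MWT range line such as "5-6": remember where its words start
            pending_start = int(token_id.split("-")[0])
        elif "." in token_id:
            # Empty node such as "4.1": its integer part is the word it follows
            major = int(token_id.split(".")[0])
            if pending_start is not None and major < pending_start:
                problems.append(token_id)
        else:
            # A regular word line: the range line is no longer immediately pending
            pending_start = None
    return problems
```

With the sequences from this thread, `misplaced_empty_nodes(["4", "5-6", "4.1", "5", "6", "7"])` reports 4.1, while the orders 4 4.1 5-6 5 6 7 and 4 5-6 5 5.1 6 7 report nothing.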
Reading over some of the documentation for empty words, such as
https://universaldependencies.org/format.html
https://universaldependencies.org/u/overview/enhanced-syntax.html#ellipsis
I have a couple of questions about corner cases with empty words, especially with MWT. Is it possible for there to be:
"it ['s] a good idea", for example, with "it 's" being an MWT even though "'s" isn't part of the sentence?
"[it's] a good idea", with all of "it's" being empty, marked as an MWT?
Also, what is the index of an empty word at the start of a sentence? 0.1?