UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Validation for the "node has more than one subject" constraint #657

Closed olesar closed 4 years ago

olesar commented 5 years ago

In sentences like The only answer is 'Nobody knows'., in which the head of the reported speech is the root of the sentence, both 'answer' and 'nobody' are considered subjects: nsubj(knows, Nobody) nsubj(knows, answer)

This double subject construction (also with named entities that have internal syntactic structure) is attested in many languages but is treated as a 'Violation of guidelines' by the current validation.py script. See also @amir-zeldes's example: It's not [that I can't go] nsubj(go, I) nsubj(go, it)
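For readers unfamiliar with the check, the constraint can be approximated as follows. This is a minimal sketch, not the code of the actual validator: the function name `heads_with_multiple_subjects`, the triple representation, and the choice to count both nsubj and csubj are assumptions made for illustration.

```python
from collections import defaultdict

# Relations counted as subjects in this sketch (an assumption). Subtypes such
# as nsubj:pass are reduced to their universal part before the comparison.
SUBJECT_RELATIONS = {"nsubj", "csubj"}


def heads_with_multiple_subjects(dependencies):
    """Return {head: [subject dependents]} for heads with more than one subject.

    `dependencies` is an iterable of (relation, head, dependent) triples,
    mirroring the relation(head, dependent) notation. Word forms stand in for
    node ids here, which only works for short toy examples.
    """
    subjects = defaultdict(list)
    for relation, head, dependent in dependencies:
        universal = relation.split(":", 1)[0]
        if universal in SUBJECT_RELATIONS:
            subjects[head].append(dependent)
    return {head: deps for head, deps in subjects.items() if len(deps) > 1}


# "It's not [that I can't go]" with both subjects attached to "go".
example = [("nsubj", "go", "I"), ("nsubj", "go", "it")]
print(heads_with_multiple_subjects(example))
```

Any head returned by this function is what the validator currently reports as a violation.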

jnivre commented 5 years ago

The guidelines say that the copula should be treated as the root when the predicate (in this case "nobody knows") is itself a clause. The motivation is precisely to avoid having two "nsubj" relations. This is not an ideal solution and it may be worth reconsidering in future versions. In any case, this explains why this is flagged as an error by the validator.

olesar commented 5 years ago

Unfortunately, some languages do not have overt copulas. What should one practically do, say, in Russian or Belarusian? Is the subject a root then? I also recollect an example where both subject and predicate were book titles with the verb head.

jnivre commented 5 years ago

Good point, which underlines the questionable nature of this specific guideline. Making the subject the root seems even more questionable, so in these cases we may have to accept two nsubj relations.

nschneid commented 5 years ago

Interestingly, looking at the GUM corpus, the largest proportion of these involve a shell noun ("the question/answer/problem/issue/reason/claim/agreement... is"). (Another chunk involve free relative clauses, but I'm not sure whether the parses are correct.)

The shell noun sentences can be reversed without changing the semantics: "The issue is that the judge erred" = "That the judge erred is the issue"; "The solution is a new trial" = "A new trial is the solution". I wonder if it would make sense to treat the shell noun as the predicate regardless of word order, thus avoiding a double nsubj or having to make the copula the root:

The issue is that the judge erred / That the judge erred is the issue root(issue) cop(issue, is) csubj(issue, erred) nsubj(erred, judge) mark(erred, That)
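As a quick sanity check of this proposal, the relations listed above can be written down as triples and tested for the two properties at stake: each word has exactly one incoming relation, and no head has more than one subject. This is only an illustrative sketch; the helper names are hypothetical, and the determiners are left out because the analysis above does not list them.

```python
from collections import Counter

# The proposed analysis, shared by both word orders, as
# (relation, head, dependent) triples. The root has no head.
analysis = [
    ("root", None, "issue"),
    ("cop", "issue", "is"),
    ("csubj", "issue", "erred"),
    ("nsubj", "erred", "judge"),
    ("mark", "erred", "that"),
]


def subject_counts(dependencies):
    """Count nsubj/csubj dependents per head."""
    return Counter(
        head for relation, head, _ in dependencies
        if relation in {"nsubj", "csubj"}
    )


def every_word_has_one_head(dependencies):
    """True if no word appears as a dependent more than once."""
    incoming = Counter(dependent for _, _, dependent in dependencies)
    return all(count == 1 for count in incoming.values())


print(subject_counts(analysis))
print(every_word_has_one_head(analysis))
```

Under this analysis "issue" and "erred" each receive a single subject, so the double-nsubj check would not fire.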

I don't know how English-specific this phenomenon is and whether that solution would work in other languages.

amir-zeldes commented 5 years ago

+1 for allowing double subject in this construction. Plain dependencies can't really express the nesting structure of the clause internal vs. clausal predicate subjects, so I think this is the most faithful thing we can do. In practice, it's usually not hard to tell which subject is which based on proximity to the verb.

@nschneid - I feel like csubj is putting a lot of potentially non-theory neutral interpretation into the analysis: in English both orders are possible around a copula, and by convention we assume it's 'subject first' (unless there's some agreement evidence to the contrary). This has the added bonus of being able to easily distinguish the two constructions based on dependencies. Making cop the head is less good due to the issue that @olesar pointed out, and I think you'd get the same problem in Semitic languages.

Incidentally, I think the reason we can't get shell-noun agreement evidence to support the nsubj analysis is that we happen not to have shell-nouns that are plurale tantum. For wh-clauses it's possible to show that the order is important, and the first phrase is normally interpreted as a subject, at least in my English:

The pants are what ripped. 
What ripped is the pants.  
? What ripped are the pants.

The last sentence would be the standard rendition in German, and there I would assume a post-verbal subject, but that's not unusual for German anyway.

nschneid commented 5 years ago

Subject-verb agreement between singular and plural NPs in a copular construction is always dicey. "The pants are what ripped" sounds fine to me. "What ripped was/were the pants" sounds better than "What ripped is/are the pants", but still a bit awkward. Also awkward: "My favorite piece of clothing is/are my pants" (I think I'd prefer "is" to "are", but am not sure).

dan-zeman commented 5 years ago

Okay, so how do we modify the validator without allowing multiple subjects everywhere? Is it safe to say that a VERB has at most one subject, and any other part of speech has at most two?
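The rule floated here could look roughly like this. It is a sketch of the proposal as stated in this comment, not of what the validator does; the function names are made up for the example.

```python
def max_subjects_allowed(parent_upos):
    """Proposed limit: one subject under a VERB, two under any other UPOS."""
    return 1 if parent_upos == "VERB" else 2


def violates_subject_limit(parent_upos, subject_count):
    """True if a node with this UPOS has more subjects than the proposal allows."""
    return subject_count > max_subjects_allowed(parent_upos)


print(violates_subject_limit("VERB", 2))
print(violates_subject_limit("NOUN", 2))
print(violates_subject_limit("NOUN", 3))
```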

jnivre commented 5 years ago

Sounds good to me. Ideally, one of the subjects should be subtyped, but since subtypes are never obligatory in v2, we cannot make this part of the validation.

However, before we modify the validator, we need to agree on exactly how the guidelines should be amended. It seems straightforward to say that the current guidelines (which make the copula the root if the predicate is a clausal structure) cannot be applied to languages/sentences with no overt copula. But allowing it also in cases where there is an overt copula amounts to a change of the guidelines, which (under our current policies) cannot happen until v3.

olesar commented 5 years ago

I would not turn off the validation of the double subject trees entirely but rather use a warning mode along with 'errors'. In a number of treebanks, it would help to correct erroneous cases such as depictive sentences with two Nominative-like dependencies of the verb.
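One way to picture such a warning mode is a severity function instead of a yes/no check. The thresholds below (two subjects give a warning, three or more an error) are purely an assumption for illustration; the comment above does not specify them.

```python
def subject_count_severity(subject_count):
    """Map the number of subjects on a node to a hypothetical report level."""
    if subject_count <= 1:
        return "ok"
    if subject_count == 2:
        # Possibly legitimate (clausal predicate), possibly an annotation
        # error such as a depictive wrongly attached as a second subject.
        return "warning"
    return "error"


for count in (1, 2, 3):
    print(count, subject_count_severity(count))
```

A warning would leave the treebank valid while still pointing annotators to the cases worth a manual look.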

dan-zeman commented 5 years ago

@jnivre, I would only extend the guidelines to the point that was not covered by the current documentation, i.e., what to do if there is no overt copula. The guidelines can be stricter than the validator if there is something that the validator cannot check automatically (but it actually can check the presence of the copula). On the other hand, I was probably wrong in claiming that the double-subject situation cannot occur if the parent node is a VERB. The nested clause that functions as a predicate can itself be verbal.

But we could restrict the possibility of having two subjects to certain languages. Languages that always use a copula should never have more than one subject.

@olesar, do you have a Russian, Belarusian or other example that lacks an overt copula and that could be put in the guidelines?
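Combining the two ideas in this comment (a per-language permission and an automatic check for an overt copula) might look like the sketch below. The language set and the function name are hypothetical; Russian and Belarusian are used only because they are the languages mentioned in this thread.

```python
# Hypothetical list of languages in which clauses without an overt copula
# occur, given as ISO 639-1 codes (Russian, Belarusian).
LANGUAGES_ALLOWING_TWO_SUBJECTS = {"ru", "be"}


def double_subject_allowed(language, child_relations):
    """Allow two subjects only in listed languages and only without a cop child.

    `child_relations` holds the deprels of all children of the node in question.
    """
    has_copula = any(rel.split(":", 1)[0] == "cop" for rel in child_relations)
    return language in LANGUAGES_ALLOWING_TWO_SUBJECTS and not has_copula


print(double_subject_allowed("ru", ["nsubj", "nsubj"]))
print(double_subject_allowed("ru", ["nsubj", "nsubj", "cop"]))
print(double_subject_allowed("en", ["nsubj", "nsubj"]))
```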

amir-zeldes commented 5 years ago

@dan-zeman sorry to reopen the discussion but upon thinking about this for a bit, I'm not sure I understand the rationale to recommend a different analysis when there is or isn't a copula. One of the strengths of the UD treatment of copulas as auxiliaries is, in my opinion, the fact that nominal sentences without copulas look the same as ones with them. Why make a distinction when the predicate is clausal? If we tolerate double subjects in some cases, why not make this be one as well? I think it would be more consistent.

As for examples, Hebrew allows this; here are some natural examples I found online with literal translations:

I think you can get this structure in Polish as well:

jnivre commented 5 years ago

The problem is that the recommendation to make the copula the head when the predicate is clausal was adopted as part of v2 of the guidelines, and we are currently operating under the constraint that universal guidelines can only be changed as part of a major version change (which in this case would have to be from v2 to v3). The added guideline for languages without a copula is different, because it clarifies what to do in a case where the main rule does not apply, and this is allowed under the current policies.

It may feel frustrating to not be able to change the guidelines between major versions when you are convinced that something is wrong with the current guidelines (and I for one share your opinion that the guidelines for nominal clauses have serious problems), but the alternative of allowing changes at any time would make it really hard for people to keep up with the guidelines and to keep their data sets consistent with the current guidelines.

amir-zeldes commented 5 years ago

Yes, I can definitely understand that logic. My worry is that, for guidelines that will ultimately change, some new data sets may enter with the 'wrong' structure and then be abandoned without updates, rather than entering with the right structure to begin with. But I admit in this particular case it is probably relatively easy to recognize the construction and convert it automatically if needed.

coltekin commented 5 years ago

I'm joining the discussion after it has cooled down, but I just realized that this is (related to) a problem I have also been struggling with for a long time. So, adding another data point.

Here is an example from Turkish:

Sorun        Alinin      duyamaması
problem      Ali         not-being-able-to-hear
"The problem is (that) Ali cannot hear"

The verbal noun has its subject in the subordinate clause and the subject of the copula out of the subordinate clause. In Turkish treebanks, this problem is mostly circumvented by splitting the copular suffixes, but in some cases, like the one above, the suffix is null and there is nothing to split. So, I do not know of a way to annotate the above sentence (not even an ad hoc solution like marking the copula as the head).

My linguistically naive preference would be to distinguish 'copular subjects' from others either by a subtype or, maybe better, by introducing a new relation, but I would also be happy with a/any general solution to this problem.

dan-zeman commented 5 years ago

The optional relation subtype is nsubj:cop, currently used in Breton, Estonian, Finnish, Hebrew, Sanskrit.
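To show how such a subtype would let a tool tell the two subjects apart, here is a small sketch. The annotation of the Turkish example with nsubj:cop is hypothetical (it is not taken from any treebank), and the helper names are invented for this illustration.

```python
def split_relation(deprel):
    """Split 'nsubj:cop' into ('nsubj', 'cop'); the subtype is '' if absent."""
    universal, _, subtype = deprel.partition(":")
    return universal, subtype


def partition_subjects(children):
    """Separate subjects subtyped :cop from all other subjects.

    `children` is a list of (form, deprel) pairs for one parent node.
    """
    copular, other = [], []
    for form, deprel in children:
        universal, subtype = split_relation(deprel)
        if universal not in {"nsubj", "csubj"}:
            continue
        (copular if subtype == "cop" else other).append(form)
    return copular, other


# Hypothetical annotation: both words depend on "duyamaması".
children = [("Sorun", "nsubj:cop"), ("Alinin", "nsubj")]
print(partition_subjects(children))
```

Since subtypes are optional in v2, a check like this could only inform annotators; it could not be enforced.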

rueter commented 4 years ago

Unfortunately, some languages do not have overt copulas. What should one practically do, say, in Russian or Belarusian? Is the subject a root then? I also recollect an example where both subject and predicate were book titles with the verb head.

In the equation X is Y, the subject is automatically assigned to the first NP. Semantically we might be dealing with some discourse-oriented phenomenon.

(1a) The only answer is 'Nobody knows'. (1b) The only answer is 'that Nobody knows'. (2) 'Nobody knows' is the only answer.

(2) would be nsubj(knows, Nobody) csubj(knows, answer) Or should someone be speaking of parataxis?

@olesar I was actually thinking that the English it should be in an expletive dependency relation. Below is an example I found in reverso (I wasn't expecting the чтобы subjunctor).

https://context.reverso.net/перевод/английский-русский/it+is+not+that

amir-zeldes commented 4 years ago

I'm not sure it's expl in "It's not that I don't like him". If it's expl, then it should mean the same thing if the clause were the subject:

"It is not that I don't like him" = "That I don't like him is not" (this is similar to Polish pseudo-clefts with to: "it's the case that CLAUSE")

I think this is more referential, it usually means something like "The reason is not that I don't like him".

In any case, unless the guidelines for nested clauses are changed, I think some constructions will inevitably require multiple subject nodes.

dan-zeman commented 4 years ago

In the equation X is Y, the subject is automatically assigned to the first NP.

In English, perhaps. Other languages do not follow the English word order rules and are free to decide that the clause is csubj, even if it appears second.

nschneid commented 4 years ago

@amir-zeldes:

I'm not sure it's expl in "It's not that I don't like him". ... I think this is more referential, it usually means something like "The reason is not that I don't like him".

Are we defining expletive "it" in English as the byproduct of the cleft construction, where the subordinate clause can be substituted for "it" as a clausal subject, and nothing else? (related: #461)

I think that would rule out, e.g., the EWT examples "Budgies are breast feeding birds and it may be the male is bisexual" and "as it is with Kerry, ....".

Similarly vague "it" can also appear postverbally ("Did you make it to the party?", "long lines on the weekend but worth it."). These appear in EWT to be inconsistent, sometimes expl, sometimes obj. With "Is it worth it to go through that kind of staff", both "it"s are treated as expl, but only the first corresponds to the cleft, so perhaps the second one should be obj.

amir-zeldes commented 4 years ago

@dan-zeman +1 - just as an example, Coptic has two main copula word orders:

The first one is considered 'canonical', but in our data, the so-called 'postponed subject' is actually more common.

amir-zeldes commented 4 years ago

@nschneid The distinction of expl is more semantic than syntactic for me: I don't really see a difference between "it rained" and "Zeus rained" syntactically, except that "it" is not referential in the first case.

I definitely agree with "it" as expl in "it rains", as it is not referential. I'm not so sure about "it's not that..." because you could interrogate it ("it's not that I don't like him", "what is it then?"), you can use it to pronominalize a noun like "reason" in context, and you can't replace the "it" with a fronted clause in the normal subject position, which is possible for the raising cases.

I don't feel very strongly about it though: this "it" is definitely not a highly referential one, but it's not totally ruled out as referential, as it is in the case of "it rained" (* "what rained?")

nschneid commented 4 years ago

The distinction of expl is more semantic than syntactic for me

This makes me wonder if we shouldn't be treating expletives as semantically-defined subtypes of ordinary grammatical relations like subject and object.

amir-zeldes commented 4 years ago

That seems like a pretty big change, since it would affect a major deprel and therefore potentially almost all treebanks. It would be nice in that you could tell whether it was a subject or object expletive, but my preference is to 'not rock the boat' unless there is a very compelling reason :)

sylvainkahane commented 4 years ago

In SUD, the Surface-Syntactic version of UD, we decided to clearly separate surface-syntactic relations and semantic features. As a consequence, a UD expl such as it in it is possible that he left is encoded as subj@expl in SUD, and the UD csubj between possible and left becomes comp:obj@agent (where comp:obj is a generalisation of ccomp). We have implemented a conversion rule to come back from SUD to the UD annotation. (We also have a rule that converts a UD pair expl-csubj into SUD subj@expl-comp:obj@agent, which is ok for English or French, but could be problematic for other languages.) I agree with @amir-zeldes that it's a pretty big change, but as soon as we have a conversion tool that allows us to come back to the previous version, it is possible to consider such big changes. In particular, if we introduce such big changes in UD v3, we must propose a conversion tool to come back to UD v2 and to be compatible with treebanks that will never switch to v3.
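The label side of this round trip can be pictured with a toy relabeling. This is not the actual SUD conversion grammar: it only renames the two relations named above, ignores every change of head that a real UD/SUD conversion involves, and uses invented helper names.

```python
# Label correspondences taken from the comment above.
UD_TO_SUD = {"expl": "subj@expl", "csubj": "comp:obj@agent"}
SUD_TO_UD = {sud: ud for ud, sud in UD_TO_SUD.items()}


def relabel_pair(dependents, mapping):
    """Relabel the dependents of one head, but only if the whole pair is present.

    `dependents` is a list of (word, relation) pairs. If any source label of
    `mapping` is missing (e.g. an expl without a csubj), nothing is changed.
    """
    labels = {relation for _, relation in dependents}
    if not set(mapping) <= labels:
        return list(dependents)
    return [(word, mapping.get(relation, relation)) for word, relation in dependents]


# Dependents of "possible" in "it is possible that he left" (UD labels).
ud = [("it", "expl"), ("is", "cop"), ("left", "csubj")]
sud = relabel_pair(ud, UD_TO_SUD)
back = relabel_pair(sud, SUD_TO_UD)
print(sud)
print(back == ud)
```

The point of the sketch is the last line: a change of this kind is easier to consider when the mapping can be inverted without loss.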

amir-zeldes commented 4 years ago

Big changes are not just about scripts though: there are also all the students and treebank annotators we teach UD to every year who have learned one version and then discover that some (but probably not all) resources have been changed. It can be very hard to keep track of.

By contrast, everyone knows the PTB tagset for English has a lot of broken guidelines, and IMO even outright bad decisions in several places, but anyone who works with tagged English corpora knows it and knows how to deal with them, because we've learned it and it's stable. I'm not saying we should never change anything, but I think the bar should be high now that UD has hundreds of contributors and over 100 treebanks. Changing expl to nsubj:expl and obj:expl sounds great to me, but someone would actually have to go into all treebanks and disambiguate all expl, which might not be simple in many languages.

KoichiYasuoka commented 4 years ago

I've read your SUD, @sylvainkahane, but I'm unsure how to apply SUD to the predicate-object-final structure of Classical Chinese. Here we consider an example sentence 信斯言也是周無遺民也 from UD_Classical_Chinese-Kyoto.

# sent_id = KR1h0001_009_par4_280-289
# text = 信斯言也是周無遺民也
1       信      信      VERB    v,動詞,行為,態度      _                       7       csubj   _       Gloss=believe|SpaceAfter=No
2       斯      斯      PRON    n,代名詞,指示,*       PronType=Dem            3       det     _       Gloss=this|SpaceAfter=No
3       言      言      NOUN    n,名詞,可搬,伝達      _                       1       obj     _       Gloss=speech|SpaceAfter=No
4       也      也      PART    p,助詞,提示,*         _                       1       mark    _       Gloss=that-which|SpaceAfter=No
5       是      是      PRON    n,代名詞,指示,*       PronType=Dem            7       expl    _       Gloss=this|SpaceAfter=No
6       周      周      PROPN   n,名詞,主体,国名      Case=Loc|NameType=Nat   7       nsubj   _       Gloss=[country-name]|SpaceAfter=No
7       無      無      VERB    v,動詞,存在,存在      Polarity=Neg            0       root    _       Gloss=not-have|SpaceAfter=No
8       遺      遺      VERB    v,動詞,行為,得失      VerbForm=Part           9       amod    _       Gloss=leave-behind|SpaceAfter=No
9       民      民      NOUN    n,名詞,人,人          _                       7       obj     _       Gloss=people|SpaceAfter=No
10      也      也      PART    p,助詞,句末,*         _                       7       discourse:sp    _       Gloss=[final-particle]|SpaceAfter=No


This sentence 信斯言也是周無遺民也 consists of two clauses, 信斯言也 and 是周無遺民也. The second clause 是周無遺民也 has a smaller clause 周無遺民 inside. The first clause 信斯言也 does not have a subject. The second clause 是周無遺民也 is a copular clause such as 是X也 which means "是(this) is X", and 是 is the expletive subject which leads the first clause 信斯言也. Thus in UD, three subjects are linked from the verb 無, nsubj to 周 for the clause 周無遺民, expl to 是 for the clause 是周無遺民也, and csubj to 信.
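The three subject-like dependents of 無 can be read off the annotation mechanically. The sketch below is illustrative only: the helper names are made up, treating expl as "subject-like" follows the wording of this comment rather than any validator rule, and the rows are split on whitespace because the copy in this thread uses spaces where real CoNLL-U files use tabs.

```python
# Token rows copied from the example in this comment (ten columns each).
CONLLU = """\
1 信 信 VERB v,動詞,行為,態度 _ 7 csubj _ Gloss=believe|SpaceAfter=No
2 斯 斯 PRON n,代名詞,指示,* PronType=Dem 3 det _ Gloss=this|SpaceAfter=No
3 言 言 NOUN n,名詞,可搬,伝達 _ 1 obj _ Gloss=speech|SpaceAfter=No
4 也 也 PART p,助詞,提示,* _ 1 mark _ Gloss=that-which|SpaceAfter=No
5 是 是 PRON n,代名詞,指示,* PronType=Dem 7 expl _ Gloss=this|SpaceAfter=No
6 周 周 PROPN n,名詞,主体,国名 Case=Loc|NameType=Nat 7 nsubj _ Gloss=[country-name]|SpaceAfter=No
7 無 無 VERB v,動詞,存在,存在 Polarity=Neg 0 root _ Gloss=not-have|SpaceAfter=No
8 遺 遺 VERB v,動詞,行為,得失 VerbForm=Part 9 amod _ Gloss=leave-behind|SpaceAfter=No
9 民 民 NOUN n,名詞,人,人 _ 7 obj _ Gloss=people|SpaceAfter=No
10 也 也 PART p,助詞,句末,* _ 7 discourse:sp _ Gloss=[final-particle]|SpaceAfter=No
"""

SUBJECT_LIKE = {"nsubj", "csubj", "expl"}


def parse_rows(text):
    """Parse token rows into dicts with id, form, head and deprel."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split()
        if len(columns) != 10:
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        tokens.append({
            "id": int(columns[0]),
            "form": columns[1],
            "head": int(columns[6]),
            "deprel": columns[7],
        })
    return tokens


def subject_like_dependents(tokens, head_id):
    """Return (form, deprel) for nsubj/csubj/expl children of `head_id`."""
    return [
        (tok["form"], tok["deprel"])
        for tok in tokens
        if tok["head"] == head_id and tok["deprel"].split(":")[0] in SUBJECT_LIKE
    ]


tokens = parse_rows(CONLLU)
print(subject_like_dependents(tokens, 7))
```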

In the predicate-object-final structure model of Classical Chinese, 信斯言也 is divided into 信-斯言-也, where 信 is the "predicate", 斯言 the "object", and 也 the "final". Likewise, 是X也 is divided into 是-X-也, where 是 is the "predicate", X the "object", and 也 the "final". However, UD considers that every copular clause has a subject (not an object), so we treat 是 as a subject for 是X也 in UD.

Then, how do we annotate the sentence 信斯言也是周無遺民也 in SUD?

sylvainkahane commented 4 years ago

@amir-zeldes I agree with you. I don't think it is now possible to leave UD v2, because we have too many resources in this format and too many users. Which means that people who want to use a different format must be sure that their format is equivalent to or richer than UD v2 and convertible into UD v2. That is what we are trying to do with SUD.

@KoichiYasuoka I cannot answer your question. I don't understand your example (I would need at least a complete translation) and I don't understand your annotation in UD. I don't understand how this sentence can have two subjects and in addition an expletive which, if I understand correctly, is also a surface-syntactic subject. In any case, if you want to see how this sentence has been translated into SUD, you can check it on the Grew platform. All UD treebanks have been converted into SUD and are available. We do not apply a rule for expl in languages we don't know. There are 5 verbs in your treebank with two subj and one expl.

gossebouma commented 4 years ago

@nschneid writes:

The distinction of expl is more semantic than syntactic for me This makes me wonder if we shouldn't be treating expletives as semantically-defined subtypes of ordinary grammatical relations like subject and object.

This is one option we considered in Bouma et al, Expletives in Universal Dependency Treebanks (UDW 2018). In the end, we opt for a slightly less radical approach, where expl relations are subtyped. It is equally informative but maintains more clearly the core-adjunct distinction.

ikazos commented 4 years ago

Hi all, I'm just a college student learning about UD and trying to understand the ongoing discussions about UD, and I was curious about the Classical Chinese example presented by @KoichiYasuoka, which is the sentence 信斯言也是周無遺民也. I speak Japanese and I studied a bit of Classical Chinese in school. Please correct me if I'm wrong in any of my following discussion.

First of all, the data. The sentence seems to come from a work by the Chinese philosopher Mencius, and the context is that Mencius is talking to one of his disciples about the danger of quoting people out of context and misunderstanding them. To support his point, Mencius gives the example of a poem, one of whose verses says that "all the people of the Zhou Dynasty have left their country." Mencius says that this verse is supposed to be taken figuratively, not literally -- it is not the case that every single person in the Zhou Dynasty left the country.

In Classical Chinese, the sentence-final particle 也 functions as a declarative / indicative marker. As @KoichiYasuoka has pointed out, the sentence 信斯言也是周無遺民也 comprises two declarative / indicative clauses: 信斯言也 and 是周無遺民也. Disregarding the meaning of 是 for now, I have provided the glosses for the two clauses in (1) and (2). Of the two English translations, the first one is more literal, and the second one is more of a free, semantic translation. Considering the context, I believe that the whole sentence translates to a conditional construction like (3). The punctuation is based on the source of the text (link above).

1.  信      斯   言     也
    believe this speech PARTICLE
    '(One) believes this speech.'
    '(One) believes such words.'

2.  是   周   無   遺          民     也
    this Zhou NEG leave.behind people PARTICLE
    'This Zhou does not leave people behind.'
    'This country, the Zhou Dynasty, loses all its people.'

3.  信斯言也,是周無遺民也。
    'If one is to take such words literally,
     then it would mean that the Zhou Dynasty has lost all its people.'

With that, I have some questions for the example you provided. Please excuse my ignorance in Classical Chinese / UD.

  1. Is the sentence a conditional construction?
  2. There are two occurrences of 也 in this sentence, and I took both to be final declarative / indicative particles. However, you have different glosses for the two 也s in your example. Is the first one not a particle?
  3. You said that 是 is an expletive that leads the first clause 信斯言也. Do you mean that 是 and the first clause 信斯言也 are in the same relationship as it and that we burned the potatoes are in the English sentence It was frustrating that we burned the potatoes, which is a case of extraposition? I thought that 是 is a demonstrative that means this, and together with 周 'Zhou Dynasty' it meant 'this Zhou Dynasty'. But perhaps 是...也 is a special kind of construction in Classical Chinese that is used for conditional constructions?

Quoting your example for convenience:

# sent_id = KR1h0001_009_par4_280-289
# text = 信斯言也是周無遺民也
1       信      信      VERB    v,動詞,行為,態度      _                       7       csubj   _       Gloss=believe|SpaceAfter=No
2       斯      斯      PRON    n,代名詞,指示,*       PronType=Dem            3       det     _       Gloss=this|SpaceAfter=No
3       言      言      NOUN    n,名詞,可搬,伝達      _                       1       obj     _       Gloss=speech|SpaceAfter=No
4       也      也      PART    p,助詞,提示,*         _                       1       mark    _       Gloss=that-which|SpaceAfter=No
5       是      是      PRON    n,代名詞,指示,*       PronType=Dem            7       expl    _       Gloss=this|SpaceAfter=No
6       周      周      PROPN   n,名詞,主体,国名      Case=Loc|NameType=Nat   7       nsubj   _       Gloss=[country-name]|SpaceAfter=No
7       無      無      VERB    v,動詞,存在,存在      Polarity=Neg            0       root    _       Gloss=not-have|SpaceAfter=No
8       遺      遺      VERB    v,動詞,行為,得失      VerbForm=Part           9       amod    _       Gloss=leave-behind|SpaceAfter=No
9       民      民      NOUN    n,名詞,人,人          _                       7       obj     _       Gloss=people|SpaceAfter=No
10      也      也      PART    p,助詞,句末,*         _                       7       discourse:sp    _       Gloss=[final-particle]|SpaceAfter=No

Thank you very much.