UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

"It seems" #34

Closed GPPassos closed 6 years ago

GPPassos commented 7 years ago

There are some cases in the corpus of "it seems" in which it occurs as expl(seems,it) and others as nsubj(seems, it).

I understand that nsubj(seems, it) is correct when "it" is a pronoun with a referent, but I'm confused by some specific sentences.

For instance:

# sent_id = weblog-blogspot.com_healingiraq_20040409053012_ENG_20040409_053012-0011
# text = Most people haven't gone to work the last few days, although it seems that the rest of Baghdad is 'normal' (if you can define what normal is).
14  it  it  PRON    PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  15  nsubj   _   _
15  seems   seem    VERB    VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   5   advcl   _   _

For an inverse case, doesn't the first "it" here have a referent (one that is not present in this sentence, and that is the same as the referent of the second "it")?

# sent_id = reviews-018548-0004
# text = it seems like its healthier too, but its prolly not.
1   it  it  PRON    PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  2   expl    _   _
2   seems   seem    VERB    VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    _   _
3   like    like    SCONJ   IN  _   6   mark    _   _
4   it  it  PRON    PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  6   nsubj   _   SpaceAfter=No
5   s   be  AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   6   cop _   _
6   healthier   healthier   ADJ JJR Degree=Cmp  2   csubj   _   _
7   too too ADV RB  _   2   advmod  _   SpaceAfter=No
8   ,   ,   PUNCT   ,   _   12  punct   _   _
9   but but CCONJ   CC  _   12  cc  _   _
10  it  it  PRON    PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  12  nsubj   _   SpaceAfter=No
11  s   be  AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   12  cop _   _
12  prolly  prolly  ADV RB  _   2   conj    _   _
13  not not PART    RB  _   12  advmod  _   SpaceAfter=No
14  .   .   PUNCT   .   _   2   punct   _   _

Is this correct?

Thank you.
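
One way to list such cases systematically is to scan the CoNLL-U file for every "it" attached to a form of "seem" and tally the relation used. The following is a minimal Python sketch, assuming the standard 10-column, tab-separated CoNLL-U layout; the function name `it_seem_relations` and the two-sentence sample are made up for illustration and are not part of any UD tool.

```python
from collections import Counter

def it_seem_relations(conllu_text):
    """Count the deprel of every token with lemma 'it' whose head has lemma 'seem'."""
    counts = Counter()
    for block in conllu_text.strip().split("\n\n"):
        tokens = {}
        for line in block.splitlines():
            if not line or line.startswith("#"):
                continue  # skip sentence-level comments
            cols = line.split("\t")
            if len(cols) != 10 or not cols[0].isdigit():
                continue  # skip multiword-token ranges and empty nodes
            tokens[cols[0]] = cols
        for cols in tokens.values():
            head = tokens.get(cols[6])  # None when the head is the root (0)
            if cols[2] == "it" and head is not None and head[2] == "seem":
                counts[cols[7]] += 1
    return counts

# Invented two-sentence sample, one with each labeling
sample = "\n\n".join([
    "# sent_id = a\n"
    "1\tit\tit\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tseems\tseem\tVERB\tVBZ\t_\t0\troot\t_\t_",
    "# sent_id = b\n"
    "1\tit\tit\tPRON\tPRP\t_\t2\texpl\t_\t_\n"
    "2\tseems\tseem\tVERB\tVBZ\t_\t0\troot\t_\t_",
])
print(it_seem_relations(sample))
```

A count that mixes `nsubj` and `expl` is only a starting point; each hit still has to be checked by hand for referentiality.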

nschneid commented 7 years ago

I think the first example clearly should be expl.

For the second example the first "it" strikes me as ambiguous.

sebschu commented 6 years ago

Hi @GPPassos,

Yes, in the first example it should definitely be expl and I just fixed that.

And I also agree that the "it" in the second example should be nsubj (if you change that sentence to "it seems to be healthier too ...", you potentially get a different meaning, which suggests that "it" here is not a syntactic expletive).

jnivre commented 6 years ago

I would say the second sentence is simply ambiguous. Try replacing the two occurrences of "it" by "this":

this seems like it is healthier too
it seems like this is healthier too

Both are fine grammatically but they correspond to different readings, the first with referential "it", the second with expletive "it".

sebschu commented 6 years ago

I think most generative grammarians would still not consider the it in your second example as a true syntactic expletive, but if we are making the distinction based on referentiality, then I agree that there is a reading in which it should be an expletive.

jnivre commented 6 years ago

Why wouldn't it be expletive? It is non-referential and is not assigned a semantic role (because it is the subject of a raising verb)? We really need a working group on expletives. :)

dseddah commented 6 years ago

Hi all, in those cases where two readings are available and no disambiguation is possible at the sentence level, wouldn't it make sense to allow double labeling, like nsubj|expl?

arademaker commented 6 years ago

@dseddah another, simpler possibility with less impact on the conllu format would be to allow one or more analyses for each sentence, possibly using the sentence metadata to indicate the versions.
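
As a rough illustration of this idea, the Python sketch below groups several analyses of one sentence by `sent_id`. The `# analysis = N` comment is purely hypothetical: CoNLL-U allows `# key = value` sentence comments, but no such key is defined by UD, and `group_analyses` and the sample data are invented for this example.

```python
from collections import defaultdict

def group_analyses(conllu_text):
    """Map sent_id -> {analysis id -> list of token lines}."""
    grouped = defaultdict(dict)
    for block in conllu_text.strip().split("\n\n"):
        meta, tokens = {}, []
        for line in block.splitlines():
            if line.startswith("#"):
                key, sep, value = line[1:].partition("=")
                if sep:
                    meta[key.strip()] = value.strip()
            elif line:
                tokens.append(line)
        # "analysis" is a hypothetical key; default to "1" when it is absent
        grouped[meta.get("sent_id")][meta.get("analysis", "1")] = tokens
    return dict(grouped)

# Invented sample: the same sentence id with two competing analyses
sample = (
    "# sent_id = example-1\n# analysis = 1\n"
    "1\tit\tit\tPRON\tPRP\t_\t2\texpl\t_\t_\n\n"
    "# sent_id = example-1\n# analysis = 2\n"
    "1\tit\tit\tPRON\tPRP\t_\t2\tnsubj\t_\t_"
)
for analysis, rows in group_analyses(sample)["example-1"].items():
    print(analysis, rows[0].split("\t")[7])
```

Any tool that assumes one tree per `sent_id` would need a step like this before it could consume such a file.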

dan-zeman commented 6 years ago

Double labeling will add to processing complexity, may "solve" some difficult annotation cases, but it can hardly solve all of them because sometimes you may need to change the structure too.

jnivre commented 6 years ago

I completely agree. A cost-benefit analysis will easily tell you that it is not worth it. The whole idea of a treebank (or any labeled corpus) is that we provide contextually disambiguated readings. Otherwise, we can just write a grammar. :)

dseddah commented 6 years ago

On 25 Oct 2017, at 12:24, Joakim Nivre notifications@github.com wrote:

I completely agree. A cost-benefit analysis will easily tell you that it is not worth it. The whole idea of a treebank (or any labeled corpus) is that we provide contextually disambiguated readings. Otherwise, we can just write a grammar. :)

I beg to disagree :) Even the PTB guidelines included the notion of pseudo-attachments for structural ambiguities (page 102, [1]). With some quick digging, I found that there were even proposals for underspecification in dependency label assignment for difficult cases [2], and in our own work on annotating live video game chat logs, there is sometimes no way to disambiguate between two structures (see fig. 3 in [3]).
We talked about it once at Coling last year, and you suggested that in particular cases it could be justifiable to include two concurrent analyses of the same sentence. I think it's a cool idea, but what if there are more analyses? At some point, it should be possible to represent a shared forest. Having treebanks for training parsers is of course very nice, but we shouldn't be forced to disambiguate arbitrarily if we can avoid it (which is not yet possible, except with this duplicate-sentence hack).

Best, Djamé

[1] http://groups.inf.ed.ac.uk/switchboard/TreebankII.pdf
[2] http://www.aclweb.org/anthology/W08-1303
[3] http://pauillac.inria.fr/~seddah/wnut_ExtremeUGC.pdf

dseddah commented 6 years ago

Oops, I missed that post. Of course, you're right.

On 25 Oct 2017, at 12:20, Alexandre Rademaker notifications@github.com wrote:

@dseddah another, simpler possibility with less impact on the conllu format would be to allow one or more analyses for each sentence, possibly using the sentence metadata to indicate the versions.

sebschu commented 6 years ago

@jnivre Coming back to the original issue, you are totally right. I remembered that there was something special about these copy raising constructions (as it turns out they are commonly called). There is some disagreement about where the expletive is generated, but everybody seems to agree that the "it" is an expletive in these constructions.

However, after having thought more about this (and having read Asudeh & Toivonen, 2012), I think the embedded clause should actually be ccomp or advcl (depending on whether it's introduced by that or like/as if/...) rather than being a csubj.

(This wasn't part of the discussion so far, but I incorrectly assumed that whenever we have an expletive subject and an embedded that-clause, the embedded clause should be a csubj.)

As far as I can tell, the current documentation doesn't have anything to say about these cases but there is a difference between seems constructions with an adjectival complement as in (1) and without as in (2).

(1) It seems clear that we should decline
(2) It seems that the rest of Baghdad is normal.

(1) is very similar to (3), and in both cases we can move the embedded clause into subject position (at least I think so, both of these sound a bit strange to me but I think they are nevertheless grammatical).

(3) It is clear that we should decline
(4) That we should decline seems/is clear

The complement of (2), on the other hand, can never be in subject position.

Let me know if you agree with this analysis and then I'll add a few sentences to the documentation.
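
The proposal above can be summarised as a small decision rule. The Python sketch below is only one reading of it: the function name, the marker sets (including "as though", which stands in for the "..." in the list above), and the choice to keep csubj when an adjectival complement is present are assumptions made for illustration, not agreed UD guidelines.

```python
# Assumed marker inventories; the thread only names "that" and "like/as if/..."
THAT_MARKERS = {"that"}
LIKE_MARKERS = {"like", "as if", "as though"}

def proposed_clause_relation(has_adjectival_complement, marker):
    """Suggest a relation for the clause embedded under expletive 'it seems'."""
    if has_adjectival_complement:
        # "It seems clear that ...": the clause can also appear in subject position
        return "csubj"
    marker = marker.lower()
    if marker in THAT_MARKERS:
        return "ccomp"  # "It seems that ..."
    if marker in LIKE_MARKERS:
        return "advcl"  # "It seems like / as if ..."
    return None  # not covered by the proposal

print(proposed_clause_relation(True, "that"))
print(proposed_clause_relation(False, "that"))
print(proposed_clause_relation(False, "like"))
```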

jnivre commented 6 years ago

Yes, I think you are right, and I think we need to review the use of expl more generally. Leaving aside its use for reflexives (and other obj positions), we seem to be using it for two essentially different types of constructions. The first type is the classic weather example, where the predicate does not have any semantic arguments, but where syntax in many languages requires a syntactic subject. Hence:

(1) *rains
(2) it rains

In this case, the expletive is clearly a syntactic subject, and it is not a "proxy" for any other constituent in the sentence. With hindsight, I think it may have been more correct to label this "nsubj:expl", because it unquestionably is a syntactic subject (as shown, for example, by the tag question test: "it rains, doesn't it").

The second type of use is where there is an element filling the semantic argument role corresponding to subject, but where the expletive is introduced to allow that argument to move to another position in the sentence, typically because of heaviness or factors related to information structure. Examples:

(3) that it rained surprised me
(4) it surprised me that it rains

(5) a cat is on the mat
(6) there is a cat on the mat

Here, UD takes the view that what is the nsubj in (3) and (5) is also the nsubj in (4) and (6), so here the expletive is added in the position of the syntactic argument to allow it to "move" elsewhere. However, the question is whether the "moved" element retains its syntactic function. Some syntacticians I have talked to do not accept that "that it rains" and "a cat" are subjects in (4) and (6). For example, the tag question test again points to the expletive being the subject, and the other constituent would then be a complement of some kind. In other words, there would not be a common nsubj in (3)-(4) and (5)-(6); there would only be a common semantic argument.

Now, in the case of "seems" (without an adjectival complement), there is not even an alternation between having the semantic argument in subject position or not, so it becomes even more questionable to call the clause a csubj, but I think many people would argue that it is in principle the same case as above.

In conclusion, I am slightly worried about starting to patch the guidelines before we have done a more systematic survey of these constructions.

sebschu commented 6 years ago

In conclusion, I am slightly worried about starting to patch the guidelines before we have done a more systematic survey of these constructions.

Agreed. Maybe we should also ban expletives :)

jnivre commented 6 years ago

I second that. :)

manning commented 6 years ago

My comments: I agree with @nschneid / @jnivre that the first example is expl (and was wrong), while the second is (in principle) ambiguous.

As a short reading of the "traditional generative" analysis of seem, this stackexchange post by John Lawler seemed good to me - I guess he's retired and has a lot of time to work on these kinds of things. :)

https://english.stackexchange.com/questions/97541/what-is-the-difference-between-seems-like-seems-that-seems

Note that he does argue for the clausal argument of seem to be the subject, even though I agree that it-extraposition seems all but mandatory in these cases.

Re @jnivre's most recent long post (Oct 26): I would agree that the analysis in SD/UD (English) where we take the "logical subject" argument as nsubj when an expl subject is present is non-standard with respect to most generative syntax, which would regard these arguments as "some kind of complement". However, it seemed a very appealing analysis to me and @mcdm. Firstly, in (6), if you can make the cat some kind of complement in phrase structure, you can appear happy, but if you have to name it as a complement in UD, the choices seem pretty ugly. Neither dobj nor obl appears very appropriate, whereas still treating the logical subject as nsubj seemed to us quite nice. Secondly, it is sort of convenient with respect to applications like relation extraction, since you get parallel analyses, as you note. However, I can see the arguments for having an nsubj:expl and dobj:expl instead, and it is more consistent with the syntactic mainstream.

I agree with @jnivre that we shouldn't try to annotate alternative analyses in treebanks. Indeed, I take the PTB experience (where they had notations for alternative assignments of POS and syntactic structure) as an example of failure - these facilities were very rarely (and inconsistently) used, and so the result wasn't worth the complexity of having them. Both approaches are conceptually justifiable, but that one seems to me the clearest and the sweet spot. For many sentences (with PP attachments, etc.) you could argue that multiple readings are syntactically possible. But what a treebank does is annotate the one that is correct in the context of a text. Now, usually that is 99.9% clear, but sometimes not so clear. If you nevertheless view your job as annotating sentence s with the tree t such that t = argmax_t' P(t'|s), then you should choose one tree as your best guess. That sounds good to me. To argue for marking ambiguity, you have to adopt some more complex statement and say something like: you will annotate the set of trees { t | P(t|s) > 0.1 }.
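
The two annotation policies contrasted here can be made concrete with a toy example. The Python sketch below assumes a made-up distribution P(t|s) over two candidate analyses of one sentence; the probabilities and the helper names `best_tree` and `plausible_trees` are invented for illustration only.

```python
def best_tree(dist):
    """Single-tree policy: return the t maximising P(t|s)."""
    return max(dist, key=dist.get)

def plausible_trees(dist, threshold=0.1):
    """Ambiguity-marking policy: return { t | P(t|s) > threshold }."""
    return {tree for tree, p in dist.items() if p > threshold}

# Invented probabilities for the two readings of the first "it"
dist = {"expl(seems, it)": 0.6, "nsubj(seems, it)": 0.4}

print(best_tree(dist))
print(sorted(plausible_trees(dist)))
```

Under the first policy every sentence yields exactly one tree; under the second the size of the output varies per sentence, which is the extra complexity the comment refers to.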

amir-zeldes commented 6 years ago

I agree with @manning, and let me add that I think those decisions are also better for cross-linguistic compatibility, since all languages are guaranteed to have the semantically filled representation in these constructions, whereas the expletive may or may not be there.

jnivre commented 6 years ago

Good point. We should make sure to remember this for the new and improved guidelines for core arguments and expletives that we are hoping to create soon. :)