UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

An appalling lemma discrepancy #480

Open AngledLuffa opened 10 months ago

AngledLuffa commented 10 months ago
# sent_id = reviews-071650-0012
# text = I was appalled.
1       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      3       nsubj   3:nsubj _
2       was     be      AUX     VBD     Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   3       cop     3:cop   _
3       appalled        appalled        ADJ     JJ      Degree=Pos      0       root    0:root  SpaceAfter=No
4       .       .       PUNCT   .       _       3       punct   3:punct _

vs

# sent_id = newsgroup-groups.google.com_INTPunderground_b2c62e87877e4a22_ENG_20050906_165900-0015
24      I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      26      nsubj:pass      26:nsubj:pass   _
25      am      be      AUX     VBP     Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin   26      aux:pass        26:aux:pass     _
26      appalled        appal   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     14      ccomp   14:ccomp        _

Should this be appal instead of appall, anyway? Similar question for enrol, I suppose. It doesn't look right to my American sensibilities.
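To look for more discrepancies of this kind, one option is to group tokens by surface form and report every form that appears with more than one lemma/UPOS pair. The following is only a minimal sketch: the function names are hypothetical, it assumes standard 10-column tab-separated CoNLL-U input, and it will also flag forms that are legitimately ambiguous, so the output still needs manual review.

```python
from collections import defaultdict


def parse_conllu_tokens(text):
    """Yield (form, lemma, upos, xpos) for each basic token line in CoNLL-U text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if not cols[0].isdigit():
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        yield cols[1], cols[2], cols[3], cols[4]


def lemma_discrepancies(text):
    """Map lowercased form -> set of (lemma, upos), kept only if more than one pair occurs."""
    seen = defaultdict(set)
    for form, lemma, upos, _xpos in parse_conllu_tokens(text):
        seen[form.lower()].add((lemma, upos))
    return {form: pairs for form, pairs in seen.items() if len(pairs) > 1}


# The two "appalled" rows from above, plus one unambiguous token.
sample = "\n".join([
    "3\tappalled\tappalled\tADJ\tJJ\tDegree=Pos\t0\troot\t0:root\tSpaceAfter=No",
    "26\tappalled\tappal\tVERB\tVBN\tTense=Past|VerbForm=Part|Voice=Pass\t14\tccomp\t14:ccomp\t_",
    "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t3:punct\t_",
])
print(lemma_discrepancies(sample))
```

On the sample, only "appalled" is reported, with both of its lemma/UPOS pairs.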

AngledLuffa commented 10 months ago

Thoughts on this? I'd like to switch it to the double-letter versions.

nschneid commented 10 months ago

Yeah definitely not "appal". I guess it should be a VERB, with lemma "appall".

AngledLuffa commented 10 months ago

That's not the only case where "I was [a-z]+ed" has ADJ/JJ as the tag:

# sent_id = newsgroup-groups.google.com_INTPunderground_b2c62e87877e4a22_ENG_20050906_165900-0048
# text = ... I was shocked at the lack of racial diversity.
30      shocked shocked ADJ     JJ      Degree=Pos      0       root    0:root  _

# sent_id = reviews-126171-0003
# text = ... but I was disappointed with their customer service.
16      disappointed    disappointed    ADJ     JJ      Degree=Pos      4       conj    4:conj:but      _

# sent_id = reviews-360937-0002
# text = I must say, I was impressed with the size ...
7       impressed       impressed       ADJ     JJ      Degree=Pos      3       ccomp   3:ccomp _

# sent_id = email-enronsent12_01-0069
# text = I didn't feel guilty about the garage sale, that's why I was annoyed - being notified at 10:00 at night GRRRRRRR.
16      annoyed annoyed ADJ     JJ      Degree=Pos      13      advcl:relcl     13:advcl:relcl  _

# sent_id = email-enronsent37_01-0105
# text = And I was relieved when Nicki called to let me know she was home safe.
4       relieved        relieved        ADJ     JJ      Degree=Pos      0       root    0:root  _

# sent_id = newsgroup-groups.google.com_herpesnation_c74170a0fcfdc880_ENG_20051125_075200-0011
# text = ... before I was finished being a teenager.
35      finished        finished        ADJ     JJ      Degree=Pos      28      advcl   28:advcl:before _

although not consistently so:

# sent_id = weblog-blogspot.com_rigorousintuition_20060511134300_ENG_20060511_134300-0249
# text = I was amazed at the spiel they delivered.
3       amazed  amaze   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     0       root    0:root  _

It looks like a state of mind is usually tagged as ADJ here.
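A search like the one above could be run roughly as follows. This is a sketch under assumptions, not the query actually used: the helper names are hypothetical, the input is assumed to be 10-column CoNLL-U, and the match is done on adjacent tokens ("I", "was", then a form matching `[a-z]+ed`) rather than on the `# text` line. In the sample, the feature and enhanced-dependency columns are blanked out, and the rows for "I" and "was" in the second sentence are filled in for illustration only.

```python
import re
from collections import Counter

ED_FORM = re.compile(r"^[a-z]+ed$")


def sentences(text):
    """Split CoNLL-U text into sentences, each a list of 10-column token rows."""
    for block in re.split(r"\n\s*\n", text.strip()):
        toks = []
        for line in block.splitlines():
            cols = line.split("\t")
            # Skip comments, malformed lines, multiword ranges and empty nodes.
            if line.startswith("#") or len(cols) != 10 or not cols[0].isdigit():
                continue
            toks.append(cols)
        if toks:
            yield toks


def i_was_ed_tags(text):
    """Count (form, UPOS, XPOS) for the third token of each 'I was [a-z]+ed' match."""
    counts = Counter()
    for toks in sentences(text):
        for a, b, c in zip(toks, toks[1:], toks[2:]):
            if a[1] == "I" and b[1].lower() == "was" and ED_FORM.match(c[1]):
                counts[(c[1], c[3], c[4])] += 1
    return counts


sample = (
    "# text = I was appalled.\n"
    "1\tI\tI\tPRON\tPRP\t_\t3\tnsubj\t_\t_\n"
    "2\twas\tbe\tAUX\tVBD\t_\t3\tcop\t_\t_\n"
    "3\tappalled\tappalled\tADJ\tJJ\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_\n"
    "\n"
    "# text = I was amazed at the spiel they delivered.\n"
    "1\tI\tI\tPRON\tPRP\t_\t3\tnsubj:pass\t_\t_\n"
    "2\twas\tbe\tAUX\tVBD\t_\t3\taux:pass\t_\t_\n"
    "3\tamazed\tamaze\tVERB\tVBN\t_\t0\troot\t_\t_\n"
)
for key, n in i_was_ed_tags(sample).items():
    print(key, n)
```

Grouping the counts by form afterwards makes it easy to see which participles are tagged both ways.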

nschneid commented 10 months ago

Deciding VERB (past participle) vs. ADJ is very nuanced. I defer to @amir-zeldes when it comes up. (One test for ADJ is un- negation, which works for some of these. Another is very modification. I'm not sure there's a good reason that shocked should be ADJ but amazed should be VERB.)

amir-zeldes commented 10 months ago

It is indeed very tricky, since the original PTB guidelines are internally inconsistent and offer contradictory tests without a ranking. I do rely on PTB/ON precedents in doubtful cases, but my basic methodology, and hopefully the one you find in practice in the GU corpora, is articulated here, and I'll repeat it for convenience. The order of priority we have noticed in prior corpora, and which we enforce, is:

  1. Negation, a.k.a. "no such verb" - if the item has negation that would preclude a verbal lemma, it must be an adjective (uncredited -> JJ). This outranks even a by-agent; exception: if the "un-" lemma exists as a verb (e.g. Santorini's example "untied" -> "untie"). Similar logic applies to the 'no VBG for no such verb' rule, e.g. "outgoing" cannot be VBG, because "outgo" is not a verb.
  2. By-agent - if it has an overt by agent, then it is VBN. Santorini originally ranked this below intensifiers, but the corpora seem to behave the opposite way so I have followed them and I think we've been pretty consistent about it. It's also in the spirit of prioritizing argument structure where possible.
  3. If a relative clause paraphrase is possible, prefer VBG/VBN, but if that changes the meaning, prefer JJ: "appetizing dish" -> *"dish which appetizes" -> JJ; "existing safeguards" -> "safeguards that exist" -> VBG
  4. If "get" is possible but "become" is bad, prefer VBN - Santorini's "I was/got/*became married"
  5. Prefer JJ for state if it has a different reading from the event (e.g. "I was mistaken/JJ" != "someone mistook me")

In OntoNotes, it seems "impressed" is about 50-50 when not used in a perfect construction or as a finite verb: it is tagged VBN whenever "by" appears (criterion 2), otherwise JJ. Criterion 4 would justify this behavior IMO, but we have criterion 3 ranked higher, which is why it is always VBN in GUM in this function. Personally I would opt for VBN here; it seems pretty transparent to me.
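Read as a decision procedure, the ranking above could be sketched like this. It is only an illustration, not the GUM tooling: the class, field and function names are hypothetical, the answers to each test have to be supplied by a human annotator, and treating test 3 as three-valued (passes, fails, or not decisive) is an assumption about how the later tests ever get reached.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Annotator judgments for one participle-like token (all names hypothetical)."""
    no_such_verb: bool = False                  # test 1: negated form has no verbal lemma
    by_agent: bool = False                      # test 2: overt by-agent
    relative_paraphrase: Optional[bool] = None  # test 3: True = OK, False = changes meaning
    get_ok_become_bad: bool = False             # test 4: "got X" fine, "became X" bad
    state_differs_from_event: bool = False      # test 5: stative reading != eventive reading


def choose_xpos(ev, participle="VBN"):
    """Apply the tests in priority order; return a tag, or None if nothing decides."""
    if ev.no_such_verb:
        return "JJ"            # outranks even a by-agent
    if ev.by_agent:
        return participle
    if ev.relative_paraphrase is True:
        return participle
    if ev.relative_paraphrase is False:
        return "JJ"
    if ev.get_ok_become_bad:
        return participle
    if ev.state_differs_from_event:
        return "JJ"
    return None                # undecided: fall back to precedent or manual review


print(choose_xpos(Evidence(no_such_verb=True)))                # "uncredited"
print(choose_xpos(Evidence(relative_paraphrase=False), "VBG")) # "appetizing dish"
print(choose_xpos(Evidence(relative_paraphrase=True), "VBG"))  # "existing safeguards"
print(choose_xpos(Evidence(get_ok_become_bad=True)))           # "married"
print(choose_xpos(Evidence(state_differs_from_event=True)))    # "mistaken"
```

The "untied" exception is handled by the annotator rather than the code: since "untie" exists as a verb, `no_such_verb` stays false for it.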

nschneid commented 10 months ago

> ON has both tags for lowercased "united", though JJ is the majority tag. For the criteria in the link, the relative paraphrase decides it for VBN ("united people" can be "people who are united by ...")

_Originally posted by @amir-zeldes in https://github.com/UniversalDependencies/UD_English-GUM/issues/78#issuecomment-1847631652_

If test 3 permits adding a copula, then "united states" => "states that are united" clearly passes, but so would canonical stative adjectives ("tall people" => "people who are tall"). So I'm not sure how that would favor the verb analysis.

If it doesn't permit adding a copula, then "states that united" is not quite the same meaning, though it is related by the inchoative alternation.

amir-zeldes commented 10 months ago

> but so would canonical stative adjectives

Yes, this is true, but it's more consistent with VBG (uniting factors -> factors which unite), which can always be construed that way, and I don't see why active participles get to stay verbs in this use when passives don't. For me the question is mainly one of the lexeme (lemma), and note that not all participle-like forms pass this test, for example "missing documents" are not "documents which miss".

nschneid commented 10 months ago

Well, there's a semantic difference: "uniting" is dynamic, "united" is stative. Paraphrasing as a finite verb would usually favor the dynamic reading except with verbs that are inherently stative like "exist".

rueter commented 10 months ago

Hi, @nschneid! Couldn't this also be understood as partially a matter of aspect? I'm thinking of the -ing participle vs. the -ed participle: the -ing participle is imperfect and the -ed participle perfect. How does the word "unite" fit into this, you ask? Well, maybe "united" is simply the perfect aspect, such that the ongoing process is dynamic and, once completed, the result is stative. I would see both -ing and -ed participles as possibly verb-based.

amir-zeldes commented 10 months ago

> Well, there's a semantic difference: "uniting" is dynamic, "united" is stative. Paraphrasing as a finite verb would usually favor the dynamic reading except with verbs that are inherently stative like "exist".

Right, but even for "existing", if I search for DT followed by it, I get 36:3 VBG... So it seems that, at least for active participles, PTB recognizes the deverbal nature regardless of lexical stativeness.
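A count like this can be reproduced on any treebank that stores PTB tags in the XPOS column. The sketch below is an assumption about how such a query might look, not the search actually run on PTB: the function name is hypothetical, the input is assumed to be 10-column CoNLL-U, and the two sample sentences are toy data made up for illustration.

```python
from collections import Counter


def tags_after_determiner(conllu_text, form):
    """Count XPOS tags of `form` when the previous token in the sentence is tagged DT."""
    counts = Counter()
    prev_xpos = None
    for line in conllu_text.splitlines():
        if not line.strip():
            prev_xpos = None  # blank line = sentence boundary
            continue
        cols = line.split("\t")
        # Skip comments, malformed lines, multiword ranges and empty nodes.
        if line.startswith("#") or len(cols) != 10 or not cols[0].isdigit():
            continue
        if prev_xpos == "DT" and cols[1].lower() == form:
            counts[cols[4]] += 1
        prev_xpos = cols[4]
    return counts


# Toy data, not taken from any corpus.
sample = (
    "1\tthe\tthe\tDET\tDT\t_\t3\tdet\t_\t_\n"
    "2\texisting\texist\tVERB\tVBG\t_\t3\tamod\t_\t_\n"
    "3\tsafeguards\tsafeguard\tNOUN\tNNS\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tan\ta\tDET\tDT\t_\t3\tdet\t_\t_\n"
    "2\texisting\texisting\tADJ\tJJ\t_\t3\tamod\t_\t_\n"
    "3\tplan\tplan\tNOUN\tNN\t_\t0\troot\t_\t_\n"
)
print(tags_after_determiner(sample, "existing"))
```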

nschneid commented 10 months ago

I'm willing to change EWT's treatment of "united" to VERB/VBN per GUM policies. But there are likely other lexical items in EWT that are not consistent with GUM.

amir-zeldes commented 10 months ago

OK thanks - I wouldn't be surprised if there are also internal inconsistencies within GUM and EWT. It's probably a longer-term to-do, but worth looking at at some point.