UniversalDependencies / UD_English-GUM

kinda wrong #64

Closed AngledLuffa closed 7 months ago

AngledLuffa commented 1 year ago

Literally, kinda is wrong:

# sent_id = GUM_conversation_blacksmithing-24
# text = That's another thing too, is I kinda had a b- general idea, of kinda how to do it, just watching him.
16      of      of      ADP     IN      _       18      case    18:case Discourse=elaboration-additional:56->55:0
17      kinda   kinda   ADV     RB      Degree=Pos      18      advmod  18:advmod       _
18      how     how     SCONJ   WRB     PronType=Int    14      nmod    14:nmod:of      _
19      to      to      PART    TO      _       20      mark    20:mark _
20      do      do      VERB    VB      VerbForm=Inf    18      acl     18:acl:to       _
21      it      it      PRON    PRP     Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs  20      obj     20:obj  Entity=(4-event-giv:act-cf2-1-ana)27)|SpaceAfter=No

This should in general be split into "kind of", an analysis which happens in

# sent_id = GUM_conversation_blacksmithing-38
19      every   every   DET     DT      PronType=Tot    20      det     20:det  Entity=(49-object-new-cf6-2-sgl
20      kind    kind    NOUN    NN      Number=Sing     17      obl     17:obl:through  SpaceAfter=No|XML=<w>
21      a       of      ADP     IN      _       22      case    22:case XML=</w>
22      ligament        ligament        NOUN    NN      Number=Sing     20      nmod    20:nmod:of      Entity=49)|SpaceAfter=No

AngledLuffa commented 1 year ago

... although it should be noted that even the correctly split kinda is not marked as an MWT, whereas words such as gonna, wanna, and ligma get marked as MWT. Surely the MWT annotation is more correct.

amir-zeldes commented 1 year ago

The inconsistency is definitely kinda wrong :)

But in terms of our intentions, "kinda" is meant to be a single token in GUM, as it is in other LDC corpora - if you look at the whole corpus you'll also notice it's 20:1 in favor of the single token analysis, so it's just one case that slipped between the cracks. Will fix upstream of course.
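
For anyone who wants to check the single-token vs. split ratio themselves, here is a minimal sketch, assuming standard tab-separated CoNLL-U with 10 columns. The function name `count_kinda` and the helper `row` are made up for illustration and are not part of any GUM tooling. A "split" case is detected the way it appears in the blacksmithing-38 example above: a token "kind" carrying SpaceAfter=No, followed by a token with form "a" and lemma "of".

```python
from collections import Counter


def count_kinda(conllu_text):
    """Count single-token vs. split analyses of 'kinda' in CoNLL-U text."""
    counts = Counter()
    prev = None  # previous word row (10 columns); reset at sentence breaks
    for line in conllu_text.splitlines():
        if not line.strip():
            prev = None  # blank line ends the sentence
            continue
        if line.startswith("#"):
            continue  # comment line such as "# sent_id = ..."
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip MWT range lines (e.g. 1-2) and empty nodes (e.g. 1.1)
        form, lemma = cols[1].lower(), cols[2].lower()
        if form == "kinda":
            counts["single"] += 1
        elif (
            prev is not None
            and prev[1].lower() == "kind"
            and "SpaceAfter=No" in prev[9].split("|")
            and form == "a"
            and lemma == "of"
        ):
            counts["split"] += 1
        prev = cols
    return counts


def row(*fields):
    """Hypothetical helper: join fields into one tab-separated CoNLL-U row."""
    return "\t".join(fields)


# Two rows adapted from the examples quoted in this thread.
sample = "\n".join([
    row("17", "kinda", "kinda", "ADV", "RB", "Degree=Pos", "18", "advmod", "18:advmod", "_"),
    "",
    row("20", "kind", "kind", "NOUN", "NN", "Number=Sing", "17", "obl", "17:obl:through", "SpaceAfter=No|XML=<w>"),
    row("21", "a", "of", "ADP", "IN", "_", "22", "case", "22:case", "XML=</w>"),
])
print(count_kinda(sample))
```

To run it over the corpus, read each .conllu file as text and sum the returned counters. Note that the heuristic only catches the "kind" + "a" spelling; a split stored under a different surface form would need its own check.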

amir-zeldes commented 1 year ago

Ooh, I just realized why this one case was done differently: because it isn't the adverb kinda/RB! It's actually a real fusion of the non-lexicalized precursor where it's a noun governing a PP:

(so it's not "kinda a ligament" = "approximately a ligament")

AngledLuffa commented 1 year ago

That one is specifically different in terms of the ligament, but it does seem like in general this should be an amalgamation of "kind of". A similar thing would happen with "sorta" -> "sort of" if that ever showed up in GUM, although I don't see it here. "sorta" does show up in EWT one time, where it is split into "sort of". At any rate, I believe it is inconsistent that "sorta" gets split but "kinda" does not.

amir-zeldes commented 1 year ago

I think probably we should handle "sorta" as one token as well. It's not that I have a super strong feeling about what is more 'correct' (is it a new word? Is it right to treat it 'etymologically'?). I basically just want things to stay consistent across datasets, and it looks like past corpora have been leaning into "kinda" as a new word, so it makes sense to me that "sorta" would be the same.

It doesn't appear in PTB; the nearest I could find is the British National Corpus and COCA (one token, lots of instances), and one occurrence in the HCRC map task corpus where it was split. But in sum I would vote to keep these lexicalized adverbials together, especially since they would just be turned into fixed + ADV otherwise anyway.

nschneid commented 1 year ago

"sorta" = sort + a (lemma "of") in EWT. Personally I don't have a strong opinion about the "right" way to handle these informal contractions (also "gotta", "gonna", "wanna", "coulda"...) but the Penn annotators seem to want to separate them.

nschneid commented 1 year ago

EWT:

amir-zeldes commented 1 year ago

In keeping with what I think is going on in most Penn corpora (and it seems EWT too, quantitatively), I would vote for splitting gonna/wanna/gotta/coulda and keeping kinda/sorta as single tokens, especially considering that the latter are now basically adverbs (so if we split them we would just use fixed to connect them again and tag ADV/advmod).

nschneid commented 1 year ago

Oh let's not forget our good friend "dunno" = "du + n + no"

AngledLuffa commented 1 year ago

A lot of these aren't represented in either treebank, but there's also gimme, lemme, finna, woulda, shoulda

amir-zeldes commented 1 year ago

Because these contain verbs, modals and referring expressions, and in the absence of precedents, I would vote to tokenize these apart - otherwise if you're doing mention detection for coref or pronoun resolution, you can't align the "me" in lemme or gimme to any token etc. I think of "kinda/sorta" differently just because a. there is precedent, and b. they don't cause the same kinds of problems, and as I wrote above would just end up being fixed+ADV, so we may as well treat them as contemporary univerbized adverbs.
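
The proposal in this thread can be summarized as a small lookup, sketched below. This is only an illustration of the votes expressed above, not a released guideline or actual GUM code; the names `KEEP_WHOLE`, `SPLIT`, and `tokenization_policy` are hypothetical. Forms that nobody explicitly voted on (such as "dunno") are deliberately reported as unspecified, and the sketch does not say how a split form should be segmented.

```python
# Forms proposed to stay single tokens (lexicalized adverbs with precedent).
KEEP_WHOLE = {"kinda", "sorta"}

# Forms proposed to be tokenized apart (they contain verbs, modals,
# or referring expressions, per the discussion above).
SPLIT = {
    "gonna", "wanna", "gotta", "coulda",
    "gimme", "lemme", "finna", "woulda", "shoulda",
}


def tokenization_policy(form):
    """Return 'single', 'split', or 'unspecified' for an informal contraction."""
    key = form.lower()
    if key in KEEP_WHOLE:
        return "single"
    if key in SPLIT:
        return "split"
    return "unspecified"  # not covered by the votes in this thread


for word in ["kinda", "Sorta", "lemme", "gonna", "dunno"]:
    print(word, "->", tokenization_policy(word))
```

Keeping the two sets as plain data makes it easy to audit a treebank against the policy, for example by flagging any single token whose lowercased form is in `SPLIT`.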