Closed AngledLuffa closed 9 months ago
... although it should be noted that even the correctly split kinda is not marked as an MWT, whereas words such as gonna, wanna, and ligma get marked as MWT. Surely the MWT annotation is more correct.
The inconsistency is definitely kinda wrong :)
But in terms of our intentions, "kinda" is meant to be a single token in GUM, as it is in other LDC corpora - if you look at the whole corpus you'll also notice it's 20:1 in favor of the single token analysis, so it's just one case that slipped between the cracks. Will fix upstream of course.
Ooh, I just realized why this one case was done differently: because it isn't the adverb kinda/RB! It's actually a real fusion of the non-lexicalized precursor where it's a noun governing a PP:
(so it's not "kinda a ligament" = "approximately a ligament")
That one is specifically different in terms of the ligament, but it does seem like in general this should be an amalgamation of "kind of". A similar thing would happen with "sorta" -> "sort of" if that ever showed up in GUM, although I don't see it here. "sorta" does show up in EWT one time, where it is split into "sort of". At any rate, I believe it is inconsistent that "sorta" gets split but "kinda" does not.
I think probably we should handle "sorta" as one token as well. It's not that I have a super strong feeling about what is more 'correct' (is it a new word? Is it right to treat it 'etymologically'?). I basically just want things to stay consistent across datasets, and it looks like past corpora have been leaning into "kinda" as a new word, so it makes sense to me that "sorta" would be the same.
It doesn't appear in PTB; the nearest I could find are the British National Corpus and COCA (one token, lots of instances), and one occurrence in the HCRC map task corpus where it was split. But in sum I would vote to keep these lexicalized adverbials together, especially since they would just be turned into fixed + ADV otherwise anyway.
"sorta" = sort + a (lemma "of") in EWT. Personally I don't have a strong opinion about the "right" way to handle these informal contractions (also "gotta", "gonna", "wanna", "coulda"...) but the Penn annotators seem to want to separate them.
EWT:
In keeping with what I think is going on in most Penn corpora (and it seems EWT too, quantitatively), I would vote for splitting gonna/wanna/gotta/coulda and keeping kinda/sorta as single tokens, especially considering that the latter are now basically adverbs (so if we split them we would just use fixed to connect them again and tag ADV/advmod).
Oh let's not forget our good friend "dunno" = "du + n + no"
A lot of these aren't represented in either treebank, but there's also gimme, lemme, finna, woulda, shoulda
Because these contain verbs, modals and referring expressions, and in the absence of precedents, I would vote to tokenize these apart - otherwise if you're doing mention detection for coref or pronoun resolution, you can't align the "me" in lemme or gimme to any token etc. I think of "kinda/sorta" differently just because a. there is precedent, and b. they don't cause the same kinds of problems, and as I wrote above would just end up being fixed+ADV, so we may as well treat them as contemporary univerbized adverbs.
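For what it's worth, the policy proposed above could be sketched roughly as follows. This is only an illustration, not anything from the GUM or EWT build scripts: the names `SPLIT`, `KEEP`, `analyze` and `conllu_rows` are made up here, and the segmentations in the table are assumptions chosen for the example (only the "dunno" split is spelled out in this thread). The point it tries to show is that a split contraction gets a CoNLL-U range row plus one row per word, so the "me" in lemme ends up with its own ID that a coref mention can align to, while kinda/sorta stay as one row.

```python
# Illustrative policy table only. The segmentations below are assumptions
# for this sketch, not verified against GUM or EWT (the "dunno" split is
# the one quoted earlier in this thread).
SPLIT = {
    "gonna": ["gon", "na"],
    "wanna": ["wan", "na"],
    "gotta": ["got", "ta"],
    "coulda": ["could", "a"],
    "woulda": ["would", "a"],
    "shoulda": ["should", "a"],
    "gimme": ["gim", "me"],
    "lemme": ["lem", "me"],
    "dunno": ["du", "n", "no"],
}

# Lexicalized adverbs that the proposal keeps as single tokens.
KEEP = {"kinda", "sorta"}


def analyze(token):
    """Return the list of syntactic words for one surface token."""
    low = token.lower()
    if low in KEEP:
        return [token]
    # Unknown tokens fall through unchanged as a one-word list.
    return SPLIT.get(low, [token])


def conllu_rows(tokens):
    """Return (ID, FORM) pairs, with a range row before each split token."""
    rows = []
    next_id = 1
    for tok in tokens:
        words = analyze(tok)
        if len(words) > 1:
            # Multiword token: the range row carries the surface form.
            rows.append((f"{next_id}-{next_id + len(words) - 1}", tok))
        for word in words:
            rows.append((str(next_id), word))
            next_id += 1
    return rows


if __name__ == "__main__":
    for row_id, form in conllu_rows(["lemme", "kinda", "try"]):
        print(f"{row_id}\t{form}")
```

A real pipeline would also need lemmas, tags and casing on the split words, which this sketch deliberately leaves out.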
Literally, kinda is wrong: This should in general be split into kind of, an analysis which happens in