Closed by AngledLuffa 6 months ago
In either case -- like in GUM -- this should have a CorrectForm=you annotation.
It's not a typo. Style=Coll makes sense, as it is used in informal speech (i.e. colloquially), e.g. "How ya doin'?". "Y'all" is given as an example of vernacular (Style=Vrnc), but that's because it is mainly found in American English and not widespread. It's not slang (Style=Slng), as it follows vowel reduction and is not a different word to "you" -- compare this with "dosh", which is a slang term for "money".
It is not the only Coll to not have a CorrectForm:
# sent_id = reviews-255261-0010
16 'em they PRON PRP Case=Acc|Number=Plur|Person=3|PronType=Prs|Style=Coll 15 iobj 15:iobj _
... compare to
# sent_id = reviews-018548-0005
6 em they PRON PRP Case=Acc|Number=Plur|Person=3|PronType=Prs|Style=Coll|Typo=Yes 5 obj 5:obj CorrectForm='em
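A rough way to list every such case might be the sketch below. It assumes standard tab-separated, 10-column CoNLL-U input with Style in FEATS and CorrectForm in MISC, as in the two tokens above; the function names and the SAMPLE string are made up for illustration and are not part of any UD tooling.

```python
def parse_field(field):
    """Turn a CoNLL-U FEATS or MISC field into a dict ('_' means empty)."""
    if field == "_":
        return {}
    return dict(item.split("=", 1) for item in field.split("|") if "=" in item)


def style_without_correct_form(conllu_text, style="Coll"):
    """Return (sent_id, token_id, form) for tokens with the given Style
    value in FEATS and no CorrectForm key in MISC."""
    hits = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:  # skip anything that is not a token line
            continue
        feats = parse_field(cols[5])
        misc = parse_field(cols[9])
        if feats.get("Style") == style and "CorrectForm" not in misc:
            hits.append((sent_id, cols[0], cols[1]))
    return hits


# The two EWT tokens quoted above, rebuilt with real tab separators.
SAMPLE = "\n".join([
    "# sent_id = reviews-255261-0010",
    "\t".join(["16", "'em", "they", "PRON", "PRP",
               "Case=Acc|Number=Plur|Person=3|PronType=Prs|Style=Coll",
               "15", "iobj", "15:iobj", "_"]),
    "",
    "# sent_id = reviews-018548-0005",
    "\t".join(["6", "em", "they", "PRON", "PRP",
               "Case=Acc|Number=Plur|Person=3|PronType=Prs|Style=Coll|Typo=Yes",
               "5", "obj", "5:obj", "CorrectForm='em"]),
])

print(style_without_correct_form(SAMPLE))
```

On the sample, only the first token is reported, since the second one carries CorrectForm='em.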
The standard in EWT seems to be to treat colloquial forms as correct, unless they are themselves a misspelling of the colloquial form.
The EWT decisions are documented at https://universaldependencies.org/en/pos/PRON.html
Looks like GUM could use a small update to harmonize that ~lemma~ feature, then
@amir-zeldes
It's not hard to add the Style features to GUM; I propose this depedit:
text=/^(.?em|ya)$/&lemma=/they|you/ none #1:morph+=Style=Coll
text=/[Yy][Oo]/&lemma=/your/ none #1:morph+=Style=Slng
text=/^([Pp]rolly|[Dd]ef(fly)?)$/&xpos=/RB/ none #1:morph+=Style=Slng
text=/^[Aa]i$/&lemma=/be/ none #1:morph+=Style=Vrnc
text=/^[Yy].?all$/&xpos=/PRP/ none #1:morph+=Style=Vrnc
text=/.*in'?/&xpos=/VBG/ none #1:morph+=Style=Vrnc
text=/^(.?c[ou]z|.?cause)$/&xpos=/IN/ none #1:morph+=Style=Vrnc
lemma=/thy|thou/ none #1:morph+=Style=Arch
lemma=/you/&text=/[Yy]e/ none #1:morph+=Style=Arch
text=/^([Ww]ilt|[Aa]rt|[Dd]ost)$/ none #1:morph+=Style=Arch
text=/.*th/&xpos=/VBZ/ none #1:morph+=Style=Arch
text=/^([Hh]mm+|[Ss]oo+|.*eee)$/ none #1:morph+=Style=Expr
That should catch pretty much the same things as EWT, I think. However, as I've pointed out before, GUM doesn't really have a Typo/CorrectForm annotation - these are just projected from the target hypotheses layer, which aims to normalize sentences to what the annotator considers to be 'standard English' (with some guidelines). This definitely includes 'em and ya, so those are all currently Typo=Yes, and if we want them not to be, we'd need to automate it somehow.
If others approve of the depedit option above for Style, I could just categorically say that things with Style=.* are not Typos, and just leave the CorrectForm for them?
It is theoretically possible to have Style+Typo: EWT has "ya'll" which we consider a misspelling of "y'all". But the combination is presumably very rare.
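For what it's worth, the "Style means not a Typo" rule could be sketched roughly as below. This is only an illustration of the idea, not how GUM's build actually does it; the function name and the keep_typo flag (for the rare Style+Typo cases like "ya'll") are assumptions. It only touches the FEATS string, so CorrectForm in MISC is left alone.

```python
def drop_typo_when_styled(feats, keep_typo=False):
    """Remove Typo=Yes from a FEATS string if any Style=... feature is present.

    keep_typo=True is a hypothetical escape hatch for tokens that are both
    a styled form and a real misspelling of it.
    """
    if feats == "_":
        return feats
    items = feats.split("|")
    has_style = any(item.startswith("Style=") for item in items)
    if not has_style or keep_typo:
        return feats
    kept = [item for item in items if item != "Typo=Yes"]
    return "|".join(kept) if kept else "_"


before = "Case=Acc|Number=Plur|Person=3|PronType=Prs|Style=Coll|Typo=Yes"
print(drop_typo_when_styled(before))
```

Tokens without a Style feature keep their Typo=Yes untouched, so ordinary misspellings are not affected.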
Note that with Style=Arch, the docs specify ModernForm=... instead of CorrectForm=...
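If the projected CorrectForm values are kept for archaic tokens, the key would need renaming. A minimal sketch, assuming both keys live in the MISC column (as CorrectForm does in the EWT example earlier in the thread) and using a made-up helper name:

```python
def rename_for_archaic(feats, misc):
    """For tokens with Style=Arch in FEATS, rename CorrectForm=X to
    ModernForm=X in MISC; every other token is returned unchanged."""
    if "Style=Arch" not in feats.split("|"):
        return misc
    prefix = "CorrectForm="
    renamed = [
        "ModernForm=" + item[len(prefix):] if item.startswith(prefix) else item
        for item in misc.split("|")
    ]
    return "|".join(renamed)


print(rename_for_archaic("Person=2|Style=Arch", "CorrectForm=you"))
```

Non-archaic Style values, and tokens with an empty MISC ("_"), pass through as they are.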
OK, added the Style items from EWT to GUM - it would be nice to catch items in GUM that are not in EWT, but I'll leave that for another time...
In EWT, ya has been treated as
whereas in GUM, it was:
This occurs once in PUD. I went with the Style feature, but I'm flexible.