UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

`ya` for `you`: colloquial or typo? #481

Closed by AngledLuffa 6 months ago

AngledLuffa commented 7 months ago

In EWT, `ya` has been treated as:

# sent_id = email-enronsent15_01-0039
2       ya      you     PRON    PRP     Case=Acc|Person=2|PronType=Prs|Style=Coll       1       obj     1:obj   SpaceAfter=No

whereas in GUM, it was:

# sent_id = GUM_whow_languages-49
34      ya      you     PRON    PRP     Case=Acc|Number=Sing|Person=2|PronType=Prs|Typo=Yes     33      obj     33:obj  CorrectForm=you|Entity=111)|XML=<sic ana:::"you"></sic>

`ya` also occurs once in PUD. I went with the Style feature there, but I'm flexible.

AngledLuffa commented 7 months ago

https://github.com/UniversalDependencies/UD_English-PUD/commit/e20a47a468ca5cef529e67279aba9efe6598c6d6

rhdunn commented 7 months ago

In either case -- like in GUM -- this should have a CorrectForm=you annotation.

It's not a typo. Style=Coll makes sense, as it is used in informal speech (i.e. colloquially), e.g. "How ya doin'?". "Y'all" is given as an example of vernacular (Style=Vrnc), but that's because it is mainly found in American English and not widespread. It's not slang (Style=Slng), as it results from vowel reduction and is not a different word from "you"; compare "dosh", which is a slang term for "money".

AngledLuffa commented 7 months ago

It is not the only Coll token without a CorrectForm:

# sent_id = reviews-255261-0010
16      'em     they    PRON    PRP     Case=Acc|Number=Plur|Person=3|PronType=Prs|Style=Coll   15      iobj    15:iobj _

... compare to

# sent_id = reviews-018548-0005
6       em      they    PRON    PRP     Case=Acc|Number=Plur|Person=3|PronType=Prs|Style=Coll|Typo=Yes  5       obj     5:obj   CorrectForm='em

The standard in EWT seems to be to treat colloquial forms as correct, unless they are misspellings of the colloquial form itself.
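One way to check how consistently this convention holds is to list every Style=Coll token along with its Typo and CorrectForm values. The following is only a minimal sketch, assuming tab-separated CoNLL-U input; the helper names `parse_kv` and `style_tokens` are invented for illustration and are not part of any UD tooling.

```python
def parse_kv(field):
    """Parse a CoNLL-U FEATS or MISC column ("A=1|B=2" or "_") into a dict."""
    if field == "_":
        return {}
    return dict(item.split("=", 1) for item in field.split("|") if "=" in item)


def style_tokens(conllu_text, style="Coll"):
    """Yield (form, has_typo, correct_form) for tokens with the given Style value."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        feats, misc = parse_kv(cols[5]), parse_kv(cols[9])
        if feats.get("Style") == style:
            yield cols[1], feats.get("Typo") == "Yes", misc.get("CorrectForm")


# The two EWT tokens quoted above, with tabs restored.
sample = "\n".join([
    "# sent_id = reviews-255261-0010",
    "16\t'em\tthey\tPRON\tPRP\tCase=Acc|Number=Plur|Person=3|PronType=Prs|Style=Coll\t15\tiobj\t15:iobj\t_",
    "# sent_id = reviews-018548-0005",
    "6\tem\tthey\tPRON\tPRP\tCase=Acc|Number=Plur|Person=3|PronType=Prs|Style=Coll|Typo=Yes\t5\tobj\t5:obj\tCorrectForm='em",
])
rows = list(style_tokens(sample))
for row in rows:
    print(row)
```

Tokens where the second field is false and the third is empty are the ones treated as correct colloquial forms; tokens with both set are misspellings of a colloquial form.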

nschneid commented 7 months ago

The EWT decisions are documented at https://universaldependencies.org/en/pos/PRON.html

AngledLuffa commented 7 months ago

Looks like GUM could use a small update to harmonize that ~lemma~ feature, then

@amir-zeldes

amir-zeldes commented 7 months ago

It's not hard to add the Style features to GUM, I propose this depedit:

text=/^(.?em|ya)$/&lemma=/they|you/ none    #1:morph+=Style=Coll
text=/[Yy][Oo]/&lemma=/your/    none    #1:morph+=Style=Slng
text=/^([Pp]rolly|[Dd]ef(fly)?)$/&xpos=/RB/ none    #1:morph+=Style=Slng
text=/^[Aa]i$/&lemma=/be/   none    #1:morph+=Style=Vrnc
text=/^[Yy].?all$/&xpos=/PRP/   none    #1:morph+=Style=Vrnc
text=/.*in'?/&xpos=/VBG/    none    #1:morph+=Style=Vrnc
text=/^(.?c[ou]z|.?cause)$/&xpos=/IN/   none    #1:morph+=Style=Vrnc
lemma=/thy|thou/    none    #1:morph+=Style=Arch
lemma=/you/&text=/[Yy]e/    none    #1:morph+=Style=Arch
text=/^([Ww]ilt|[Aa]rt|[Dd]ost)$/   none    #1:morph+=Style=Arch
text=/.*th/&xpos=/VBZ/  none    #1:morph+=Style=Arch
text=/^([Hh]mm+|[Ss]oo+|.*eee)$/    none    #1:morph+=Style=Expr

That should catch pretty much the same things as EWT, I think. However, as I've pointed out before, GUM doesn't really have a Typo/CorrectForm annotation: these are just projected from the target hypotheses layer, which aims to normalize sentences to what the annotator considers to be 'standard English' (with some guidelines). This definitely includes 'em and ya, so those are all currently Typo, and if we want them not to be, we'd need to automate it somehow.
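For readers without DepEdit at hand, the logic of a few of these rules can be approximated in plain Python. This is a rough sketch and not DepEdit itself: it mirrors only four of the rules, it assumes each pattern must match the whole attribute value (hence `re.fullmatch`), and the names `RULES`, `style_for`, and `add_style` are hypothetical.

```python
import re

# Each rule: conditions on token attributes (all must match) and the Style value to add.
# Only a subset of the proposed rules is mirrored here.
RULES = [
    ({"text": r".?em|ya", "lemma": r"they|you"}, "Coll"),
    ({"text": r"[Yy].?all", "xpos": r"PRP"}, "Vrnc"),
    ({"lemma": r"thy|thou"}, "Arch"),
    ({"text": r"[Hh]mm+|[Ss]oo+|.*eee"}, "Expr"),
]


def style_for(token):
    """Return the Style value of the first rule whose conditions all match, else None."""
    for conditions, style in RULES:
        if all(re.fullmatch(pat, token[attr]) for attr, pat in conditions.items()):
            return style
    return None


def add_style(feats, style):
    """Add Style=<style> to a FEATS string, keeping features alphabetically sorted."""
    items = [] if feats == "_" else feats.split("|")
    if any(item.startswith("Style=") for item in items):
        return feats  # leave an existing Style value alone
    items.append("Style=" + style)
    return "|".join(sorted(items, key=str.lower))


token = {"text": "ya", "lemma": "you", "xpos": "PRP"}
print(add_style("Case=Acc|Number=Sing|Person=2|PronType=Prs|Typo=Yes", style_for(token)))
```

Because the first matching rule wins, rule order matters in this sketch; whether that matches DepEdit's own behavior for overlapping rules is not something the thread establishes.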

If others approve of the depedit option above for Style, I could just categorically say that things with Style=.* are not Typos, and just leave the CorrectForm for them?
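The categorical option could look roughly like the following sketch, which drops Typo=Yes from any token that carries a Style feature and leaves the MISC column (including CorrectForm) untouched. The function name `drop_typo_if_styled` is hypothetical, and the example token is the GUM line quoted earlier with Style=Coll added by hand and the MISC column shortened. Note that a blanket rule like this cannot tell apart genuine misspellings of colloquial forms, such as the `em` for `'em` token in EWT.

```python
def drop_typo_if_styled(line):
    """Remove Typo=Yes from a CoNLL-U token line (no trailing newline) if it has a Style feature."""
    cols = line.split("\t")
    if line.startswith("#") or len(cols) != 10:
        return line  # comments and non-token lines pass through unchanged
    feats = [] if cols[5] == "_" else cols[5].split("|")
    has_style = any(feat.startswith("Style=") for feat in feats)
    if has_style and "Typo=Yes" in feats:
        feats.remove("Typo=Yes")
        cols[5] = "|".join(feats) or "_"
    return "\t".join(cols)


# Hypothetical input: the GUM token after a Style=Coll feature has been added.
before = "34\tya\tyou\tPRON\tPRP\tCase=Acc|Number=Sing|Person=2|PronType=Prs|Style=Coll|Typo=Yes\t33\tobj\t33:obj\tCorrectForm=you"
after = drop_typo_if_styled(before)
print(after)
```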

nschneid commented 7 months ago

It is theoretically possible to have Style+Typo: EWT has "ya'll" which we consider a misspelling of "y'all". But the combination is presumably very rare.

rhdunn commented 7 months ago

Note that with Style=Arch the docs specify ModernForm=... instead of CorrectForm=....
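If archaic tokens had been given CorrectForm by an automatic projection, the key could be renamed mechanically. The snippet below is only an illustrative sketch under that assumption: `use_modern_form` is a made-up helper, and the example token is invented for the demonstration, not taken from EWT or GUM.

```python
def use_modern_form(line):
    """On Style=Arch token lines, rename CorrectForm=... to ModernForm=... in MISC."""
    cols = line.split("\t")
    if line.startswith("#") or len(cols) != 10:
        return line  # comments and non-token lines pass through unchanged
    if "Style=Arch" not in cols[5].split("|"):
        return line
    prefix = "CorrectForm="
    cols[9] = "|".join(
        "ModernForm=" + item[len(prefix):] if item.startswith(prefix) else item
        for item in cols[9].split("|")
    )
    return "\t".join(cols)


# Invented example token, for illustration only.
archaic = "3\tthou\tthou\tPRON\tPRP\tCase=Nom|Number=Sing|Person=2|PronType=Prs|Style=Arch\t4\tnsubj\t4:nsubj\tCorrectForm=you"
print(use_modern_form(archaic))
```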

amir-zeldes commented 7 months ago

OK, added the Style items from EWT to GUM - it would be nice to catch items in GUM that are not in EWT, but I'll leave that for another time...