UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Missing Style and CorrectForm annotations on RB tokens #472

Closed rhdunn closed 10 months ago

rhdunn commented 10 months ago

The following are missing Style and CorrectForm annotations, flagged by RB lemma validation checks:

ERROR: Sentence newsgroup-groups.google.com_RagnarokOnlineII_5730bc7888fcee99_ENG_20051122_035600-0003 token 20 -- RB lemma 'pretty' is not the lowercase form 'preety' text
ERROR: Sentence answers-20111107175720AAlb2TB_ans-0015 token 17 -- RB lemma 'basically' is not the lowercase form 'basic­ally' text
ERROR: Sentence reviews-037794-0003 token 1 -- RB lemma 'definitely' is not the lowercase form 'Def' text
ERROR: Sentence reviews-042012-0006 token 3 -- RB lemma 'forever' is not the lowercase form '4-ever' text
ERROR: Sentence reviews-018548-0002 token 5 -- RB lemma 'definitely' is not the lowercase form 'deffly' text
ERROR: Sentence reviews-018548-0004 token 12 -- RB lemma 'probably' is not the lowercase form 'prolly' text

nschneid commented 10 months ago

ERROR: Sentence answers-20111107175720AAlb2TB_ans-0015 token 17 -- RB lemma 'basically' is not the lowercase form 'basic­ally' text

This word contains a special character (a soft hyphen) which was presumably inserted by software: https://github.com/UniversalDependencies/UD_English-EWT/issues/83#issuecomment-1003266767
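
For anyone wanting to spot this kind of invisible character, here is a minimal Python sketch. It is not taken from the validator or the treebank tooling; the helper names `has_soft_hyphen` and `strip_soft_hyphens` are hypothetical. It only relies on the soft hyphen being the Unicode code point U+00AD.

```python
# Hypothetical helpers for illustration; not part of any UD tool.
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, usually rendered as nothing


def has_soft_hyphen(form):
    """Return True if the token form contains a soft hyphen."""
    return SOFT_HYPHEN in form


def strip_soft_hyphens(form):
    """Return the token form with every soft hyphen removed."""
    return form.replace(SOFT_HYPHEN, "")


if __name__ == "__main__":
    form = "basic\u00adally"  # the form from the error message above
    print(has_soft_hyphen(form))            # the character is present
    print(strip_soft_hyphens(form) == "basically")
```

With the soft hyphen stripped, the form compares equal to the lemma 'basically', which is why the mismatch is invisible when reading the error message.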

ERROR: Sentence reviews-037794-0003 token 1 -- RB lemma 'definitely' is not the lowercase form 'Def' text

ERROR: Sentence reviews-018548-0002 token 5 -- RB lemma 'definitely' is not the lowercase form 'deffly' text

ERROR: Sentence reviews-018548-0004 token 12 -- RB lemma 'probably' is not the lowercase form 'prolly' text

These are Abbr=Yes. I think I'll add Style=Slng (arguably the speaker is trying to sound "hip") and CorrectForm.

ERROR: Sentence reviews-042012-0006 token 3 -- RB lemma 'forever' is not the lowercase form '4-ever' text

Does this deserve some sort of Style? I can't tell if it's meant to be cutesy (Style=Expr) or is just an abbreviation.

ERROR: Sentence newsgroup-groups.google.com_RagnarokOnlineII_5730bc7888fcee99_ENG_20051122_035600-0003 token 20 -- RB lemma 'pretty' is not the lowercase form 'preety' text

Made this Style=Expr though it could just be a typo.

rhdunn commented 10 months ago

I'm happy for 4-ever to remain Abbr=Yes. That case is missing a CorrectForm annotation, so my validator does not know it is an abbreviation for "forever". Other abbreviations that are not initialisms (unlike BRB, PS, AD, etc.) have a CorrectForm annotation, e.g. "Sept.".
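
To illustrate why the missing CorrectForm matters, here is a rough sketch of a check of this kind. It is not the actual validator code: the function names `parse_misc` and `rb_lemma_ok` are hypothetical, and it assumes CorrectForm is stored as a `Key=Value` pair in the `|`-separated MISC column of the CoNLL-U token line.

```python
# Illustrative sketch only; names and logic are assumptions, not the real validator.

def parse_misc(misc):
    """Parse a CoNLL-U MISC string such as 'CorrectForm=forever|SpaceAfter=No'."""
    if not misc or misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def rb_lemma_ok(form, lemma, misc="_"):
    """Check an RB lemma against the lowercased form.

    If MISC carries CorrectForm, compare against that instead of the
    surface form (assumed behaviour for this sketch).
    """
    expected = parse_misc(misc).get("CorrectForm", form).lower()
    return lemma == expected


if __name__ == "__main__":
    # Without CorrectForm the check has nothing linking '4-ever' to 'forever'.
    print(rb_lemma_ok("4-ever", "forever"))
    # With CorrectForm the lemma can be verified.
    print(rb_lemma_ok("4-ever", "forever", "CorrectForm=forever"))
```

Under these assumptions, a token like 'prolly' with lemma 'probably' fails the check until `CorrectForm=probably` is added, which matches the errors listed at the top of the issue.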