UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Missing Style=Expr for expressive word forms #493

Open rhdunn opened 7 months ago

The following tokens are expressive spellings, not abbreviations, so they should be annotated Style=Expr instead of Abbr=Yes.

See https://universaldependencies.org/u/feat/Style.html#expr-expressive-emotional:

Kinds of expressive spelling variation include: expressive lengthening (niiiiice), dialectal or colloquial pronunciation (Hahvahd), censored characters (sh*t), symbolic characters (CA$H), etc. As CA$H defies typographical convention it should also be labeled Typo=Yes.

These are also missing CorrectForm annotations:

dialectal or colloquial pronunciation

thru -> through

ERROR: Sentence email-enronsent42_01-0075 token 22 -- IN/Abbr=Yes lemma 'through' does not match lowercase-form applied to form 'thru', expected 'thru'
ERROR: Sentence answers-20111106215236AAycANO_ans-0024 token 2 -- IN/Abbr=Yes lemma 'through' does not match lowercase-form applied to form 'thru', expected 'thru'
ERROR: Sentence answers-20111106215236AAycANO_ans-0029 token 2 -- IN/Abbr=Yes lemma 'through' does not match lowercase-form applied to form 'thru', expected 'thru'
ERROR: Sentence reviews-348369-0006 token 15 -- IN/Abbr=Yes lemma 'through' does not match lowercase-form applied to form 'thru', expected 'thru'
ERROR: Sentence reviews-360937-0002 token 28 -- IN/Abbr=Yes lemma 'through' does not match lowercase-form applied to form 'thru', expected 'thru'

luv -> love

ERROR: Sentence reviews-042012-0005 token 19 -- NN/Abbr=Yes lemma 'love' does not match uppercase-form applied to form 'luv', expected 'LUV'
ERROR: Sentence reviews-042012-0006 token 11 -- NN/Abbr=Yes lemma 'love' does not match uppercase-form applied to form 'luv', expected 'LUV'
ERROR: Sentence reviews-042012-0007 token 1 -- NN/Abbr=Yes lemma 'love' does not match uppercase-form applied to form 'Luv', expected 'LUV'

Others:

WARN: Sentence answers-20111108075853AAUIKRQ_ans-0004 token 4 -- JJ/Abbr=Yes lemma 'good' does not have a validation rule for form 'gud'
WARN: Sentence answers-20111108102621AA3hPqj_ans-0007 token 15 -- JJ/Abbr=Yes lemma 'little' does not have a validation rule for form 'lil'
WARN: Sentence answers-20111108071852AAxbh5F_ans-0015 token 1 -- DT/Abbr=Yes lemma 'that' does not have a validation rule for form 'dat'
ERROR: Sentence reviews-038358-0003 token 20 -- IN/Abbr=Yes lemma 'for' does not match lowercase-form applied to form 'fo', expected 'fo'
ERROR: Sentence reviews-159371-0005 token 12 -- IN/Abbr=Yes lemma 'though' does not match lowercase-form applied to form 'tho', expected 'tho'

The following should have "because" as the lemma and CorrectForm:

ERROR: Sentence answers-20111107224336AAxQbzk_ans-0002 token 1 -- IN/Abbr=Yes lemma 'cause' does not match lowercase-form applied to form 'cos', expected 'cos'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 40 -- IN/Abbr=Yes lemma 'because' does not match lowercase-form applied to form 'coz', expected 'coz'
ERROR: Sentence reviews-018548-0006 token 9 -- IN/Abbr=Yes lemma 'cause' does not match lowercase-form applied to form 'cus', expected 'cus'
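
As a rough illustration of the change proposed for these tokens, here is a minimal Python sketch that rewrites a single CoNLL-U token line. It rests on several assumptions that are not stated in this issue: that Abbr=Yes is dropped when Style=Expr is added, that CorrectForm is stored in the MISC column, and that Typo=Yes goes in FEATS. The helper name `reannotate` and the sample token line are hypothetical and are not taken from the treebank.

```python
def reannotate(line, correct_form, lemma=None, typo=False):
    """Rewrite one 10-column CoNLL-U token line (hypothetical helper).

    Assumptions: Abbr is removed, Style=Expr is added to FEATS,
    CorrectForm goes into MISC, and Typo=Yes is added only on request.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected a 10-column CoNLL-U token line")

    # FEATS (column 6): parse into a dict, swap Abbr for Style=Expr.
    feats = {}
    if cols[5] != "_":
        feats = dict(f.split("=", 1) for f in cols[5].split("|"))
    feats.pop("Abbr", None)
    feats["Style"] = "Expr"
    if typo:
        feats["Typo"] = "Yes"
    # Features are written back in case-insensitive alphabetical order.
    ordered = sorted(feats.items(), key=lambda kv: kv[0].lower())
    cols[5] = "|".join(f"{k}={v}" for k, v in ordered)

    # MISC (column 10): replace any existing CorrectForm entry.
    misc = []
    if cols[9] != "_":
        misc = [m for m in cols[9].split("|") if not m.startswith("CorrectForm=")]
    misc.append(f"CorrectForm={correct_form}")
    cols[9] = "|".join(misc)

    # LEMMA (column 3): optionally normalise, e.g. 'cause' -> 'because'.
    if lemma is not None:
        cols[2] = lemma
    return "\t".join(cols)


# Made-up token line for illustration only.
sample = "1\tcos\tcause\tSCONJ\tIN\tAbbr=Yes\t4\tmark\t_\t_"
print(reannotate(sample, "because", lemma="because"))
```

The same helper could be called with `typo=True` for forms that also defy typographical convention, under the same assumptions.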

symbolic characters, etc.

These also need Typo=Yes:

ERROR: Sentence answers-20111106230959AAuYQ5Q_ans-0005 token 2 -- IN/Abbr=Yes lemma 'to' does not match lowercase-form applied to form '2', expected '2'
ERROR: Sentence reviews-351950-0002 token 5 -- IN/Abbr=Yes lemma 'for' does not match lowercase-form applied to form '4', expected '4'
ERROR: Sentence reviews-100592-0005 token 8 -- IN/Abbr=Yes lemma 'for' does not match lowercase-form applied to form '4', expected '4'
ERROR: Sentence answers-20111108075853AAUIKRQ_ans-0002 token 21 -- TO/Abbr=Yes lemma 'to' does not match lowercase-form applied to form '2', expected '2'
ERROR: Sentence answers-20111108075853AAUIKRQ_ans-0002 token 40 -- NN/Abbr=Yes lemma 'anyone' does not match uppercase-form applied to form 'any1', expected 'ANY1'
ERROR: Sentence weblog-blogspot.com_healingiraq_20050121235804_ENG_20050121_235804-0030 token 10 -- VBN lemma 'fuck' does not match past-participle-verb applied to form 'f*ed', expected 'f*'
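
The validator messages quoted in this issue all share one shape, so they could be turned into a worklist of (sentence, token, form, lemma) entries before editing the treebank. The following is a sketch only: the regular expression is inferred from the messages quoted here rather than from the validator's source, and the names `MESSAGE_RE` and `parse_message` are hypothetical.

```python
import re

# Pattern inferred from the messages quoted in this issue (an assumption,
# not the validator's documented output format).
MESSAGE_RE = re.compile(
    r"^(?P<level>ERROR|WARN): Sentence (?P<sent_id>\S+) token (?P<token>\d+) -- "
    r"(?P<tag>\S+) lemma '(?P<lemma>[^']*)' .* form '(?P<form>[^']*)'"
)


def parse_message(line):
    """Return the fields of one validator message, or None if it does not match."""
    match = MESSAGE_RE.match(line.strip())
    if match is None:
        return None
    # The tag is either 'XPOS' or 'XPOS/Feature=Value'.
    xpos, _, feats = match.group("tag").partition("/")
    return {
        "level": match.group("level"),
        "sent_id": match.group("sent_id"),
        "token": int(match.group("token")),
        "xpos": xpos,
        "feats": feats,
        "lemma": match.group("lemma"),
        "form": match.group("form"),
    }


message = (
    "ERROR: Sentence reviews-351950-0002 token 5 -- IN/Abbr=Yes lemma 'for' "
    "does not match lowercase-form applied to form '4', expected '4'"
)
print(parse_message(message))
```

Grouping the parsed entries by `form` would give one list per spelling (thru, luv, 2, 4, and so on), which matches how the messages are grouped in this report.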