UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Incorrect lemma casing for the assigned part of speech #486

Closed by rhdunn 2 months ago

rhdunn commented 10 months ago

common noun adjectives

ERROR: Sentence answers-20111107092617AAgKm4X_ans-0015 token 15 -- JJ lemma 'South' does not match lowercase-form applied to form 'South', expected 'south'
ERROR: Sentence answers-20111107092617AAgKm4X_ans-0022 token 3 -- JJ lemma 'South' does not match lowercase-form applied to form 'South', expected 'south'
ERROR: Sentence answers-20111107092617AAgKm4X_ans-0024 token 3 -- JJ lemma 'South' does not match lowercase-form applied to form 'SOUTH', expected 'south'
ERROR: Sentence reviews-342807-0001 token 7 -- JJ lemma 'West' does not match lowercase-form applied to form 'West', expected 'west'
ERROR: Sentence reviews-342807-0002 token 13 -- JJ lemma 'West' does not match lowercase-form applied to form 'West', expected 'west'
ERROR: Sentence reviews-342807-0004 token 16 -- JJ lemma 'West' does not match lowercase-form applied to form 'West', expected 'west'

proper noun adjectives

ERROR: Sentence reviews-047184-0004 token 13 -- JJ lemma 'latin' does not match lemma-exception applied to form 'LATIN', expected 'Latin'
ERROR: Sentence reviews-047184-0005 token 25 -- JJ lemma 'latin' does not match lemma-exception applied to form 'Latin', expected 'Latin'

currencies

ERROR: Sentence answers-20111108064026AA86V9T_ans-0003 token 29 -- $ lemma 'rs' does not match unmodified-form applied to form 'Rs', expected 'Rs'
ERROR: Sentence answers-20111108064026AA86V9T_ans-0004 token 18 -- $ lemma 'rs' does not match unmodified-form applied to form 'Rs', expected 'Rs'
ERROR: Sentence answers-20111108064026AA86V9T_ans-0027 token 9 -- $ lemma 'rs' does not match unmodified-form applied to form 'Rs', expected 'Rs'

lists

The practice is not to lowercase Roman numerals, so letter-form list markers should not be lowercased either:

ERROR: Sentence email-enronsent36_02-0075 token 28 -- LS lemma 'a' does not match normalized-form applied to form 'A', expected 'A'
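
To make the casing rules behind these messages concrete, here is a minimal sketch of how such a check might work. The rule names (lowercase-form, lemma-exception, unmodified-form, normalized-form) are taken from the error messages in this issue. Everything else is an assumption for illustration: the function names, the tag-to-rule mapping, the exception table, and the guess that normalized-form leaves a list marker's casing unchanged. It is not the actual validator.

```python
# Hypothetical sketch, not the real validation script.
# Rule names come from the error messages in this issue; the mapping from
# tag to rule and the exception table below are assumptions.

# Proper-noun adjectives that keep an initial capital (assumed table).
LEMMA_EXCEPTIONS = {("latin", "JJ"): "Latin"}


def expected_lemma(form, xpos):
    """Return (rule_name, expected_lemma) for a token's form and XPOS tag."""
    key = (form.lower(), xpos)
    if key in LEMMA_EXCEPTIONS:
        return "lemma-exception", LEMMA_EXCEPTIONS[key]
    if xpos == "$":
        # Currency tokens such as 'Rs' keep the casing of the form.
        return "unmodified-form", form
    if xpos == "LS":
        # Assumption: normalization keeps the casing of list markers ('A' -> 'A').
        return "normalized-form", form
    # Default for other tags such as JJ: lowercase the form.
    return "lowercase-form", form.lower()


def check(sent_id, token_id, form, xpos, lemma):
    """Return an error message in the style shown above, or None if the lemma matches."""
    rule, expected = expected_lemma(form, xpos)
    if lemma == expected:
        return None
    return (
        f"ERROR: Sentence {sent_id} token {token_id} -- {xpos} lemma '{lemma}' "
        f"does not match {rule} applied to form '{form}', expected '{expected}'"
    )


print(check("email-enronsent36_02-0075", 28, "A", "LS", "a"))
```

A purely tag-based rule like this cannot tell when an adjective is part of a proper name, so hits from the lowercase-form rule still need a manual look.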

other

This is an x used as a replacement/alternative for a times symbol:

ERROR: Sentence email-enronsent36_02-0050 token 10 -- IN lemma 'X' does not match lowercase-form applied to form 'X', expected 'x'

nschneid commented 2 months ago

common noun adjectives

These are actually adjectives within proper names (South Vietnamese, West Indian) so I think the capitalization is correct.

This is an x used as a replacement/alternative for a times symbol:

Changed the tagging to SYM/SYM as in other tokens where this symbol is used.

Implemented all the other suggestions, thanks!