UniversalDependencies / UD_English-GUMReddit

Incorrect lemmatization of several tokens #16

Open rhdunn opened 9 months ago

rhdunn commented 9 months ago

less -> little

ERROR: Sentence GUM_reddit_macroeconomics-46 token 18 -- RBR lemma 'less' does not match lemma-exception applied to form 'less', expected 'little'
ERROR: Sentence GUM_reddit_callout-20 token 18 -- RBR lemma 'less' does not match lemma-exception applied to form 'less', expected 'little'
ERROR: Sentence GUM_reddit_callout-32 token 5 -- RBR lemma 'less' does not match lemma-exception applied to form 'less', expected 'little'
ERROR: Sentence GUM_reddit_card-44 token 5 -- RBR lemma 'less' does not match lemma-exception applied to form 'less', expected 'little'
ERROR: Sentence GUM_reddit_gender-48 token 6 -- RBR lemma 'less' does not match lemma-exception applied to form 'less', expected 'little'
ERROR: Sentence GUM_reddit_introverts-9 token 14 -- RBR lemma 'less' does not match lemma-exception applied to form 'less', expected 'little'
ERROR: Sentence GUM_reddit_introverts-9 token 18 -- RBR lemma 'less' does not match lemma-exception applied to form 'less', expected 'little'
ERROR: Sentence GUM_reddit_racial-25 token 4 -- RBR lemma 'less' does not match lemma-exception applied to form 'less', expected 'little'
ERROR: Sentence GUM_reddit_ring-64 token 20 -- RBR lemma 'less' does not match lemma-exception applied to form 'less', expected 'little'
ERROR: Sentence GUM_reddit_stroke-20 token 2 -- RBR lemma 'less' does not match lemma-exception applied to form 'less', expected 'little'

other irregular

ERROR: Sentence GUM_reddit_macroeconomics-50 token 21 -- JJR lemma 'further' does not match lemma-exception applied to form 'further', expected 'far'
ERROR: Sentence GUM_reddit_escape-28 token 4 -- VBD lemma 'thought' does not match lemma-exception applied to form 'THOUGHT', expected 'think'

pronouns

ERROR: Sentence GUM_reddit_introverts-26 token 16 -- PRP lemma 'mine' does not match lemma-exception applied to form 'mine', expected 'my'

mismatched for the part of speech

ERROR: Sentence GUM_reddit_introverts-12 token 4 -- NN lemma 'date' does not match lowercase-form applied to form 'dating', expected 'dating'
ERROR: Sentence GUM_reddit_polygraph-1 token 11 -- NN lemma 'device' does not match lowercase-form applied to form 'Devices', expected 'devices'

UK vs US

These are being lemmatized to the US spelling.

ERROR: Sentence GUM_reddit_callout-40 token 25 -- VBN lemma 'prioritize' does not match past-participle-verb applied to form 'prioritised', expected 'prioritise'
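
For readers unfamiliar with these messages, the first two rule names (`lemma-exception` and `lowercase-form`) can be approximated with a small lookup-then-fallback check. This is only a sketch under assumptions: the table `LEMMA_EXCEPTIONS` and the functions `expected_lemma` and `check` are hypothetical names, the table holds only the pairs quoted in this issue, and the real validator's logic may differ. The `past-participle-verb` rule is not modelled.

```python
# Hypothetical exception table keyed by (XPOS, lowercased form).
# Entries are copied from the error messages quoted in this issue.
LEMMA_EXCEPTIONS = {
    ("RBR", "less"): "little",
    ("JJR", "further"): "far",
    ("VBD", "thought"): "think",
    ("PRP", "mine"): "my",
}


def expected_lemma(form, xpos):
    """Return (rule_name, expected_lemma), or None if no rule applies."""
    lower = form.lower()
    if (xpos, lower) in LEMMA_EXCEPTIONS:
        return "lemma-exception", LEMMA_EXCEPTIONS[(xpos, lower)]
    if xpos == "NN":
        # Assumption: a singular common noun's lemma is its lowercased form.
        return "lowercase-form", lower
    return None


def check(sent_id, token_id, form, xpos, lemma):
    """Return an error string in the style used above, or None if consistent."""
    result = expected_lemma(form, xpos)
    if result is None:
        return None
    rule, expected = result
    if lemma == expected:
        return None
    return (
        f"ERROR: Sentence {sent_id} token {token_id} -- {xpos} lemma '{lemma}' "
        f"does not match {rule} applied to form '{form}', expected '{expected}'"
    )


if __name__ == "__main__":
    print(check("GUM_reddit_escape-28", 4, "THOUGHT", "VBD", "thought"))
    print(check("GUM_reddit_polygraph-1", 11, "Devices", "NN", "device"))
```

Under this reading, a `lowercase-form` failure such as `Devices`/`device` suggests the tag (NN rather than NNS) is what is wrong, not the lemma.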

amir-zeldes commented 9 months ago

The lemma of less is currently not little in any corpus I know. It's conceivable I suppose, but if we change this corpus it would be out of sync with all the others, so I wouldn't unless there's a big push to do it.

For further I recall some principled decision to only lemmatize farther/further if it's in a context that could take "far", whereas the discourse adverb is lemmatized "further". I could be wrong, but it looks like EWT is the same.

The other errors are fixed, except for the UK thing - I'm not sure whether we want to lemmatize UK+US spellings together or not. UK spelling is also accepted without <sic ana> in GUM as a policy. @nschneid ?

nschneid commented 9 months ago

I don't recall a practice of normalizing UK vs. US spellings in the lemma, and EWT has both "realize" and "realise" for example. If it's important to group them together I'd probably do it via a new MISC feature rather than mess with lemmas (since lemmatizers are probably not trained to normalize by default?).
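
As a rough illustration of the MISC-feature idea, the sketch below appends a key to the tenth (MISC) column of a CoNLL-U token line and leaves the LEMMA column untouched. The feature name `NormLemma` and the function `add_misc_feature` are made up for this example and are not an existing UD convention. The sample line fills in only the token ID, form, lemma and XPOS reported in this issue, with `_` for everything else.

```python
def add_misc_feature(conllu_line, key, value):
    """Return the token line with key=value set in the MISC column."""
    cols = conllu_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected a 10-column CoNLL-U token line")
    # MISC is "_" when empty, otherwise a "|"-separated list of items.
    misc = [] if cols[9] == "_" else cols[9].split("|")
    # Replace any existing value for this key rather than duplicating it.
    misc = [item for item in misc if not item.startswith(key + "=")]
    misc.append(f"{key}={value}")
    cols[9] = "|".join(misc)
    return "\t".join(cols)


if __name__ == "__main__":
    # Columns not mentioned in the issue are left as "_" placeholders.
    line = "25\tprioritised\tprioritise\t_\tVBN\t_\t_\t_\t_\t_"
    # "NormLemma" is a hypothetical feature name, used only for illustration.
    print(add_misc_feature(line, "NormLemma", "prioritize"))
```

Keeping the grouping in MISC means the lemma stays `prioritise`, so a lemmatizer trained on the corpus is not taught to normalize spelling.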