Typo=Yes is used for tokens that are not typos

rhdunn commented 11 months ago

These are not typos, so they should not have Typo=Yes; they should have a Style attribute instead:

ERROR: Sentence GUM_reddit_social-44 token 13 -- PRP lemma 'you' does not match lowercase-form applied to form 'u', expected 'u'
ERROR: Sentence GUM_reddit_bobby-23 token 12 -- VB lemma 'come' does not match lowercase-form applied to form 'c’m', expected 'c'm'
ERROR: Sentence GUM_reddit_macroeconomics-6 token 6 -- NN lemma 'government' does not match lowercase-form applied to form 'govt', expected 'govt'
ERROR: Sentence GUM_reddit_macroeconomics-7 token 18 -- NN lemma 'government' does not match lowercase-form applied to form 'govt', expected 'govt'
ERROR: Sentence GUM_reddit_macroeconomics-15 token 3 -- NN lemma 'government' does not match lowercase-form applied to form 'gov't', expected 'gov't'
ERROR: Sentence GUM_reddit_macroeconomics-18 token 13 -- NN lemma 'government' does not match lowercase-form applied to form 'gov't', expected 'gov't'
ERROR: Sentence GUM_reddit_macroeconomics-20 token 3 -- NN lemma 'government' does not match lowercase-form applied to form 'gov't', expected 'gov't'
ERROR: Sentence GUM_reddit_macroeconomics-23 token 11 -- NN lemma 'government' does not match lowercase-form applied to form 'gov't', expected 'gov't'
ERROR: Sentence GUM_reddit_macroeconomics-24 token 4 -- NN lemma 'government' does not match lowercase-form applied to form 'gov't', expected 'gov't'
ERROR: Sentence GUM_reddit_macroeconomics-24 token 18 -- NN lemma 'government' does not match lowercase-form applied to form 'gov't', expected 'gov't'
ERROR: Sentence GUM_reddit_macroeconomics-25 token 3 -- NN lemma 'government' does not match lowercase-form applied to form 'gov't', expected 'gov't'
ERROR: Sentence GUM_reddit_macroeconomics-25 token 20 -- NN lemma 'government' does not match lowercase-form applied to form 'gov't', expected 'gov't'
ERROR: Sentence GUM_reddit_macroeconomics-30 token 16 -- NN lemma 'government' does not match lowercase-form applied to form 'gov't', expected 'gov't'
ERROR: Sentence GUM_reddit_macroeconomics-39 token 27 -- NN lemma 'government' does not match lowercase-form applied to form 'gov't', expected 'gov't'
ERROR: Sentence GUM_reddit_macroeconomics-54 token 17 -- NN lemma 'government' does not match lowercase-form applied to form 'gov't', expected 'gov't'
ERROR: Sentence GUM_reddit_macroeconomics-54 token 27 -- NN lemma 'government' does not match lowercase-form applied to form 'gov't', expected 'gov't'

They should also use CorrectLemma, with the possible exception of "u".

This appears to be a perfectly cromulent user-constructed word and not a typo.

ERROR: Sentence GUM_reddit_space-37 token 15 -- VBN lemma 'confabulate' does not match past-participle-verb applied to form 'confuggulated', expected 'confuggulate'

amir-zeldes commented 11 months ago

CorrectLemma is a wontfix for now.

As for the Typo issue, this all goes back to the target-hypothesis origin of these annotations. "u" and "c'm" are definitely not the standard spellings for those words, so I have no qualms about giving them a CorrectForm. One option would be to give them a Style, and then say that any item with a Style has no Typo (but can still have CorrectForm). If that sounds good, I could implement that.
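To make the proposed rule concrete ("any item with a Style has no Typo"), here is a minimal sketch of a consistency check over CoNLL-U token lines. It assumes Style and Typo live in the FEATS column and CorrectForm in MISC; the function names and the two sample tokens are made up for illustration and are not taken from the corpus or from the actual validator.

```python
def parse_feats(column):
    """Turn a CoNLL-U FEATS column into a dict ('_' means no features)."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|"))


def style_typo_conflicts(conllu_text):
    """Return (token id, form) pairs for tokens carrying both Style and Typo=Yes."""
    conflicts = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        feats = parse_feats(cols[5])
        if "Style" in feats and feats.get("Typo") == "Yes":
            conflicts.append((cols[0], cols[1]))
    return conflicts


# Hypothetical token lines, invented for this example only.
sample = "\n".join([
    "\t".join(["1", "c'm", "come", "VERB", "VB", "Style=Vrnc|Typo=Yes",
               "0", "root", "_", "CorrectForm=come"]),
    "\t".join(["2", "u", "you", "PRON", "PRP", "Abbr=Yes",
               "1", "nsubj", "_", "CorrectForm=you"]),
])

print(style_typo_conflicts(sample))  # [('1', "c'm")]
```

Under this rule the first token would be flagged until Typo=Yes is dropped; CorrectForm in MISC is left alone.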

nschneid commented 11 months ago

I don't understand the CorrectLemma suggestion (aren't all lemmas "correct" by definition?).

If it's "c'mon", I consider that a standard spelling for a vernacular pronunciation. Style=Vrnc would make sense. (Although EWT currently has Abbr=Yes. I'll change that.)

I have been treating "u" as an abbreviated form (Abbr=Yes).

rhdunn commented 11 months ago

I suppose that these are all derivable. My concern is really whether the exception list for these is an open or closed set.

For instance, consider forms using h-dropping, g-dropping, etc. in eye dialect, as in: "I'm goin' t'otel later t'night wiv me bruvva." The question is then how much (if any) of that NLP tools are expected to understand. If the answer is all of it, then these really form an open set, and lemmatizers and other tools will be expected to understand all these variations.

CorrectLemma would then be used for the cases outside that closed set, effectively saying "this is what the lemma is if the NLP software can understand the given eye dialect, vernacular, slang, etc. forms".
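A rough sketch of the closed-set reading, to show where it stops working: a fixed table maps known non-standard forms to their lemmas, and everything else falls back to the lowercased form. The table contents and the names `KNOWN_VARIANTS` and `expected_lemma` are hypothetical; this is not how the validator is actually implemented.

```python
# Hypothetical closed exception list: non-standard form -> canonical lemma.
KNOWN_VARIANTS = {
    "u": "you",
    "govt": "government",
    "gov't": "government",
}


def expected_lemma(form):
    """Look the lowercased form up in the closed list, else return it unchanged."""
    key = form.lower()
    return KNOWN_VARIANTS.get(key, key)


print(expected_lemma("govt"))    # government
print(expected_lemma("bruvva"))  # bruvva
```

Listed forms resolve to the canonical lemma, but a productive eye-dialect form such as "bruvva" falls through unchanged, which is the open-set problem: the table would need to grow without bound, or the tool would need real normalization rules.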

nschneid commented 11 months ago

I think of the lemma as the canonical dictionary headword where the meaning of a word would be listed. The question is how we consider our imaginary dictionary to be organized and scoped. If there are two very well established standards that appear in different texts (e.g. UK "realise" vs. US "realize"), then maybe we're essentially talking about different dictionaries, and both are valid as lemmas. If we're talking about a few informal or less standard forms mixed in, even due to productive spelling/pronunciation rules in certain dialects, I would expect a reader to be able to figure out the canonical spelling, so I'd apply normalization to the lemma. (At least in the context of the corpora I have worked with, which have a relatively low rate of non-"standard" spellings. If we were annotating an AAE corpus, we'd probably assume a different imaginary dictionary.)

amir-zeldes commented 11 months ago

OK, will go with Style=Vrnc for c'mon and Abbr for u.

UniversalDependencies / UD_English-GUMReddit

Typo=Yes is used for tokens that are not typos #12