rhdunn closed 11 months ago
CorrectLemma is a wontfix for now.
As for the Typo issue, this all goes back to the target hypothesis origin of these annotations. "u" and "c'm" are definitely not the standard spellings for those words, so I have no qualms about giving them a CorrectForm. One option would be to give them a Style, and then say that any item with a Style has no Typo (but can have CorrectForm). If that sounds good I could implement that.
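A rough sketch of how the proposed "any item with a Style has no Typo" rule might be checked. It assumes features arrive as a CoNLL-U style `Key=Value|Key=Value` string; the function names `parse_feats` and `violates_style_typo_rule` are made up for illustration and are not part of any existing validator.

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Parse a 'Key=Value|Key=Value' feature string into a dict.

    An empty string or '_' (the CoNLL-U placeholder) yields an empty dict.
    """
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def violates_style_typo_rule(feats: str) -> bool:
    """Return True if a token has both a Style value and Typo=Yes.

    This encodes only the rule proposed in the comment above; CorrectForm
    is deliberately not checked, since the proposal still allows it.
    """
    parsed = parse_feats(feats)
    return "Style" in parsed and parsed.get("Typo") == "Yes"
```

Under this sketch, a token annotated `Style=Vrnc|Typo=Yes` would be flagged, while `Style=Vrnc` or `Typo=Yes` alone would pass.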
I don't understand the CorrectLemma suggestion (aren't all lemmas "correct" by definition?).
If it's "c'mon", I consider that a standard spelling for a vernacular pronunciation. Style=Vrnc would make sense. (Although EWT currently has Abbr=Yes. I'll change that.)
I have been treating "u" as an abbreviated form (Abbr=Yes).
I suppose that these are all derivable. My concern is really whether the exception list for these is an open or closed set.
For instance, consider forms using h-dropping, g-dropping, etc. in something like the following eye-dialect sentence: "I'm goin' t'otel later t'night wiv me bruvva." The question is then how much (if any) of that is expected to be understood by NLP tools. If the answer is all of it, then these really form an open set and lemmatizers and other tools will be expected to understand all these variations.
The use of CorrectLemma is then in the cases that are not in that closed set saying effectively "this is what the lemma is if the NLP software can understand the given eye dialect, vernacular, slang, etc. forms".
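To make the open-versus-closed distinction concrete, here is a minimal sketch contrasting a fixed exception list with a productive rule. Both the `CLOSED_SET` table and the g-dropping pattern are illustrative assumptions, not taken from any actual lemmatizer or treebank tooling.

```python
import re
from typing import Optional

# Hypothetical closed exception list: only these exact forms are known.
CLOSED_SET = {"u": "you"}


def normalize_closed(form: str) -> Optional[str]:
    """Look the form up in the fixed list; unknown forms return None."""
    return CLOSED_SET.get(form.lower())


def normalize_g_dropping(form: str) -> Optional[str]:
    """Hypothetical productive rule: rewrite a final "in'" as "ing".

    Because it is a pattern rather than a list, it applies to any word
    of this shape, which is what makes the set of forms open.
    """
    match = re.fullmatch(r"(\w+)in'", form)
    return f"{match.group(1)}ing" if match else None
```

The closed list handles "u" but knows nothing about "goin'", whereas the rule handles "goin'" and any other word of that shape. Forms such as "t'night" or "bruvva" would each need yet another rule, which is the concern about tools being expected to cover every variation.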
I think of the lemma as the canonical dictionary headword where the meaning of a word would be listed. The question is how we consider our imaginary dictionary to be organized and scoped. If there are two very well established standards that appear in different texts (e.g. UK "realise" vs. US "realize"), then maybe we're essentially talking about different dictionaries, and both are valid as lemmas. If we're talking about a few informal or less standard forms mixed in, even due to productive spelling/pronunciation rules in certain dialects, I would expect a reader to be able to figure out the canonical spelling, so I'd apply normalization to the lemma. (At least in the context of the corpora I have worked with, which have a relatively low rate of non-"standard" spellings. If we were annotating an AAE corpus, we'd probably assume a different imaginary dictionary.)
OK, will go with Style=Vrnc for c'mon and Abbr for u.
These are not typos so should not have Typo=Yes, but should have a Style attribute.
They should also use CorrectLemma, with the possible exception of "u".
This appears to be a perfectly cromulent user-constructed word and not a typo.