UniversalDependencies / UD_English-GUM

Incorrect lemmas for some RB tokens #75

Closed rhdunn closed 9 months ago

rhdunn commented 9 months ago

The following lemma is incorrectly lemmatized:

ERROR: Sentence GUM_news_iodine-3 token 10 -- RB lemma 'mild' is not the lowercase form 'mildly' text

The following lemmas should be lower case:

ERROR: Sentence GUM_voyage_oakland-27 token 1 -- RB lemma 'South' is not the lowercase form 'South' text
ERROR: Sentence GUM_conversation_family-53 token 1 -- RB lemma 'Out' is not the lowercase form 'Out' text
ERROR: Sentence GUM_conversation_family-54 token 1 -- RB lemma 'Out' is not the lowercase form 'Out' text

The following lemmas are flagged because the CorrectForm contains multiple words:

ERROR: Sentence GUM_academic_lighting-12 token 26 -- RB lemma 'well' is not the lowercase form 'well as' text
ERROR: Sentence GUM_interview_mckenzie-27 token 3 -- RB lemma 'really' is not the lowercase form 'a really' text
ERROR: Sentence GUM_speech_floyd-36 token 10 -- RB lemma 'profoundly' is not the lowercase form 'are profoundly' text
ERROR: Sentence GUM_vlog_covid-46 token 19 -- RB lemma 'really' is not the lowercase form 'a really' text
ERROR: Sentence GUM_vlog_wine-12 token 8 -- RB lemma 'naturally' is not the lowercase form 'a naturally' text
ERROR: Sentence GUM_whow_languages-29 token 7 -- RB lemma 'not' is not the lowercase form 'are not' text

amir-zeldes commented 9 months ago

In the first case I don't think it's quite right to say that the lemma is "mildly". The sentence reads:

And the target hypothesis says it should be "mildly" in standard English. So it's the wrong choice of word, but it is still an occurrence of 'mild', so I think 'mild' is also the correct lemma (for an arguably incorrectly chosen word).

The middle group contains genuine errors; I will fix those. The last group is an issue with how <sic ana=".."> works, i.e. the discrepancy between the notion of a target hypothesis and the conllu Typo annotation. Unless there is a good alternative suggestion (ideally one that can be automated), I will leave those alone.

rhdunn commented 9 months ago

With "mild" it is annotated with Typo=Yes and CorrectForm=mildly|XML=<sic ana::"mildly"></sic> so it has been corrected to "mildly". As a result the lemma should be "mildly" to be consistent with the corrected form annotation. Otherwise, if the lemma is meant to match the FORM field, the token shouldn't be annotated as a typo with a corrected form.
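
A minimal sketch of the consistency rule described here, assuming the check compares LEMMA against the lowercased CorrectForm when one is present in MISC, and against the lowercased FORM otherwise. The function names are hypothetical and this is not the validator's actual code; it only illustrates why the three groups of tokens are flagged.

```python
def parse_misc(misc: str) -> dict[str, str]:
    """Split a CoNLL-U MISC column ('Key=Value|Key=Value') into a dict."""
    if misc == "_":
        return {}
    result = {}
    for item in misc.split("|"):
        # Split on the first '=' only; GUM writes XML attributes with '::'
        # inside MISC, so the value itself contains no further '='.
        key, _, value = item.partition("=")
        result[key] = value
    return result


def expected_rb_lemma(form: str, misc: str) -> str:
    """Lowercased CorrectForm if present, otherwise the lowercased FORM."""
    return parse_misc(misc).get("CorrectForm", form).lower()


def lemma_is_consistent(form: str, lemma: str, misc: str) -> bool:
    """True when LEMMA equals the expected lowercase form (assumed rule)."""
    return lemma == expected_rb_lemma(form, misc)


if __name__ == "__main__":
    misc = 'CorrectForm=mildly|XML=<sic ana::"mildly"></sic>'
    print(lemma_is_consistent("mild", "mild", misc))               # group 1
    print(lemma_is_consistent("South", "South", "_"))              # group 2
    print(lemma_is_consistent("well", "well", "CorrectForm=well as"))  # group 3
```

Under this assumed rule all three example tokens are reported as inconsistent: the first because LEMMA follows FORM rather than CorrectForm, the second because LEMMA is capitalized, and the third because CorrectForm spans more than one word.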

rhdunn commented 9 months ago

For the third group, ideally the missing word should be a separate empty node token (https://universaldependencies.org/format.html#words-tokens-and-empty-nodes) with the ellipsis deprel (https://universaldependencies.org/u/overview/typos.html#missing-word). I don't know how automatable this is, as I'm not familiar with the GUM infrastructure.
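
For reference, CoNLL-U distinguishes empty nodes from ordinary words by the shape of the ID column: words use integers, multiword tokens use ranges such as 1-2, and empty nodes use decimals such as 8.1. A small sketch that classifies IDs this way (the function name is hypothetical, and the patterns reflect my reading of the format page linked above):

```python
import re

# ID shapes as described in the CoNLL-U format documentation.
WORD_ID = re.compile(r"[1-9][0-9]*")                # e.g. 10
RANGE_ID = re.compile(r"[1-9][0-9]*-[1-9][0-9]*")   # e.g. 1-2
EMPTY_ID = re.compile(r"[0-9]+\.[1-9][0-9]*")       # e.g. 8.1 or 0.1


def id_kind(token_id: str) -> str:
    """Classify a CoNLL-U ID as 'word', 'multiword', 'empty', or 'invalid'."""
    if WORD_ID.fullmatch(token_id):
        return "word"
    if RANGE_ID.fullmatch(token_id):
        return "multiword"
    if EMPTY_ID.fullmatch(token_id):
        return "empty"
    return "invalid"


if __name__ == "__main__":
    for token_id in ("10", "1-2", "8.1", "0.1", "abc"):
        print(token_id, id_kind(token_id))
```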

amir-zeldes commented 9 months ago

> With "mild" it is annotated with Typo=Yes and CorrectForm=mildly|XML=<sic ana::"mildly"> so it has been corrected to "mildly"

This is the conllu output from the upstream GUM data, yes. But conceptually, all of GUM's error annotations come from the 'target hypothesis' paradigm I mentioned above, which works a bit differently. It's been mapped onto 'Typo' and 'CorrectForm' because that seemed like the best match for those annotations in the UD ecosystem, but they are not quite the same, and this is one of the cases where the mismatch shows up.

The way I see it, this isn't really a typo so much as non-standard language or incorrect word choice. Consider what might happen if a non-native speaker writes "of my opinion" instead of "in my opinion". In GUM's native "sic" layer, you would get the correction "of" -> "in". I would even go so far as to say it's not wrong to say the CorrectForm is "in", but I wouldn't want to say that the lemma of the token "of" is "in". Does that make sense?

> For the third group, ideally the missing word should be a separate empty node token

I think we chatted about this in some other issue, but I believe that empty nodes are explicitly not meant to be used for missing material other than gapping, right-node-raising, etc. For example, a missing article due to a grammar error has been used as an example of what ellipsis nodes should not be used for (I forget the issue where this was debated). If we did reconstruct them, I would give them the usual edeprel based on their function (no need for a regular deprel for empty nodes).