Closed rhdunn closed 9 months ago
In the first case I don't think it's quite right to say that the lemma is "mildly". The sentence reads:
And the target hypothesis says it should be "mildly" in standard English. So it's the wrong choice of word, but it is properly an occurrence of 'mild', so I think 'mild' is also the correct lemma (for an arguably incorrectly chosen word).
The middle group has genuine errors, which I will fix. The last one is an issue with how <sic ana=".."> works, i.e. the discrepancy between the notion of target hypothesis and the conllu Typo annotation. Unless there is a good alternative suggestion (ideally an automatable one), I will leave those alone.
With "mild" it is annotated with Typo=Yes and CorrectForm=mildly|XML=<sic ana::"mildly"></sic> so it has been corrected to "mildly". As a result the lemma should be "mildly" to be consistent with the corrected form annotation. Otherwise, it shouldn't be annotated as a typo with a corrected form if the lemma matches the FORM field.
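The consistency rule described in this comment could be checked mechanically. The following is only a rough sketch, not part of the GUM or UD tooling: the function names and the sample row are hypothetical, and the heuristic (flag a token when Typo=Yes and CorrectForm are present but the LEMMA still equals the uncorrected FORM) is just one reading of the rule above.

```python
def parse_kv(field):
    """Parse a CoNLL-U FEATS or MISC field ("A=1|B=2") into a dict."""
    if field == "_":
        return {}
    result = {}
    for item in field.split("|"):
        key, _, value = item.partition("=")  # split on the first "=" only
        result[key] = value
    return result


def lemma_matches_uncorrected_form(line):
    """Return True if a token is marked Typo=Yes with a CorrectForm,
    yet its LEMMA equals the uncorrected FORM (hypothetical heuristic)."""
    cols = line.rstrip("\n").split("\t")
    form, lemma = cols[1], cols[2]
    feats, misc = parse_kv(cols[5]), parse_kv(cols[9])
    correct = misc.get("CorrectForm")
    if feats.get("Typo") != "Yes" or correct is None:
        return False
    return lemma.lower() == form.lower() and correct.lower() != form.lower()


# Hypothetical token row; only FORM, LEMMA, FEATS and MISC matter here.
sample = "5\tmild\tmild\tADJ\tJJ\tDegree=Pos|Typo=Yes\t3\tadvmod\t_\tCorrectForm=mildly"
print(lemma_matches_uncorrected_form(sample))
```

A check like this only reports candidates. Whether a reported token is really inconsistent depends on how Typo and CorrectForm are meant to be interpreted for the treebank.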
For the third group, ideally the missing word should be a separate empty node token (https://universaldependencies.org/format.html#words-tokens-and-empty-nodes) with the ellipsis deprel (https://universaldependencies.org/u/overview/typos.html#missing-word). I don't know how automatable this is, as I'm not familiar with the GUM infrastructure.
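For readers unfamiliar with empty nodes: in CoNLL-U they are distinguished from regular words by a decimal ID such as 8.1, while multiword tokens use a range such as 1-2. The sketch below shows how a script might tell the three ID types apart. The function name is hypothetical and it is not taken from any GUM or UD tool.

```python
import re


def classify_id(token_id):
    """Classify a CoNLL-U ID column value as 'word', 'multiword' or 'empty'."""
    if re.fullmatch(r"[1-9][0-9]*-[1-9][0-9]*", token_id):
        return "multiword"  # range ID, e.g. "1-2"
    if re.fullmatch(r"[0-9]+\.[1-9][0-9]*", token_id):
        return "empty"      # decimal ID, e.g. "8.1" (or "0.1" before the first word)
    if re.fullmatch(r"[1-9][0-9]*", token_id):
        return "word"       # plain integer ID
    raise ValueError(f"not a valid CoNLL-U ID: {token_id!r}")


for example in ("8", "8.1", "1-2"):
    print(example, classify_id(example))
```

Inserting an empty node automatically would additionally require deciding where the missing word belongs and what its enhanced dependency should be, which this sketch does not attempt.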
With "mild" it is annotated with Typo=Yes and CorrectForm=mildly|XML=<sic ana::"mildly"> so it has been corrected to "mildly"
This is the conllu output from the upstream GUM data, yes. But conceptually, all of GUM's error annotations come from the 'target hypothesis' paradigm I mentioned above, which works a bit differently. It's been mapped onto 'Typo' and 'CorrectForm' because that seemed like the best match for those annotations in the UD ecosystem, but they are not quite the same, and this is one of the cases where the mismatch shows up.
The way I see it, this isn't really a typo so much as non-standard language or incorrect word choice. Consider what might happen if a non-native speaker writes "of my opinion", instead of "in my opinion". In GUM's native "sic" layer, you would get the correction of -> in. I would even go so far as to say it's not wrong to say the CorrectForm is "in", but I wouldn't want to say that the lemma of the token "of" is "in" - does that make sense?
For the third group, ideally the missing word should be a separate empty node token
I think we chatted about this in some other issue, but I believe that empty nodes are explicitly not meant to be used for missing material other than gapping/right-node-raising etc. For example, a missing article due to a grammar error has been used as an example of what Ellipsis nodes should not be used for (I forget the issue where this was debated). If we did reconstruct them, I would give them the usual edeprel based on their function (no need for a regular deprel for empty nodes)
The following lemma is incorrectly lemmatized:
The following lemmas should be lower case:
The following lemmas are flagged because the CorrectForm contains multiple words:
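The lower-case and multi-word categories reported above could be detected with a small script along the following lines. This is only a sketch under assumptions: the function name, the flag labels, the sample values, and the choice to exempt PROPN from the lower-case check are all hypothetical and are not taken from the validator that produced this report.

```python
def flag_lemma(lemma, upos, misc):
    """Return a list of flags for one token (hypothetical checks)."""
    flags = []
    # Assumption: proper nouns may legitimately keep upper-case lemmas.
    if upos != "PROPN" and lemma != lemma.lower():
        flags.append("lemma-not-lowercase")
    # Read CorrectForm from the MISC column, if present.
    correct = None
    if misc != "_":
        for item in misc.split("|"):
            key, _, value = item.partition("=")
            if key == "CorrectForm":
                correct = value
    # A CorrectForm containing a space spans more than one word,
    # so no single lemma can correspond to it.
    if correct is not None and " " in correct:
        flags.append("multiword-correctform")
    return flags


# Hypothetical token values for illustration only.
print(flag_lemma("Alot", "ADV", "CorrectForm=a lot"))
```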