UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Should CorrectFEATURE annotations be used for typos in English treebanks? #1000

Closed rhdunn closed 6 months ago

rhdunn commented 12 months ago

While building an English lemma validator and using it to check the UD English treebanks, I've identified two cases where the LEMMA field (and sometimes the other fields such as XPOS) is based on the CorrectForm annotation instead of the FORM field:

  1. for standard/semi-standard abbreviations or the author's expressive style ('ve, u, etc.);
  2. for typos arising from spelling mistakes or from using a homophone of the intended word.

In both of these cases, the English treebanks are basing the remaining fields on the corrected form instead of the FORM field itself. For standard abbreviations, this makes sense. For typos, less so.

The UD guidelines on Wrong Morphology or Syntax state:

Suggestion: Keep the word as it was in the source text. Add morphological features that correspond to the actual form, not to the hypothetical correct form: English is is Number=Sing, and cars is Number=Plur.

MISC supports CorrectFEATURE attributes, analogous to CorrectForm and CorrectSpaceAfter, for annotating the corrected morphology. The documentation states:

Shows the value of a morphological feature that would correspond to the correct form if a typo in the underlying text is fixed (while the actual value of the feature in FEATS should correspond to the actual form that appears in the text, as described in the guidelines for typos).

It would be helpful if typos followed these guidelines and suggestions, as NLP tools trained on these treebanks would be better able to derive rules for lemmatization, part-of-speech tagging, etc. It would also allow tools that provide corrections to better check and correct the associated morphology.
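As a sketch of the annotation layout described above (not code from any UD tool), the Correct* attributes can be read off the MISC column of a CoNLL-U token line. The token line here is invented, following the "kats" -> "cats" typo example from the UD typo guidelines:

```python
def parse_misc(misc):
    """Parse the MISC column into a dict; '_' means no attributes."""
    if misc == "_":
        return {}
    return dict(attr.split("=", 1) for attr in misc.split("|"))

def corrections(misc):
    """Keep only the Correct* attributes (CorrectForm, CorrectNumber, ...)."""
    return {k: v for k, v in parse_misc(misc).items() if k.startswith("Correct")}

# FEATS carries Typo=Yes plus the features of the actual form; MISC carries
# the corrected form, per the guidelines quoted above. Invented token line:
line = "4\tkats\tcat\tNOUN\tNNS\tNumber=Plur|Typo=Yes\t2\tobj\t_\tCorrectForm=cats"
misc_column = line.split("\t")[9]
print(corrections(misc_column))  # {'CorrectForm': 'cats'}
```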

As https://tables.grew.fr/?data=ud_feats/MISC&cols=^Correct shows, several languages make use of these annotations. EWT even has 3 instances of CorrectNumber correcting verbs and nouns, so these are at least partially (if inconsistently) attested in the English treebanks.

References

  1. https://universaldependencies.org/u/overview/typos.html#wrong-morphology-or-syntax
  2. https://universaldependencies.org/misc.html#correctfeature
  3. https://github.com/rhdunn/conllu-en-validator -- my validation tool. Note: the lemma validation currently has several false positives, notably for some verb patterns and for capitalised adjectives
  4. https://tables.grew.fr/?data=ud_feats/MISC&cols=^Correct

    Treebank Issues

  5. https://github.com/UniversalDependencies/UD_English-PUD/issues/41
  6. https://github.com/UniversalDependencies/UD_English-GUMReddit/issues/17

Note: the list of treebank-specific issues is currently incomplete as I haven't finished processing the results of the lemma analysis and raising issues in the other treebanks.

amir-zeldes commented 12 months ago

I don't object to it in principle, but it would require some labor to annotate this and integrate into pipelines, so it's not an easy fix.

rhdunn commented 12 months ago

@dan-zeman I'm replying here to your comment in the treebank specific issue.

I don't understand why we would need CorrectLemma at all. If there is a typo in the FORM, then Typo=Yes should be in FEATS, CorrectForm should be in MISC, but the correct lemma should be in the LEMMA column. No need to show a wrong lemma there.

With abbreviations (such as 've, u, etc.), those form a small closed set of cases that are equivalent to their expanded forms and don't clash with other words. As such, a lemmatizer can treat them as exceptions, expand the word, then lemmatize the expansion.
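The closed-set approach can be sketched as follows. The expansion table and the toy lemmatizer are illustrative stand-ins, not taken from any treebank or tool:

```python
# Small closed set of abbreviation exceptions, expanded before lemmatizing.
ABBREVIATIONS = {
    "'ve": "have",
    "u": "you",
    "'re": "are",
}

def toy_lemmatize(word):
    # Stand-in for a real lemmatizer: identity except one irregular form.
    return {"are": "be"}.get(word, word)

def lemma_for(form):
    # Expand a known abbreviation to its full form, then lemmatize that.
    expanded = ABBREVIATIONS.get(form.lower(), form)
    return toy_lemmatize(expanded)

print(lemma_for("'ve"))  # have
print(lemma_for("'re"))  # be
```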

Typos, by contrast, are an open set and are much messier than abbreviations. An NLP processor can do one of three things with them:

  1. ignore them -- in which case the LEMMA, UPOS, XPOS, FEATS, DEPS, and MISC will all reflect the surface text, not the corrected text;
  2. correct them after tokenization but before the other steps -- in this case they will match what is currently in the English treebanks;
  3. run NLP first and use information from the results to do the corrections, so that the initial analysis is in the main columns and the corrected version is in the CorrectFEATURE annotations.

Note: when the corrected feature value is the same as the actual one, the corrected feature should not be present. It is only when the two differ (e.g. the Number change in the EWT treebank) that the corrected feature should be set.
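That convention can be sketched in a few lines; the helper name is invented for illustration:

```python
def correct_feature(feature, actual, corrected):
    """Return a MISC attribute like 'CorrectNumber=Plur', or None when the
    corrected value matches the actual one and nothing needs recording."""
    if corrected == actual:
        return None
    return f"Correct{feature}={corrected}"

print(correct_feature("Number", "Sing", "Plur"))  # CorrectNumber=Plur
print(correct_feature("Number", "Plur", "Plur"))  # None
```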

When training an NLP system, only options 1 and 3 are available. Therefore, any inconsistencies caused by the LEMMA, XPOS, etc. holding corrected values will make the resulting model unpredictable.

For example, I've seen trained NLP systems split "its" as "it's" in some cases, likely because they learned the wrong behaviour from the corrected features and got confused on valid PRP instances. This is especially true when such corrections are limited to one or two cases, so the NLP system may learn the correction in broader contexts than it should.

Note: My motivation for writing my validation tool was to be able to identify and correct these and other types of issues.

Examples

These are some examples of where CorrectLemma and other similar features would be useful. It is not an exhaustive list as I have not done a full analysis of the lemma validation on the EWT treebank yet.

Example 1: answers-20090205181308AAZghOH_ans-0002 in EWT

This sentence has various typos (afnd -> and).

These are straightforward typos for the corresponding words. In order to correct them, a lemmatizer would need a large, open-ended mapping from forms to corrected forms. This case is doable, but not all lemmatizers would implement support for error correction the way they would for expanding common abbreviations.

Note: I could see these LEMMA fields remaining as the corrected versions, but they should then have CorrectLemma=and, etc., since these are non-standard spellings, in line with the Wrong Morphology or Syntax suggestion of using the actual form and not the hypothetical correct form.

Example 2: email-enronsent05_01-0009 in EWT

This has a typo replacing a word with a different word (who -> how).

While these can be handled in the same way as Example 1, they create ambiguity about what the lemma is given the FORM, XPOS, and FEATS information. As such, a lemmatizer trained on this could start generating "how" as the lemma for "who" in places where there is no typo.

Here, CorrectLemma=how should be used to avoid errors from NLP tools/models that only work on the base FORM text.

Example 3: reviews-097507-0002 in EWT

This has a typo replacing a word with a different word in a different part of speech class (know+VERB -> no+ADV).

I can see how this case could complicate a basic part-of-speech model trained on the uncorrected version. It means the part-of-speech tagger needs to be aware of the homophone misuse in order to assign the correct ADV/RB part of speech, and the lemmatizer needs to know about the ADV exception for the typo.

Example 4: weblog-blogspot.com_tacitusproject_20040715092419_ENG_20040715_092419-0010 in EWT

This has a typo replacing a word with a different word in a different part of speech class (hear+VERB -> here+ADV).

This is from "So hear we are". This correction needs more complex analysis than the pattern/sequence of valid part of speech tags that Example 3 needed. Specifically, the following is perfectly valid and shouldn't be corrected:

So hear us do this! Let's sing now.

An NLP model may learn that "so" followed by "hear" makes "hear" an ADV. It may also/alternatively learn that "hear" sometimes has the lemma "here" and would start using that lemma in the wrong contexts.

nschneid commented 12 months ago

We are bound by the Typos policy, which does NOT suggest CorrectLemma. If you look at the "kats" example, it has "cat" as the lemma, Typo=Yes, and CorrectForm=cats.

suggestion of using the form and not the hypothetical correct form

I think this "Suggestion" is simply saying, don't alter the surface form of the input, and if that surface form uses the wrong inflection of the word, choose morphological features to reflect that incorrect inflection in FEATS (with corrected inflectional features in MISC). So if somebody wrote "cat" or "kat" when they meant "cats", it would ideally be Number=Sing and CorrectNumber=Plur. I don't think this passage has any bearing on lemmas.

That policy about the inflectional features seems to presume that the surface form can be decoded as an actual word with the same UPOS as the word that would be grammatically correct in the context. If "who" is substituted for "how", I would probably just annotate it like "how" except for indicating it is a misspelling. (I.e., it is probably just a typo, not a morphological error.)

If you are worried about throwing off lemmatizers or taggers that might be trained on the data, I think any such training would benefit from taking into account whether the word is marked as a typo. Yes, "who" -> "how" is going to be very rare as a correction, but a sophisticated model might take context into account. In any case, building the lemmatizer/tagger is someone else's problem. :)
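The point about training taking Typo=Yes into account can be sketched as a filter over training pairs. The token tuples (form, feats, lemma) and their values are invented for illustration:

```python
def training_pairs(tokens, skip_typos=True):
    """Yield (form, lemma) training pairs, optionally skipping tokens whose
    FEATS carry Typo=Yes so typo corrections don't poison the lemmatizer."""
    for form, feats, lemma in tokens:
        if skip_typos and "Typo=Yes" in feats.split("|"):
            continue
        yield form, lemma

tokens = [
    ("who", "PronType=Rel|Typo=Yes", "how"),  # "who" typed where "how" was meant
    ("cats", "Number=Plur", "cat"),
]
print(list(training_pairs(tokens)))  # [('cats', 'cat')]
```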

rhdunn commented 12 months ago

Thanks! That makes sense. I'll revert to using the correct form in my lemma validator :).

dan-zeman commented 11 months ago

The Wrong Morphology or Syntax section is not about lemmatization. It is about cases where you have an existing form of the same lemma as the correct form would have, but the form actually present is not in the required case/number/whatever. And even in this realm the section is somewhat vague, admitting that each case may be different. The crucial point is that it may not always be easy to guess what the intended correct sentence is. If you are sure about it and are willing to analyze the word as how while the actual string is who, then you should simply annotate it as how with Typo=Yes, FORM = who, CorrectForm=how, and LEMMA = how.

It is up to the person who uses the data for training to decide how best to use it. Perhaps the model will perform better if sentences with typos are skipped, or if corrected forms are used for parts of the training. Ideally the model should also learn to predict the lemma how in cases where how was intended but who was typed. People may be searching text for patterns by lemmata, and then this would help. But of course typos are not frequent, and learning the proper context for who --> how is very difficult.

rhdunn commented 11 months ago

Yeah, that was me misreading that section. I've since updated my lemma validator to use the CorrectForm annotation if present.
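The updated validation rule can be sketched as follows; this is a hypothetical helper illustrating the idea, not the actual conllu-en-validator code:

```python
def form_to_validate(form, misc):
    """Return the form to validate the lemma against: CorrectForm from the
    MISC column when present, otherwise the raw FORM."""
    attrs = dict(a.split("=", 1) for a in misc.split("|")) if misc != "_" else {}
    return attrs.get("CorrectForm", form)

print(form_to_validate("who", "CorrectForm=how"))  # how
print(form_to_validate("cats", "_"))               # cats
```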

nschneid commented 6 months ago

EWT now has some inflectional errors annotated this way, e.g. 32 instances of CorrectNumber (27 VERB, 3 AUX, 2 NOUN). Typically, the word is in its bare form when it should have an -s suffix.
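A tally like the one above can be produced by scanning the MISC column of the treebank. This is a sketch over invented sample lines; a real run would read the EWT .conllu files instead:

```python
from collections import Counter

def count_corrections(conllu_lines):
    """Count Correct* MISC attributes across CoNLL-U input lines."""
    counts = Counter()
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or cols[9] == "_":
            continue  # not a token line, or no MISC attributes
        for attr in cols[9].split("|"):
            key = attr.split("=", 1)[0]
            if key.startswith("Correct"):
                counts[key] += 1
    return counts

sample = [
    "# text = toy example",
    "1\thappen\thappen\tVERB\tVBP\tTypo=Yes\t0\troot\t_\tCorrectNumber=Sing",
    "",
]
print(count_corrections(sample))  # Counter({'CorrectNumber': 1})
```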

GUM may not implement it because in the GUM pipeline, features are autogenerated from xpos.