UniversalDependencies / UD_English-GUM

Typo=Yes should not apply just because of an extra space #39

Open nschneid opened 2 years ago

nschneid commented 2 years ago

I believe 4 of these 5 instances are incorrect because the extra space is already accounted for by goeswith.

amir-zeldes commented 2 years ago

Is that a universal guideline? Personally I would consider extra spaces to be a kind of Typo, and it also explains why the tokens around a goeswith are mangled. If I were searching for all typos in a corpus I think I'd want to find these too. Is there anything that speaks against having goeswith as well as Typo for these tokens?

nschneid commented 2 years ago

See the guidelines on typos. I think the function of Typo=Yes is to signal that some part of the tokenized wordform contains incorrect, incorrectly ordered, or missing non-space characters. In English we have things like "any were" for "anywhere", so it applies to the second "word" but not the first.

nschneid commented 2 years ago

"If I were searching for all typos in a corpus I think I'd want to find these too."

Missing spaces are also typos in a broader sense but not flagged with Typo=Yes (Which word would that be on? Both? What if it is a missing space before or after punctuation? etc.). The convention is to use CorrectSpaceAfter=Yes|SpaceAfter=No.
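
To make the spacing convention concrete, here is a minimal Python sketch. It assumes tokens are given as (form, MISC string) pairs; the helper names `parse_misc` and `detokenize` and the "Hello,world" tokens are made up for illustration and do not come from the guidelines or the corpus.

```python
def parse_misc(misc):
    """Turn a MISC string such as 'CorrectSpaceAfter=Yes|SpaceAfter=No' into a dict."""
    if misc in ("_", ""):
        return {}
    return dict(item.split("=", 1) for item in misc.split("|"))


def detokenize(tokens, corrected=False):
    """Rebuild text from (form, misc) pairs.

    corrected=False reproduces the text as written;
    corrected=True restores spaces flagged with CorrectSpaceAfter=Yes.
    """
    pieces = []
    for form, misc in tokens:
        feats = parse_misc(misc)
        pieces.append(form)
        no_space = feats.get("SpaceAfter") == "No"
        if corrected and feats.get("CorrectSpaceAfter") == "Yes":
            no_space = False  # the writer omitted a space that belongs here
        if not no_space:
            pieces.append(" ")
    return "".join(pieces).rstrip()


# Invented example: a missing space after a comma.
tokens = [
    ("Hello", "SpaceAfter=No"),
    (",", "CorrectSpaceAfter=Yes|SpaceAfter=No"),
    ("world", "_"),
]
as_written = detokenize(tokens)
as_corrected = detokenize(tokens, corrected=True)
```

The point of the sketch is that the missing space is recoverable from MISC alone, without putting Typo=Yes on either neighboring token.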

amir-zeldes commented 2 years ago

OK, it's not my intuition but I can live with it either way.

amir-zeldes commented 2 years ago

So, looking at this more closely, would you say in this example:

The first token should still carry Typo=Yes, no? Or do we assume that "goeswith" covers the statement "this is broken"? If not, then we can't just tell the validator to disallow Typo on goeswith.

nschneid commented 2 years ago

I would put CorrectForm=s|Typo=Yes on the second token because its form is "is" rather than "s". In cases where the two forms connected by goeswith concatenate to the correctly spelled word, no Typo or CorrectForm feature.
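
A rough Python sketch of that rule, covering only the two-part case where the first part is spelled correctly. The function name `goeswith_typo_feats` is hypothetical, and the returned dicts lump Typo and CorrectForm together for brevity; in an actual CoNLL-U file Typo goes in FEATS and CorrectForm in MISC.

```python
def goeswith_typo_feats(parts, intended):
    """Return one annotation dict per part of a word split by a stray space.

    parts    -- the forms as tokenized, e.g. ["it", "is"]
    intended -- the word the writer meant, e.g. "its"
    """
    if "".join(parts) == intended:
        # Plain concatenation already gives the right word: goeswith is enough.
        return [{} for _ in parts]
    if len(parts) != 2:
        raise ValueError("this sketch only handles two-part splits")
    first, second = parts
    if not intended.startswith(first):
        raise ValueError("first part is itself misspelled; not covered here")
    # The remainder of the intended word is what the second part should have been.
    return [{}, {"Typo": "Yes", "CorrectForm": intended[len(first):]}]


its_case = goeswith_typo_feats(["it", "is"], "its")
anywhere_case = goeswith_typo_feats(["any", "were"], "anywhere")
clean_case = goeswith_typo_feats(["some", "thing"], "something")
```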

amir-zeldes commented 2 years ago

I see what you're saying, but I thought in goeswith, the second token basically doesn't exist in terms of features, so I would have expected token1 to carry the Typo: goeswith is saying "lose the space", and the first token then carries everything the merged token has to say, including "'itis' is a typo for 'its'"

nschneid commented 2 years ago

What the docs say:

The head should also bear the part-of-speech tag and morphological annotation of the entire word. It is not necessary to add the Typo feature and CorrectForm in MISC, unless there is a “normal” typo too, i.e. if simple concatenation of the parts does not yield the correct form. Example:

I guess it leaves ambiguous where the Typo/CorrectForm should be if there is a normal typo too.

Maybe I should draft a document spelling out the formal constraints at play and pseudocode for producing a canonical representation without typos/misuse of spaces/repair.

amir-zeldes commented 2 years ago

Forgot to answer: yes, that would be great! Upon thinking about it, I think I'd prefer for Typo to be on the first token in goeswith, since the second/third/subsequent parts of a broken token don't really have an 'expected' spelling IMO - it's the merged token which has a standard spelling, and that is what is being deviated from (plus it's easier to say anything with deprel goeswith can't have Typo, or any other meaningful FEAT really)
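
For what it's worth, the check this would enable is simple. Below is a Python sketch of it, not the actual UD validator: `row` and `typo_on_goeswith` are hypothetical helpers, and the three-token "it is cover" fragment is invented to show both annotation styles discussed in this thread.

```python
def row(tok_id, form, feats, head, deprel):
    """Build one 10-column CoNLL-U line; columns the check ignores are left as '_'."""
    return "\t".join([tok_id, form, "_", "_", "_", feats, head, deprel, "_", "_"])


def typo_on_goeswith(sentence):
    """Return IDs of tokens that have deprel 'goeswith' and Typo=Yes in FEATS."""
    flagged = []
    for line in sentence.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, feats, deprel = cols[0], cols[5], cols[7]
        if deprel == "goeswith" and "Typo=Yes" in feats.split("|"):
            flagged.append(tok_id)
    return flagged


# Typo on the first token (the head of goeswith): passes the check.
first_token_style = "\n".join([
    row("1", "it", "Typo=Yes", "3", "nmod:poss"),
    row("2", "is", "_", "1", "goeswith"),
    row("3", "cover", "_", "0", "root"),
])
# Typo on the second token: would be rejected under this rule.
second_token_style = "\n".join([
    row("1", "it", "_", "3", "nmod:poss"),
    row("2", "is", "Typo=Yes", "1", "goeswith"),
    row("3", "cover", "_", "0", "root"),
])
ok_flags = typo_on_goeswith(first_token_style)
bad_flags = typo_on_goeswith(second_token_style)
```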