UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

How to deal with typos and grammatical errors in the underlying text #330

Open dan-zeman opened 8 years ago

dan-zeman commented 8 years ago

We do not have a common guideline for the preferred way of dealing with typos. As pointed out by @vcvpaiva in https://github.com/UniversalDependencies/UD_Portuguese/issues/5, annotators of some treebanks (that are now converted to UD) were explicitly asked to keep typos intact, but this may not be true everywhere, and at least new, native-UD annotation efforts may benefit from knowing what the preferred approach is. I have outlined some possibilities in a document I attach to this post:

Typos.pdf

spyysalo commented 7 years ago

Typo=Yes didn't make it to v2, and I believe the issue remains open. Moving to later.

vcvpaiva commented 7 years ago

Thanks for the manuscript, @dan-zeman. One more issue on that:

The third example under parataxis (http://universaldependencies.org/pt/dep/parataxis.html) is wrong: "serie" should be "serei". The word "serie" is a NOUN, not the VERB that was intended, so this is a corpus typo that has made its way into the guidelines too. Bad form.

arademaker commented 7 years ago

@dan-zeman I can't find the errors above in the source file parataxis.md. I am still waiting for instructions from @fginter for testing the documentation pages. I also noticed that we now have a yellow line at the top of the PT doc pages saying that they are outdated, right? How do we fix it?

arademaker commented 7 years ago

@dan-zeman I found it. The example cited by @vcvpaiva is actually a sentence from the UD_Portuguese corpus, shown under the section Treebank Statistics (UD_Portuguese). The example seems right regarding the use of parataxis, but it does contain a corpus typo.

dan-zeman commented 7 years ago

Another discussion of typos: https://github.com/UniversalDependencies/docs/issues/393#issuecomment-271370394

dan-zeman commented 5 years ago

There has not been a formal amendment of the UD guidelines that would address typos, but the issue keeps popping up and people are repeatedly referred to this thread and to the draft Typos.pdf that I attached above.

To make them more easily accessible, I have rewritten the recommendations in Markdown (the content is slightly modified and extended, too), uploaded them here and added a link from the guidelines. Annotating typos is optional, so there shouldn't be any conflict with the v2 guidelines. But I believe it is useful if people who want to annotate typos use the same labels.

I am leaving this issue open for a while to see if there are objections or comments.

jnivre commented 5 years ago

Thanks, Dan. This all looks fine to me. One can discuss whether "Typo=Yes" is really a morphological feature, but I can see the value of including it in the FEATS column, so that typos can be filtered out when generating or checking morphological paradigms.
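
A minimal sketch of that kind of filtering, assuming CoNLL-U input in which typos carry Typo=Yes in the FEATS column (the sixth tab-separated field). The function names and the three-token sample sentence are made up for illustration and are not taken from any treebank; the sample only borrows the "serie" for "serei" typo mentioned earlier in this thread.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string such as 'Number=Sing|Typo=Yes' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def tokens_without_typos(conllu_text):
    """Yield (form, lemma, upos, feats) for words not marked Typo=Yes."""
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence breaks) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        feats = parse_feats(cols[5])
        if feats.get("Typo") == "Yes":
            continue
        yield cols[1], cols[2], cols[3], feats


# Hypothetical sample: "serie" is a typo for "serei", so it is marked Typo=Yes.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["1", "Eu", "eu", "PRON", "_", "Case=Nom|Number=Sing|Person=1", "3", "nsubj", "_", "_"],
    ["2", "serie", "ser", "AUX", "_",
     "Mood=Ind|Number=Sing|Person=1|Tense=Fut|Typo=Yes|VerbForm=Fin", "3", "cop", "_", "_"],
    ["3", "feliz", "feliz", "ADJ", "_", "Number=Sing", "0", "root", "_", "_"],
])

if __name__ == "__main__":
    for form, lemma, upos, feats in tokens_without_typos(SAMPLE):
        print(form, lemma, upos, feats)
```

Because only lines marked Typo=Yes are dropped, a paradigm table built from the remaining (form, lemma, feats) triples would not pick up "serie" as a form of "ser".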