Open dan-zeman opened 8 years ago
Typo=Yes
didn't make it to v2, and I believe the issue remains open. Moving to later.
Thanks for the manuscript, @dan-zeman. One more issue on that:
The third example in parataxis http://universaldependencies.org/pt/dep/parataxis.html is wrong: it has "serie" where "serei" was intended. The word "serie" is a NOUN, not the VERB that was intended, so this is a corpus typo that shows up in the guidelines too. Bad form.
@dan-zeman I can't find the errors above in the source file parataxid.md. I am still waiting for instructions from @fginter on testing the documentation pages. I also noticed that we now have a yellow line at the top of the PT doc pages saying that they are outdated, right? How do we fix it?
@dan-zeman I found it. The example cited by @vcvpaiva is actually a sentence from the UD_Portuguese corpus, shown under the section Treebank Statistics (UD_Portuguese). The example seems right regarding the use of parataxis, but it does contain a corpus typo.
Another discussion of typos: https://github.com/UniversalDependencies/docs/issues/393#issuecomment-271370394
There has not been a formal amendment of the UD guidelines that would address typos, but the issue keeps popping up and people are repeatedly referred to this thread and to the draft Typos.pdf that I attached above.
To make it more easily accessible, I have rewritten the recommendations in Markdown (the content is slightly modified and extended too), uploaded them here and added a link from the guidelines. Annotating typos is optional, so there shouldn't be any conflict with the v2 guidelines. But I believe it is useful if people who want to annotate typos use the same labels.
I am leaving this issue open for a while to see if there are objections or comments.
Thanks, Dan. This all looks fine to me. One can discuss whether "Typo=Yes" is really a morphological feature, but I can see the value of including it in the FEATS column, so that typos can be filtered out when generating or checking morphological paradigms.
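To illustrate the filtering use case, here is a minimal Python sketch that skips tokens marked `Typo=Yes` in the FEATS column of a CoNLL-U file. It is only an illustration under stated assumptions: the function names `parse_feats` and `tokens_without_typos` are hypothetical, the three sample rows are invented for the example (they are not taken from UD_Portuguese), and the parsing is hand-rolled rather than any official UD tool.

```python
def parse_feats(feats):
    """Turn a FEATS string such as 'Number=Sing|Typo=Yes' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def tokens_without_typos(conllu_text):
    """Yield (FORM, LEMMA, UPOS) for every token not marked Typo=Yes."""
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence breaks) and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A CoNLL-U token line has exactly 10 tab-separated columns.
        if len(cols) != 10:
            continue
        # FEATS is the sixth column (index 5).
        if parse_feats(cols[5]).get("Typo") == "Yes":
            continue
        yield cols[1], cols[2], cols[3]


# Invented toy rows, for illustration only.
SAMPLE = "\n".join([
    "\t".join(["1", "Eu", "eu", "PRON", "_", "Number=Sing|Person=1", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "serie", "ser", "AUX", "_", "Typo=Yes", "3", "cop", "_", "_"]),
    "\t".join(["3", "feliz", "feliz", "ADJ", "_", "Number=Sing", "0", "root", "_", "_"]),
])

clean = list(tokens_without_typos(SAMPLE))
print(clean)
```

With the toy rows above, the token "serie" is dropped and the other two are kept, so a paradigm builder fed from `tokens_without_typos` would never see the misspelled form.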
We do not have a common guideline for the preferred way of dealing with typos. As pointed out by @vcvpaiva in https://github.com/UniversalDependencies/UD_Portuguese/issues/5, annotators of some treebanks (that are now converted to UD) were explicitly asked to keep typos intact, but this may not be true everywhere, and at least new, native-UD annotation efforts may benefit from knowing what the preferred approach is. I have outlined some possibilities in a document I attach to this post:
Typos.pdf