UniversalDependencies / tools

Various utilities for processing the data.
GNU General Public License v2.0

Validation requirements for a treebank to be released in 2.5 #54

Closed dan-zeman closed 5 years ago

dan-zeman commented 5 years ago

Following the discussion at the end of UDW 2019 in Paris, I tried to put together a proposal for the validation vs. release policy for the upcoming releases. The goal is to be able to add new tests and find more guideline violations without having to kick out older treebanks that do not pass the stricter tests (some of them are no longer maintained and there is no one who could fix the bugs soon; others have too many bugs, and fixing them will take a lot of time).

The full proposal is currently available here and comments are welcome. In a nutshell: if a treebank was valid and released in UD 2.3, it can stay in the upcoming releases without passing tests that were added after UD 2.3. Newer treebanks must pass all tests that exist when the treebank is released for the first time.
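The rule in a nutshell can be sketched roughly as follows. This is only an illustration of the policy as summarized above, not the validator's actual code: the class and function names are hypothetical, and it assumes that every treebank first released in UD 2.3 or earlier was valid at that time.

```python
from dataclasses import dataclass

# UD releases in chronological order (only the relevant ones are listed).
RELEASES = ["2.0", "2.1", "2.2", "2.3", "2.4", "2.5"]
LEGACY_CUTOFF = "2.3"  # tests added after this release are waived for legacy treebanks


@dataclass(frozen=True)
class ValidationTest:  # hypothetical name
    name: str
    added_in: str  # release in which the test was introduced


@dataclass(frozen=True)
class Treebank:  # hypothetical name
    name: str
    first_release: str  # release in which the treebank first appeared


def required_tests(treebank, tests):
    """Return the tests the treebank must pass to be released.

    Legacy treebanks (first released in 2.3 or earlier) only need the tests
    that existed in 2.3; newer treebanks need every test that existed when
    they were first released.
    """
    threshold = max(RELEASES.index(LEGACY_CUTOFF),
                    RELEASES.index(treebank.first_release))
    return [t for t in tests if RELEASES.index(t.added_in) <= threshold]
```

For example, a treebank first released in 2.4 would be held to tests introduced up to and including 2.4, but not to tests introduced for 2.5.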

I have modified the online validation page to reflect the proposal and identify treebanks with legacy status. There are 6 old treebanks that contain errors which were not tolerated even in UD 2.3 (that is, these errors were introduced in UD 2.4 and escaped the attention of the release team). Errors of this type must be fixed before UD 2.5. The treebanks are Croatian-SET (@nljubesi), English-EWT (@manning @sebschu), French-Spoken (@sylvainkahane), Norwegian-Bokmaal, Norwegian-NynorskLIA (@liljao), Serbian-SET (@tsamardzic).

4 treebanks were released in UD 2.4 for the first time but contained errors that were already checked at that time. Hence I think they are not really legacy treebanks (the only reason why they made it into the release was that we ignored some error messages in order to save older treebanks). (Disclaimer: I’m actually looking at the current report, so it is possible that the errors were not there at release time and were introduced later.) The treebanks are Classical_Chinese-Kyoto (@KoichiYasuoka), German-HDT (@akoehn @EmanuelUHH), German-LIT (@a-salomoni), Old_Russian-RNC (@olesar).

Finally, issues are also reported for 4 new treebanks: Bhojpuri-BHTB (@shashwatup9k), Chinese-GSDSimp (@qipeng), Skolt_Sami-Giellagas (@rueter), Swiss_German-UZH (@noe-eva).

What do people think about this?

dan-zeman commented 5 years ago

P.S. If you believe that a validation rule is too strict (i.e., requires something that does not follow from the guidelines), please raise an issue in the issue tracker of the docs repository.

jnivre commented 5 years ago

Thanks for picking up this thread, Dan. I think there is something wrong with the link to the full proposal. I just get an empty page.

KoichiYasuoka commented 5 years ago

Thank you, Dan. I appended my opinion at advmod but not UPOS=ADV. I think the validation rule for advmod is too strict.

dan-zeman commented 5 years ago

> Thanks for picking up this thread, Dan. I think there is something wrong with the link to the full proposal. I just get an empty page.

Oops. Thanks for the heads-up. It looks like the name I picked for the page was already taken by the old (and obsolete) validation machinery, which generates an empty page each time a corpus is modified; that is what happened in https://github.com/UniversalDependencies/docs/commit/76ad6d5c06176f8532577b865e280fbde47d9432 :-) (@fginter)

I have now renamed the page to validation-rules.

dan-zeman commented 5 years ago

> I appended my opinion at advmod but not UPOS=ADV.

Thanks, Koichi. See my answer there.

KoichiYasuoka commented 5 years ago

On the negation of aux, from the old issue for UD 2.4:

> One of the possible exceptions is negation. So you can actually attach the first 不 directly to the auxiliary, and the validator should accept it if 不 has the feature Polarity=Neg.

But now the validator for UD 2.5 does not accept the negation of aux. We have already added Polarity=Neg to all 不, so what should we do about the new validator?

dan-zeman commented 5 years ago

See my answer there. It should help if the negative particle is tagged PART.
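Put together with the earlier quote, one possible reading of the exception is that a negative word attached to an auxiliary is tolerated when it is tagged PART and carries Polarity=Neg. The sketch below illustrates that reading only; it is an assumption, not the validator's actual check, and the function names are hypothetical. The inputs are the UPOS and FEATS columns of a CoNLL-U token line.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Polarity=Neg' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def is_tolerated_negation(upos, feats):
    """Assumed exception: a PART with Polarity=Neg may depend on an auxiliary."""
    return upos == "PART" and parse_feats(feats).get("Polarity") == "Neg"
```

Under this reading, a 不 tagged ADV would still be flagged even with Polarity=Neg, whereas retagging it as PART would satisfy the check.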