UniversalDependencies / tools

Various utilities for processing the data.
GNU General Public License v2.0

validate.py should be robust #14

Closed martinpopel closed 7 years ago

martinpopel commented 7 years ago

It should not fail with an exception on invalid data; instead, it should report which lines are invalid.

This commit prevents exceptions when a CoNLL-U file contains, e.g., just one column instead of 10. It is not very elegant, I admit. Maybe it would be better to just jump to the next tree after reporting that there are fewer than 10 columns.
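A minimal sketch of the idea (not the actual validate.py code; the function name and return shape are invented for illustration): check the column count of each token line and report the offending line number instead of letting an exception propagate.

```python
# Hypothetical illustration of the proposed check, not code from validate.py.
COLCOUNT = 10  # CoNLL-U token lines must have exactly 10 tab-separated columns

def check_columns(lines):
    """Return a list of (line_number, message) for malformed token lines."""
    errors = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != COLCOUNT:
            errors.append((lineno, f"expected {COLCOUNT} columns, got {len(cols)}"))
    return errors
```

A caller could then print each error with its line number and decide whether to skip to the next tree or abort.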

fginter commented 7 years ago

Merged. But to set things straight: the validator was not failing with an exception; it caught and reported the exception. :) I think a solution would be to move the try ... except block https://github.com/UniversalDependencies/tools/blob/master/validate.py#L706 tighter around the sentence. That would pretty much achieve what you suggest, i.e. fail and move on to the next sentence. Then again, if there is a wrong number of columns, it's fine to scream and quit: the treebank is broken, and whoever is working on it will have to fix it anyway. I think it's much less important when and how the validator reports, provided it doesn't pass incorrect stuff through.

martinpopel commented 7 years ago

I fully agree.

fginter commented 7 years ago

@martinpopel I am a bit scared to rock the boat this late in the game, but if you have some free time on your hands, it would be good to have a third pair of eyes look at the validator's test suite and make sure nothing is forgotten. In the past few releases, there has always been one treebank that managed to sneak wrong data through the validator. I'd hate to have that happen again. :slightly_frowning_face: Thanks! :+1:

martinpopel commented 7 years ago

@fginter: I am not aware of any missing tests in validate.py. I have some ideas for additional validation, e.g. checking that SYM and PUNCT tokens are not alphabetical. I plan to devote today's free time to fixing the treebanks that FAIL validation (it seems their maintainers don't plan to fix them today before the data freeze).
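The suggested SYM/PUNCT check could look roughly like this (a hypothetical sketch; the function name is invented and the heuristic is only the one stated in the comment): a token tagged SYM or PUNCT normally should not contain alphabetic characters, so flagging such tokens catches likely tagging errors.

```python
# Hypothetical sketch of the "SYM and PUNCT are not alphabetical" idea.
def suspicious_sym_punct(form, upos):
    """Flag a SYM/PUNCT token whose form contains alphabetic characters."""
    return upos in ("SYM", "PUNCT") and any(ch.isalpha() for ch in form)
```

This would be a warning-level heuristic rather than a hard error, since rare legitimate exceptions exist.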