UniversalDependencies / tools

Various utilities for processing the data.
GNU General Public License v2.0

Report sent_id in errors #30

Closed msklvsk closed 5 years ago

msklvsk commented 5 years ago

[Tree number 4093 on line 86571]: Mismatch between the text attribute and the FORM field.

If you cat *.conllu and pipe the result to validate.py to check for duplicate ids and the like, line numbers stop being informative once the script gets past the first file. The # sent_id comments would reliably identify the erroneous sentence.
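
The request above can be illustrated with a minimal sketch. It is not the validator's code: the function name report_duplicate_sent_ids and the message wording are hypothetical, and the only thing assumed about the input is the standard CoNLL-U comment form "# sent_id = ...". The point is that a message carrying the sentence id stays useful even when the line number refers to a concatenated stream.

```python
import re

# CoNLL-U sentence-level comment, e.g. "# sent_id = 142p".
SENT_ID_RE = re.compile(r"^#\s*sent_id\s*=\s*(\S+)")


def report_duplicate_sent_ids(lines):
    """Return one message per repeated sent_id (hypothetical helper).

    `lines` is any iterable of text lines, for example the concatenation
    of several .conllu files read from standard input.
    """
    first_seen = {}  # sent_id -> line number of its first occurrence
    messages = []
    for lineno, line in enumerate(lines, start=1):
        match = SENT_ID_RE.match(line)
        if match is None:
            continue
        sent_id = match.group(1)
        if sent_id in first_seen:
            messages.append(
                f"[Line {lineno} Sent {sent_id}]: duplicate sent_id, "
                f"first seen on line {first_seen[sent_id]}"
            )
        else:
            first_seen[sent_id] = lineno
    return messages
```

Passing sys.stdin to this function would cover the cat *.conllu use case; the line number is then relative to the whole stream, while the sentence id still points at the right sentence.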

dan-zeman commented 5 years ago

The new validator (for Python 3) reports both the line number and the sentence id:

./validate.sh --lang uk --max-err=10 UD_Ukrainian-IU/uk_iu-ud-dev.conllu
[Line 2789 Sent 142p]: Punctuation must not cause non-projectivity of nodes [31]
[Line 3867 Sent 2syn]: 'cc' not expected to have children ({'punct'})
[Line 4818 Sent 1l2x]: 'cop' not expected to have children ({'advmod'})
[Line 5097 Sent 1l8v]: 'cop' not expected to have children ({'advmod'})
[Line 5722 Sent 1lok]: 'advmod' should be 'ADV' but it is 'DET'
[Line 6250 Sent 1m1g]: 'aux' not expected to have children ({'advmod'})
[Line 7292 Sent 1wwz]: 'cop' not expected to have children ({'advmod'})
[Line 9331 Sent 1yg5]: 'cop' not expected to have children ({'advmod'})
[Line 9332 Sent 1yg5]: 'advmod' should be 'ADV' but it is 'DET'
...suppressing further errors regarding Syntax
*** FAILED *** with 17 errors
Syntax errors: 17
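
Since each message in the output above carries both a line number and a sentence id, the errors can be regrouped per sentence. The following is a rough sketch that assumes only the "[Line N Sent ID]: message" shape shown in this output; the function name group_errors_by_sentence is hypothetical and not part of the validator.

```python
import re
from collections import defaultdict

# Matches lines such as "[Line 2789 Sent 142p]: some message".
ERROR_RE = re.compile(r"^\[Line (\d+) Sent ([^\]]+)\]: (.*)$")


def group_errors_by_sentence(output_lines):
    """Map each sentence id to a list of (line_number, message) pairs.

    Lines that do not match the pattern (summaries, the
    "...suppressing further errors" notice) are skipped.
    """
    grouped = defaultdict(list)
    for line in output_lines:
        match = ERROR_RE.match(line.strip())
        if match is None:
            continue
        lineno, sent_id, message = match.groups()
        grouped[sent_id].append((int(lineno), message))
    return dict(grouped)
```

Grouping by sentence id rather than by line number keeps the result meaningful when several files were validated in one run.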

msklvsk commented 5 years ago

Cool!

Off topic: I have some objections to the new validation rules.

Are the new validation rules mature enough that a separate issue can be opened to discuss them?

dan-zeman commented 5 years ago

The new validation rules are mature enough for separate issues to be opened (preferably in the docs issue tracker) about individual rules. Otherwise they are not mature, which means that discussion is welcome :-)