clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Validation action does not fail when UD features are wrong #540

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

https://github.com/clarin-eric/ParlaMint/actions/runs/3711782252/jobs/6293360360#step:4:261

2022-12-16T10:08:48.6338971Z INFO: Validating level 2: ParlaMint-ES-CT_2018-01-17-0101
2022-12-16T10:08:48.6339589Z [Line 5 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-upos] Invalid UPOS value 'adj'.
2022-12-16T10:08:48.6340246Z [Line 5 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho unknown-upos] Unknown UPOS tag: 'adj'.
2022-12-16T10:08:48.6341101Z [Line 5 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-feature] Spurious morphological feature: 'gen=masculine'. Should be of the form Feature=Value and must start with [A-Z] and only contain [A-Za-z0-9].
2022-12-16T10:08:48.6342057Z [Line 5 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-feature] Spurious morphological feature: 'num=singular'. Should be of the form Feature=Value and must start with [A-Z] and only contain [A-Za-z0-9].
2022-12-16T10:08:48.6343013Z [Line 5 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-feature] Spurious morphological feature: 'type=qualificative'. Should be of the form Feature=Value and must start with [A-Z] and only contain [A-Za-z0-9].
2022-12-16T10:08:48.6343756Z [Line 6 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-upos] Invalid UPOS value 'noun'.
2022-12-16T10:08:48.6344325Z [Line 6 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho unknown-upos] Unknown UPOS tag: 'noun'.
2022-12-16T10:08:48.6345117Z [Line 6 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-feature] Spurious morphological feature: 'gen=masculine'. Should be of the form Feature=Value and must start with [A-Z] and only contain [A-Za-z0-9].
2022-12-16T10:08:48.6346048Z [Line 6 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-feature] Spurious morphological feature: 'num=singular'. Should be of the form Feature=Value and must start with [A-Z] and only contain [A-Za-z0-9].
2022-12-16T10:08:48.6347023Z [Line 6 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-feature] Spurious morphological feature: 'type=common'. Should be of the form Feature=Value and must start with [A-Z] and only contain [A-Za-z0-9].
2022-12-16T10:08:48.6347740Z [Line 7 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-upos] Invalid UPOS value 'adp'.
2022-12-16T10:08:48.6348299Z [Line 7 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho unknown-upos] Unknown UPOS tag: 'adp'.
2022-12-16T10:08:48.6349104Z [Line 7 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-feature] Spurious morphological feature: 'type=preposition'. Should be of the form Feature=Value and must start with [A-Z] and only contain [A-Za-z0-9].
2022-12-16T10:08:48.6349830Z [Line 8 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-upos] Invalid UPOS value 'pron'.
2022-12-16T10:08:48.6350398Z [Line 8 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho unknown-upos] Unknown UPOS tag: 'pron'.
2022-12-16T10:08:48.6351182Z [Line 8 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-feature] Spurious morphological feature: 'gen=common'. Should be of the form Feature=Value and must start with [A-Z] and only contain [A-Za-z0-9].
2022-12-16T10:08:48.6352473Z [Line 8 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-feature] Spurious morphological feature: 'num=singular'. Should be of the form Feature=Value and must start with [A-Z] and only contain [A-Za-z0-9].
2022-12-16T10:08:48.6353703Z [Line 8 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-feature] Spurious morphological feature: 'type=indefinite'. Should be of the form Feature=Value and must start with [A-Z] and only contain [A-Za-z0-9].
2022-12-16T10:08:48.6354510Z [Line 9 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.0.1]: [L2 Morpho invalid-upos] Invalid UPOS value 'punct'.
2022-12-16T10:08:48.6354959Z ...suppressing further errors regarding Morpho
2022-12-16T10:08:48.6355504Z [Line 516 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.8.1]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
2022-12-16T10:08:48.6356370Z [Tree number 12 on line 506 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.8.1]: [L2 Syntax multiple-roots] Multiple root words: [2, 10]
2022-12-16T10:08:48.6357196Z [Tree number 12 on line 506 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.8.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
2022-12-16T10:08:48.6357890Z [Tree number 23 on line 947 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.2.1]: [L2 Syntax multiple-roots] Multiple root words: [4, 36]
2022-12-16T10:08:48.6358611Z [Tree number 23 on line 947 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.2.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
2022-12-16T10:08:48.6359357Z [Line 1470 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.5.4]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
2022-12-16T10:08:48.6360018Z [Line 1470 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.5.4]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
2022-12-16T10:08:48.6360708Z [Tree number 34 on line 1378 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.5.4]: [L2 Syntax multiple-roots] Multiple root words: [1, 3, 4]
2022-12-16T10:08:48.6361495Z [Tree number 34 on line 1378 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.5.4]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
2022-12-16T10:08:48.6362260Z [Tree number 38 on line 1613 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.6.1]: [L2 Syntax multiple-roots] Multiple root words: [4, 16]
2022-12-16T10:08:48.6363029Z [Tree number 38 on line 1613 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.6.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
2022-12-16T10:08:48.6363724Z [Line 3016 Sent ParlaMint-ES-CT_2018-01-17-0101.3.0.0.1]: [L2 Syntax head-self-loop] HEAD == ID for 28
2022-12-16T10:08:48.6364413Z [Tree number 76 on line 2989 Sent ParlaMint-ES-CT_2018-01-17-0101.3.0.0.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
2022-12-16T10:08:48.6365079Z [Line 3518 Sent ParlaMint-ES-CT_2018-01-17-0101.3.0.7.1]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
2022-12-16T10:08:48.6365707Z [Tree number 83 on line 3425 Sent ParlaMint-ES-CT_2018-01-17-0101.3.0.7.1]: [L2 Syntax multiple-roots] Multiple root words: [2, 93]
2022-12-16T10:08:48.6366417Z [Tree number 83 on line 3425 Sent ParlaMint-ES-CT_2018-01-17-0101.3.0.7.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
2022-12-16T10:08:48.6367051Z [Line 3584 Sent ParlaMint-ES-CT_2018-01-17-0101.5.0.0.1]: [L2 Syntax head-self-loop] HEAD == ID for 4
2022-12-16T10:08:48.6367725Z [Tree number 87 on line 3581 Sent ParlaMint-ES-CT_2018-01-17-0101.5.0.0.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
2022-12-16T10:08:48.6368394Z [Line 4126 Sent ParlaMint-ES-CT_2018-01-17-0101.5.0.17.1]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
2022-12-16T10:08:48.6369036Z [Tree number 104 on line 4114 Sent ParlaMint-ES-CT_2018-01-17-0101.5.0.17.1]: [L2 Syntax multiple-roots] Multiple root words: [2, 12]
2022-12-16T10:08:48.6369847Z [Tree number 104 on line 4114 Sent ParlaMint-ES-CT_2018-01-17-0101.5.0.17.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
2022-12-16T10:08:48.6370513Z [Line 5488 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.6.1]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
2022-12-16T10:08:48.6371106Z [Line 5488 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.6.1]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
2022-12-16T10:08:48.6371757Z [Line 5488 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.6.1]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
2022-12-16T10:08:48.6372331Z [Line 5488 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.6.1]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
2022-12-16T10:08:48.6372949Z [Tree number 150 on line 5479 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.6.1]: [L2 Syntax multiple-roots] Multiple root words: [1, 2, 6, 8]
2022-12-16T10:08:48.6373661Z [Tree number 150 on line 5479 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.6.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
2022-12-16T10:08:48.6374349Z [Tree number 162 on line 5723 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.12.2]: [L2 Syntax multiple-roots] Multiple root words: [5, 10]
2022-12-16T10:08:48.6375062Z [Tree number 162 on line 5723 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.12.2]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
2022-12-16T10:08:48.6375488Z Format errors: 10
2022-12-16T10:08:48.6375721Z Morpho errors: 35984
2022-12-16T10:08:48.6375947Z Syntax errors: 19
2022-12-16T10:08:48.6376181Z *** FAILED *** with 36013 errors
matyaskopp commented 1 year ago

@TomazErjavec, currently, L2 validation produces just warnings, so if someone has wrong features, it does not fail. Do you think that Morpho errors should produce errors?

TomazErjavec commented 1 year ago

Do you think that Morpho errors should produce errors?

Yes, I do. Level 1 errors = errors, Level 2 errors = warnings would be my suggestion.

matyaskopp commented 1 year ago

OK, leaving it as it is now: [image]

But I think we can be stricter, at least for morphology. If someone provides a spaceless random mess in @msd, it shows only a warning.

TomazErjavec commented 1 year ago

But I think we can be stricter, at least for morphology.

Yes, I agree(d), I guess I wasn't clear before. What I meant to say is that if morphology is not OK, that should be an error. If syntax is not OK, that should probably be just a warning.
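A minimal sketch of that policy, assuming the validator messages keep the `[L<level> <Class> <test-id>]` shape seen in the log above. The function names `count_messages` and `should_fail` are hypothetical and this is not the code used by the ParlaMint workflow; it only illustrates "Level 1 or Morpho = error, everything else = warning".

```python
import re
from collections import Counter

# Matches the "[L2 Morpho invalid-upos]" part of a validator message.
# The leading "[Line 5 Sent ...]" bracket is skipped because "L" is
# followed by a letter there, not a digit.
MSG_RE = re.compile(r"\[L(\d+) (\w+) ([\w-]+)\]")


def count_messages(log_lines):
    """Count validator messages per (level, class), e.g. (2, 'Morpho')."""
    counts = Counter()
    for line in log_lines:
        match = MSG_RE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


def should_fail(counts):
    """Assumed policy: any Level 1 message or any Morpho message is fatal."""
    return any(level == 1 or cls == "Morpho" for (level, cls) in counts)


if __name__ == "__main__":
    log = [
        "[Line 5 Sent s1]: [L2 Morpho invalid-upos] Invalid UPOS value 'adj'.",
        "[Line 516 Sent s2]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.",
    ]
    counts = count_messages(log)
    print(dict(counts), "fail" if should_fail(counts) else "warn only")
```

With this split, a log containing only Syntax and Format messages would be reported but would not fail the run.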

matyaskopp commented 1 year ago

Done. We will see if it works once ES-CT is synced: https://github.com/IULATERM-TRL-UPF/ParlaMint/pull/3

rjzevallos commented 1 year ago

I have a question about this validation. When I run "make conllu-ES-CT", I get some errors:

[Line 516 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.8.1]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
[Tree number 12 on line 506 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.8.1]: [L2 Syntax multiple-roots] Multiple root words: [2, 10]
[Tree number 12 on line 506 Sent ParlaMint-ES-CT_2018-01-17-0101.1.0.8.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
[Tree number 23 on line 947 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.2.1]: [L2 Syntax multiple-roots] Multiple root words: [4, 36]
[Tree number 23 on line 947 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.2.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
[Line 1470 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.5.4]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
[Line 1470 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.5.4]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
[Tree number 34 on line 1378 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.5.4]: [L2 Syntax multiple-roots] Multiple root words: [1, 3, 4]
[Tree number 34 on line 1378 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.5.4]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
[Tree number 38 on line 1613 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.6.1]: [L2 Syntax multiple-roots] Multiple root words: [4, 16]
[Tree number 38 on line 1613 Sent ParlaMint-ES-CT_2018-01-17-0101.2.0.6.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
[Line 3016 Sent ParlaMint-ES-CT_2018-01-17-0101.3.0.0.1]: [L2 Syntax head-self-loop] HEAD == ID for 28
[Tree number 76 on line 2989 Sent ParlaMint-ES-CT_2018-01-17-0101.3.0.0.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
[Line 3518 Sent ParlaMint-ES-CT_2018-01-17-0101.3.0.7.1]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
[Tree number 83 on line 3425 Sent ParlaMint-ES-CT_2018-01-17-0101.3.0.7.1]: [L2 Syntax multiple-roots] Multiple root words: [2, 93]
[Tree number 83 on line 3425 Sent ParlaMint-ES-CT_2018-01-17-0101.3.0.7.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
[Line 3584 Sent ParlaMint-ES-CT_2018-01-17-0101.5.0.0.1]: [L2 Syntax head-self-loop] HEAD == ID for 4
[Tree number 87 on line 3581 Sent ParlaMint-ES-CT_2018-01-17-0101.5.0.0.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
[Line 4126 Sent ParlaMint-ES-CT_2018-01-17-0101.5.0.17.1]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
[Tree number 104 on line 4114 Sent ParlaMint-ES-CT_2018-01-17-0101.5.0.17.1]: [L2 Syntax multiple-roots] Multiple root words: [2, 12]
[Tree number 104 on line 4114 Sent ParlaMint-ES-CT_2018-01-17-0101.5.0.17.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
[Line 5488 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.6.1]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
[Line 5488 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.6.1]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
[Line 5488 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.6.1]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
[Line 5488 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.6.1]: [L2 Syntax 0-is-not-root] DEPREL must be 'root' if HEAD is 0.
[Tree number 150 on line 5479 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.6.1]: [L2 Syntax multiple-roots] Multiple root words: [1, 2, 6, 8]
[Tree number 150 on line 5479 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.6.1]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
[Tree number 162 on line 5723 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.12.2]: [L2 Syntax multiple-roots] Multiple root words: [5, 10]
[Tree number 162 on line 5723 Sent ParlaMint-ES-CT_2018-01-17-0101.16.0.12.2]: [L2 Format skipped-corrupt-tree] Skipping annotation tests because of corrupt tree structure.
Format errors: 10
Syntax errors: 19
FAILED with 29 errors

How can I fix it?

TomazErjavec commented 1 year ago

Not so simple to fix. As many annotation tools seem to produce these bugs, we have changed this to a warning (see above), so you could leave it as is. But if you want to fix it, see #474 for some suggestions and discussion.

matyaskopp commented 1 year ago

How can I fix it?

These kinds of errors are not deal-breaking errors. We do not insist on L2 syntax and L2 format validity, but these should be rare errors - only in some obscure sentences, not over the whole corpus.

I can see two possible solutions:

  1. Use a better annotation tool.
  2. Reduce the number of errors by postprocessing: replace a root relation in the middle of a sentence with a dep relation, and fix 0-is-not-root by setting the relation to root.
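The second suggestion could be sketched roughly as follows. This is one possible reading of it, not a script from the ParlaMint repository: the function name `fix_roots` is hypothetical, and the choice to reattach extra HEAD=0 words to a single kept root is an assumption. It works on the lines of one CoNLL-U sentence and does not try to repair `head-self-loop` errors.

```python
def fix_roots(sentence_lines):
    """Return the sentence with exactly one root among its HEAD=0 words.

    sentence_lines: CoNLL-U lines of one sentence (comments and tokens),
    without trailing newlines.
    """
    rows = [line.split("\t") for line in sentence_lines]

    # Only plain word lines: 10 columns and an integer ID. Comments,
    # multiword-token ranges ("3-4") and empty nodes ("3.1") are skipped.
    words = [c for c in rows if len(c) == 10 and c[0].isdigit()]
    zero_heads = [c for c in words if c[6] == "0"]
    if not zero_heads:
        return list(sentence_lines)

    # Assumption: keep a word already labelled root if there is one,
    # otherwise keep the first word whose HEAD is 0.
    main = next((c for c in zero_heads if c[7] == "root"), zero_heads[0])

    for cols in words:
        if cols is main:
            cols[7] = "root"      # fixes 0-is-not-root on the kept root
        elif cols[6] == "0":
            cols[6] = main[0]     # extra root: attach it to the kept root
            cols[7] = "dep"
        elif cols[7] == "root":
            cols[7] = "dep"       # root label on a word whose HEAD is not 0
    return ["\t".join(cols) for cols in rows]


if __name__ == "__main__":
    sent = [
        "# sent_id = example",
        "1\tA\ta\tNOUN\t_\t_\t0\tnsubj\t_\t_",
        "2\tB\tb\tVERB\t_\t_\t0\troot\t_\t_",
        "3\tC\tc\tNOUN\t_\t_\t2\troot\t_\t_",
    ]
    print("\n".join(fix_roots(sent)))
```

A rewrite like this only silences the validator; the `dep` attachments it creates carry no linguistic information, so it is best reserved for the rare sentences where the parser output is already broken.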