clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

DK: failing CoNLLu validation in sample data #114

Closed matyaskopp closed 2 years ago

matyaskopp commented 2 years ago

This error also appears in the original log that was produced during ParlaMint sample creation. @TomazErjavec reported by email (2021-04-20):

There are still some mistakes in CoNLL-U, ... Also, sometimes the CoNLL-U validation dies, probably because a sentence is too long, just ignore that.

This error appears only in the DK corpus. It can probably be fixed, because ParlaMint-DK_20141008130437.seg2.7 does not look like a sentence; I guess it can be split into multiple segments. The source data (https://www.ft.dk/forhandlinger/20141/20141M002_2014-10-08_1300.htm) looks like multiple glued lists to me.
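To find other segments of this kind, a rough sketch like the one below could list token counts per sentence in a CoNLL-U file. It relies only on the standard CoNLL-U layout (`# sent_id = ...` comments, tab-separated token lines, blank line between sentences). The function name `sentence_lengths` and the sample data are made up for illustration, and length is only a proxy: the validator crashed on tree depth, not on token count as such.

```python
def sentence_lengths(conllu_text):
    """Map each sent_id to its number of syntactic words (hypothetical helper)."""
    lengths = {}
    sent_id = None
    count = 0
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
        elif line.startswith("#"):
            continue  # other comment lines
        elif not line.strip():
            # A blank line closes the current sentence.
            if sent_id is not None:
                lengths[sent_id] = count
            sent_id, count = None, 0
        else:
            token_id = line.split("\t", 1)[0]
            # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
            if token_id.isdigit():
                count += 1
    # Handle a final sentence that is not followed by a blank line.
    if sent_id is not None:
        lengths[sent_id] = count
    return lengths


# Invented sample data, not taken from the ParlaMint-DK corpus.
sample = (
    "# sent_id = s1\n"
    "1\tHej\t_\t_\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = s2\n"
    "1-2\tab\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\ta\t_\t_\t_\t_\t0\troot\t_\t_\n"
    "2\tb\t_\t_\t_\t_\t1\tdep\t_\t_\n"
)
lengths = sentence_lengths(sample)
```

Sorting the result by count, largest first, would surface candidates for manual inspection.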

error:

https://github.com/clarin-eric/ParlaMint/runs/4704545202?check_suite_focus=true#step:4:188

[Line 8029 Sent ParlaMint-DK_20141008130437.seg2.7]: [L0 Format some-test] Exception caught!
Traceback (most recent call last):
  File "/home/runner/work/ParlaMint/ParlaMint/ParlaMint/Scripts/tools/validate.py", line 2293, in <module>
    validate(inp,out,args,tagsets,known_sent_ids)
  File "/home/runner/work/ParlaMint/ParlaMint/ParlaMint/Scripts/tools/validate.py", line 1902, in validate
    tree = build_tree(sentence) # level 2 test: tree is single-rooted, connected, cycle-free
  File "/home/runner/work/ParlaMint/ParlaMint/ParlaMint/Scripts/tools/validate.py", line 1117, in build_tree
    get_projection(0, tree, projection)
  File "/home/runner/work/ParlaMint/ParlaMint/ParlaMint/Scripts/tools/validate.py", line 1135, in get_projection
    get_projection(child, tree, projection)
  File "/home/runner/work/ParlaMint/ParlaMint/ParlaMint/Scripts/tools/validate.py", line 1135, in get_projection
    get_projection(child, tree, projection)
  File "/home/runner/work/ParlaMint/ParlaMint/ParlaMint/Scripts/tools/validate.py", line 1135, in get_projection
    get_projection(child, tree, projection)
  [Previous line repeated 993 more times]
  File "/home/runner/work/ParlaMint/ParlaMint/ParlaMint/Scripts/tools/validate.py", line 1134, in get_projection
    projection.add(child)
RecursionError: maximum recursion depth exceeded while calling a Python object
Format errors: 1
*** FAILED *** with 1 errors

matyaskopp commented 2 years ago

The current version of the CoNLL-U validator no longer raises the recursion depth exception: https://github.com/UniversalDependencies/tools/commit/67cdccbac56ffc8a3801b34e05e0fa9052031c9f

But I still suggest considering this issue. @BartJongejan, if you don't want to fix it, close it.
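
For illustration, one general way to avoid this kind of RecursionError is to replace the recursive descent with an explicit stack. The sketch below is not the validator's actual code and is not necessarily how the linked commit handles it. The function name `get_projection_iterative` and the data layout (a list where `children[i]` holds the child ids of node `i`) are assumptions based only on the traceback above.

```python
def get_projection_iterative(node_id, children, projection=None):
    """Collect all descendants of node_id without using recursion.

    `children` is assumed to be a list where children[i] lists the
    child ids of node i (a hypothetical layout for this sketch).
    """
    if projection is None:
        projection = set()
    stack = [node_id]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child in projection:
                continue  # already visited; this also guards against cycles
            projection.add(child)
            stack.append(child)
    return projection


# A chain 0 -> 1 -> ... -> 5000, deeper than Python's default recursion limit.
depth = 5000
children = [[i + 1] for i in range(depth)] + [[]]
result = get_projection_iterative(0, children)
```

Because the pending nodes live in an ordinary list, the depth of the tree is limited only by memory, so a very long chain of dependents, as a glued list might produce, would not exhaust the interpreter's call stack.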

BartJongejan commented 2 years ago

I have checked whether our segmenter/tokenizer program has done something unexpected. My and my colleagues' conclusion is that it didn't. So, for this particular 'sentence', the only option left is to manually split the sentence. We have decided not to do that, since it is an improvement with little value. Please close this issue. (I cannot do that, it seems.)

matyaskopp commented 2 years ago

Ok, thanks for checking it.