andreasvc / disco-dop

Discontinuous Data-Oriented Parsing
http://discodop.readthedocs.io
GNU General Public License v2.0
46 stars 16 forks source link

Alpino tree, Value error when handling secondary edges #66

Closed kilian-gebhardt closed 5 years ago

kilian-gebhardt commented 5 years ago

When converting the attached tree from the Lassy corpus (or drawing with secondary edges), I obtain the following error:

discodop treetransforms --inputfmt=alpino WR-P-E-I-0000015007/WR-P-E-I-0000015007.p.1.s.114.xml /tmp/output.export
Traceback (most recent call last):
  File "/home/kilian/miniconda3/envs/disco-dop-upstream/bin/discodop", line 11, in <module>
    load_entry_point('disco-dop==0.6a0', 'console_scripts', 'discodop')()
  File "/home/kilian/.local/lib/python3.7/site-packages/disco_dop-0.6a0-py3.7-linux-x86_64.egg/discodop/cli.py", line 42, in main
    globals()[cmd]()
  File "/home/kilian/.local/lib/python3.7/site-packages/disco_dop-0.6a0-py3.7-linux-x86_64.egg/discodop/cli.py", line 356, in treetransforms
    fmt, comment=item.comment, sentid=sentid))
  File "/home/kilian/.local/lib/python3.7/site-packages/disco_dop-0.6a0-py3.7-linux-x86_64.egg/discodop/treebank.py", line 573, in writetree
    result = writeexporttree(tree, sent, key, comment, morphology)
  File "/home/kilian/.local/lib/python3.7/site-packages/disco_dop-0.6a0-py3.7-linux-x86_64.egg/discodop/treebank.py", line 627, in writeexporttree
    secedges.append('%d' % (500 + nodeidindex.index(int(pid))))
ValueError: 39 is not in list

WR-P-E-I-0000015007.p.1.s.114.xml.txt

andreasvc commented 5 years ago

Thanks for the report. The problem is that the secondary edge refers to a node that does not dominate any terminals. I have to decide how to handle such cases. The obvious workaround is to ignore such secondary edges, because such empty nodes are filtered out as well (line 620). Additionally, in this sentence, the annotation looks off to me.

andreasvc commented 5 years ago

Here's the tree. Basically, two adjectives modifying two nouns, in all combinations ...

WR-P-E-I-0000015007 p 1 s 114 xml

kilian-gebhardt commented 5 years ago

Due to my limited knowledge of Dutch it is hard for me to judge this annotation. So if I understand correctly, there are "jewish wounded ones", "jewish killed ones", "arabic wounded ones", and "arabic killed ones". I have not yet looked how discodop decides on the primary annotation. Maybe an annotation similar to this one from TiGer is more suitable, but perhaps hard to produce automatically during conversion.

I think it is reasonable (and preferable to throwing an exception) to drop empty categories and secondary edges referring to them (and maybe output a warning during the conversion).

andreasvc commented 5 years ago

I went for the pragmatic solution of ignoring these problematic secondary edges, and printing a warning. I checked and there are 6 sentences in Lassy Small with this issue.

The assumption that every node must dominate one or more terminals is baked into all of the code (I think this is also assumed by the Negra export format). Therefore it would be rather difficult to preserve these annotations exactly.

kilian-gebhardt commented 5 years ago

Thanks.

The assumption that every node must dominate one or more terminals is baked into all of the code (I think this is also assumed by the Negra export format). Therefore it would be rather difficult to preserve these annotations exactly.

I cannot directly find such a restriction in the NeGra export format documentation, which apparently also does not state that the structures need to be rooted. Technically it is possible to have phrasal nodes in the export format that do not have child nodes. However, I agree that supporting this in discodop is not worth the effort – no one makes use of secondary edges as far as I know.

andreasvc commented 5 years ago

See Skut et al 1997, https://arxiv.org/pdf/cmp-lg/9702004.pdf: "The following features of our formalism are then of particular importance: [...] complete absence of empty categories" Perhaps one could argue that the restriction is not part of the Negra export format as such but only a design goal of the Negra annotation scheme.