UniversalDependencies / tools

Various utilities for processing the data.
GNU General Public License v2.0

bug in FixPunct? #21

Closed: gossebouma closed this issue 6 years ago

gossebouma commented 6 years ago

My input file, test.conllu, contains:

# sent_id = wiki-11.p.10.s.1.xml
# text = (*)
1       (       (       PUNCT   LET     _       0       root    _       _
2       *       *       PUNCT   LET     _       1       punct   _       _
3       )       )       PUNCT   LET     _       1       punct   _       _

After running

udapy -s ud.FixPunct < test.conllu

I get:

# sent_id = wiki-11.p.10.s.1.xml
# text = (*)
1       (       (       PUNCT   LET     _       0       root    _       _
2       *       *       PUNCT   LET     _       0       punct   _       _
3       )       )       PUNCT   LET     _       0       punct   _       _

The string is pathological, but this is still a nasty bug: all three tokens now have head 0, while tokens 2 and 3 keep the deprel punct. (It took me a while to realize that it was not my conversion script that was wrong.)
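A quick way to catch output like this is to count the tokens whose HEAD column is 0. The sketch below is plain Python and not part of udapy; the function name `count_roots` is hypothetical, and it assumes the columns contain no internal spaces, so that splitting on whitespace works for both tab-separated files and the pasted listings above.

```python
def count_roots(conllu_sentence: str) -> int:
    """Count tokens whose HEAD column (7th field) is 0.

    Assumption: no column contains spaces, so str.split() recovers
    the ten CoNLL-U fields from tab- or space-separated input.
    """
    roots = 0
    for line in conllu_sentence.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines such as "# text = (*)".
        if not line or line.startswith("#"):
            continue
        cols = line.split()
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not cols[0].isdigit():
            continue
        if cols[6] == "0":
            roots += 1
    return roots
```

A well-formed sentence should give a count of exactly 1; the output reported above gives 3.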

martinpopel commented 6 years ago

Thanks for reporting this; it should be fixed now. I agree that multiple roots are a more severe error than punctuation with children (and obviously we need to break one of these two rules in this sentence, unless we change the tag of * from PUNCT to something else).
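The second of the two rules mentioned here (punctuation should not have children) can be checked in the same spirit. This is a minimal sketch in plain Python, not the check that ud.FixPunct or the UD validator actually runs; `punct_with_children` is a hypothetical name, and the same no-spaces-in-columns assumption applies.

```python
def punct_with_children(conllu_sentence: str) -> list:
    """Return the IDs of PUNCT tokens that are the head of another token."""
    upos_by_id = {}
    heads = []
    for line in conllu_sentence.splitlines():
        line = line.strip()
        # Skip blank lines and "#" comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split()
        # Skip multiword-token ranges and empty nodes.
        if not cols[0].isdigit():
            continue
        upos_by_id[cols[0]] = cols[3]  # ID -> UPOS
        heads.append(cols[6])          # HEAD of this token
    # HEAD "0" (the artificial root) is never in upos_by_id, so it is ignored.
    offenders = {h for h in heads if upos_by_id.get(h) == "PUNCT"}
    return sorted(offenders, key=int)
```

On the original input this reports token 1, because the opening parenthesis governs the other two tokens. On the buggy output it reports nothing, but only because every token was turned into a root.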

dan-zeman commented 6 years ago

BTW, a PUNCT enclosed in paired punctuation (parentheses, quotes) occurs in other corpora too, not necessarily as a child of the root node. I think it would be natural in these cases to allow the paired punctuation to be attached to the symbol inside.
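For the sentence in this report, one possible reading of that suggestion is to make the enclosed symbol the root and attach both brackets to it. The sketch below only illustrates that reading; it is not what ud.FixPunct does. The function name, the `PAIRS` table, and the dict-based token format are all hypothetical, and only the minimal three-token case (opener, one symbol, matching closer) is handled.

```python
# Hypothetical table of opening -> closing paired punctuation.
PAIRS = {"(": ")", "[": "]", "{": "}", '"': '"'}

def attach_pair_to_inner(tokens):
    """Reattach a bracket pair to the single token it encloses.

    tokens: list of dicts with keys "id" (int), "form", "head" (int), "deprel".
    Returns a modified copy; input that is not exactly
    opener + one token + matching closer is returned unchanged.
    """
    result = [dict(t) for t in tokens]  # do not mutate the caller's data
    if len(result) == 3 and PAIRS.get(result[0]["form"]) == result[2]["form"]:
        inner = result[1]
        inner["head"], inner["deprel"] = 0, "root"
        for outer in (result[0], result[2]):
            outer["head"], outer["deprel"] = inner["id"], "punct"
    return result
```

Applied to the (*) sentence, this yields a single root (token 2) with tokens 1 and 3 attached to it as punct. The inner token is still tagged PUNCT and still has children, so the "punctuation with children" rule would have to be relaxed for this configuration.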