UniversalDependencies / tools

Various utilities for processing the data.
GNU General Public License v2.0

validate.py does not pick up presentential comments #56

Closed beemorris closed 1 year ago

beemorris commented 4 years ago

UDPipe complains if there is a comment in front of a sentence, but the validator doesn't pick this up. Is this an issue with UDPipe (e.g. does the format allow pre-sentential comments?) or is it an issue with the validator? Here is an example:

# can't find this in common voice file
# sent_id = 349
# text = Uico a rak rang tuk.
# text_en = The dog was very fast.
1       Uico    uico    NOUN    _       _       4       nsubj   _       dog
2       a       a       PRON    _       Number=Sing|Person=3    4       expl    _       3SG
3       rak     rak     PART    _       _       4       discourse       _       PERF
4       rang    rang    VERB    _       _       0       root    _       fast
5       tuk     tuk     ADV     _       _       4       advmod  _       very|SpaceAfter=No
6       .       PUNCT   PUNCT   _       _       4       punct   _       _

# can't find this in common voice file
# sent_id = 350
# text = Mei a vun sen.
# text_en = The light immediately turned red.
1       Mei     mei     NOUN    _       _       4       nsubj   _       light
2       a       a       PRON    _       Number=Sing|Person=3    4       expl    _       3SG
3       vun     vun     ADV     _       _       4       advmod  _       immediately
4       sen     sen     VERB    _       _       0       root    _       red|SpaceAfter=No
5       .       PUNCT   PUNCT   _       _       4       punct   _       _

Here is the output from UDPipe:

[Line 1814 Sent 214]: [L1 Format misplaced-comment] Spurious comment line. Comments are only allowed before a sentence.

dan-zeman commented 4 years ago

Your examples look OK to me. In CoNLL-U, all comments are presentential, i.e., they must occur before the line of the first token of the sentence. See the CoNLL-U format specification. I think UDPipe can read comments.

Or did you mean by "presentential" the fact that the "sent_id" comment is not the first comment? But that is formally okay as well. There must be just one "sent_id" comment but its relative position to other comments is not prescribed.
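To illustrate the "just one sent_id, anywhere among the comments" rule, here is a minimal sketch that counts `sent_id` comments per sentence block. It is not taken from validate.py; the function name `sent_id_counts` and the regular expression are assumptions made for this example, and it only assumes that sentences are separated by blank lines.

```python
import re

# Hypothetical pattern for a sent_id comment such as "# sent_id = 349".
SENT_ID_RE = re.compile(r"^#\s*sent_id\s*=\s*(\S+)\s*$")


def sent_id_counts(conllu_text):
    """Return the number of sent_id comments found in each sentence block.

    Blocks are separated by blank lines. The position of the sent_id
    comment relative to other comments is deliberately not checked.
    """
    counts = []
    current = None  # None means we are between sentence blocks
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current block, if there is one.
            if current is not None:
                counts.append(current)
                current = None
            continue
        if current is None:
            current = 0
        if SENT_ID_RE.match(line):
            current += 1
    # Handle a final block that is not followed by a blank line.
    if current is not None:
        counts.append(current)
    return counts
```

Any block whose count is not exactly 1 would violate the rule; a block where another comment precedes the `sent_id` line, as in the examples above, still counts as 1.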

dan-zeman commented 4 years ago

The error message you list (which btw looks quite like the output from validate.py :-)) could actually mean that the previous sentence was not followed by a blank line. In that case the script thinks it is still reading the previous sentence, so the current comment appears to occur in the middle or at the end of that sentence.
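A rough way to check for that situation is to look for comment lines that directly follow token lines with no blank line in between. The sketch below is a simplified illustration, not the actual logic of validate.py; the function name `find_misplaced_comments` is made up for this example, and it assumes that every non-blank line not starting with `#` is a token line.

```python
def find_misplaced_comments(conllu_text):
    """Return 1-based line numbers of comments that follow a token line
    without an intervening blank line."""
    misplaced = []
    seen_token = False  # True once a token line was read in the current block
    for lineno, line in enumerate(conllu_text.splitlines(), start=1):
        if not line.strip():
            # A blank line ends the sentence, so comments are allowed again.
            seen_token = False
        elif line.startswith("#"):
            if seen_token:
                misplaced.append(lineno)
        else:
            seen_token = True
    return misplaced


# Two sentences separated by a blank line: nothing is reported.
good = "# note\n# sent_id = 1\n1\ta\n2\tb\n\n# note\n# sent_id = 2\n1\tc\n"
# The same data with the separating blank line removed.
bad = good.replace("\n\n", "\n")

print(find_misplaced_comments(good))  # []
print(find_misplaced_comments(bad))   # [5, 6]
```

The line number in the original message (line 1814) would then point at the first comment after the sentence that is missing its trailing blank line, so the thing to inspect is the end of the sentence just before it.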