UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Make it clearer that comments/metadata are not allowed between tokens #970

Closed by rhdunn 1 year ago

rhdunn commented 1 year ago

The first paragraph of the CoNLL-U Format states:

with three types of lines: ... 3. Comment lines starting with hash (#).

That part does not explicitly forbid comments from appearing anywhere in the conllu file.

The "Sentence Boundaries and Comments" section contains the text:

Comment and metadata lines inside sentences (i.e., between the token lines) are disallowed.

This is tucked away at the end of the second paragraph in this section, which makes it harder to understand where comments apply, especially in light of the initial paragraph in the spec.

@jnivre This makes the usage in the Swedish treebanks invalid according to the format spec.

This should be updated to clarify the restriction at the start of the document and make the restriction sentence more prominent (at least as part of its own paragraph).

jnivre commented 1 year ago

Thanks for bringing this to my attention. This is clearly something that has been added at a later stage and I don’t know whether the validator enforces it. If this is indeed the current standard, then we have to fix the Swedish data. Personally, I think it is in the nature of comment lines that they can occur anywhere because they are unambiguously marked as such, but I am happy to bow to the majority.

Joakim

nschneid commented 1 year ago

Has any treebank adopted a MISC field for ad hoc human-readable comments, e.g. Note=... or Comment=...?

Occasionally, for example, I have the urge to explain a tricky/ambiguous annotation decision that affects a token or phrase.

martinpopel commented 1 year ago

the urge to explain a tricky/ambiguous annotation decision that affects a token or phrase

We could still use comment lines before the sentence for such urges, e.g. # Note = Words 42-44 could also be annotated as NOUN NOUN VERB. I would prefer this even if mid-sentence comments were allowed in CoNLL-U, because I can export such notes into any tool or visualization and they still make sense (imagine a visualization of a dependency tree with a caption below the whole tree: "Note: the following three words could also be annotated as NOUN NOUN VERB." :-) ).
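As a rough illustration of that idea, here is a minimal Python sketch that places a note after any existing sentence-level comments and before the first token line. The helper name add_note is hypothetical and not part of any UD tool; the "# Note = ..." key simply follows the example above.

```python
def add_note(sentence_lines, note):
    """Return a copy of one sentence's lines with a '# Note = ...' comment
    placed after existing comments and before the first token line.

    Hypothetical helper for illustration; not part of any UD tool.
    """
    # Index of the first line that is not a comment, i.e. the first token line.
    first_token = next(
        (i for i, line in enumerate(sentence_lines) if not line.startswith("#")),
        len(sentence_lines),
    )
    return (
        sentence_lines[:first_token]
        + ["# Note = " + note]
        + sentence_lines[first_token:]
    )
```

Because the note stays in the comment block at the top of the sentence, it never ends up between token lines.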

foxik commented 1 year ago

Thanks for bringing this to my attention. This is clearly something that has been added at a later stage and I don’t know whether the validator enforces it.

Not really, it has been part of the CoNLL-U specification since 2014 -- the commit that introduced it is 50e9e75649eef8a1b5687f920eb5f7dfc1fa5242.

When I was implementing the UDPipe CoNLL-U reader/writer (~end of 2015), it was definitely part of the specs (I learnt it from this file and have been operating with that rule ever since). I suspect existing tools might handle such comments poorly. For example, UDPipe just ignores all comments inside sentences (it does not crash, but it does not copy them to the output).
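To make that behaviour concrete, here is a minimal Python sketch of a reader that keeps comments placed before the first token line and does not copy comments found between token lines, only recording their line numbers. This is not UDPipe's actual code; the function name read_sentences and the returned dict layout are assumptions made for illustration.

```python
def read_sentences(text):
    """Split CoNLL-U text into sentences.

    Hypothetical helper for illustration only; not UDPipe's reader.
    Comment lines before the first token line are kept. Comment lines
    after a token line are not copied; only their line numbers are
    recorded under "misplaced".
    """
    sentences = []
    comments, tokens, misplaced = [], [], []
    # The extra empty string flushes the last sentence even if the
    # input does not end with a blank line.
    for lineno, line in enumerate(text.split("\n") + [""], start=1):
        if not line.strip():
            # A blank line ends the current sentence, if there is one.
            if comments or tokens:
                sentences.append(
                    {"comments": comments, "tokens": tokens, "misplaced": misplaced}
                )
            comments, tokens, misplaced = [], [], []
        elif line.startswith("#"):
            if tokens:
                misplaced.append(lineno)  # comment between token lines
            else:
                comments.append(line)
        else:
            tokens.append(line.split("\t"))
    return sentences
```

A stricter tool could raise an error whenever "misplaced" is non-empty instead of silently dropping those lines.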

dan-zeman commented 1 year ago

As far as I remember, comment lines were never allowed anywhere other than at the beginning of a sentence (and I personally do not feel any urge to change this practice; but maybe the documentation should be clarified).

I am pretty sure that no officially released treebank contains mid-sentence comment lines – the validator would not let them through:

[Line 7 Sent 1.104a]: [L1 Format misplaced-comment] Spurious comment line. Comments are only allowed before a sentence.

jnivre commented 1 year ago

Thanks, Milan. I stand corrected. :)

Joakim

dan-zeman commented 1 year ago

The first paragraph of the CoNLL-U Format states:

with three types of lines: ... 3. Comment lines starting with hash (#).

That part does not explicitly forbid comments from appearing anywhere in the conllu file.

The "Sentence Boundaries and Comments" section contains the text:

Comment and metadata lines inside sentences (i.e., between the token lines) are disallowed.

This is tucked away at the end of the second paragraph in this section, which makes it harder to understand where comments apply, especially in light of the initial paragraph in the spec.

The initial paragraph just listed the types of lines, while all the details including the placement of the lines were described later. I do not think there was any conflict in the wording. However, this issue is evidence that confusion was still possible, so I also modified the initial paragraph.