Closed rhdunn closed 1 year ago
What about SpacesAfter=\n? https://universaldependencies.org/misc#spacesafter
Funnily enough, we actually have the Raven in UD_English-GENTLE 😁
But I'm realizing now that we didn't encode the line breaks in the conllu format. In the underlying XML format we had them as unary XML tags:
https://github.com/gucorpling/gentle/blob/main/xml/GENTLE_poetry_raven.xml#L11
And we have used an XML MISC annotation to encode the underlying XML, as documented here. So I suppose this would be another option:
1 THE the DET DT Definite=Def|PronType=Art 2 det 2:det XML=<l n:::1/>
2 RAVEN Raven PROPN NNP Number=Sing 0 root 0:root Entity=1)|SpaceAfter=No
3 . . PUNCT . _ 2 punct 2:punct _
It's not as convenient for recovering whitespace in the plain text, but it does allow you to encode logical line and stanza numbers.
Interesting, thanks.
The SpacesAfter and XML annotations are encoding specific Markup -- e.g. you would need to know and understand the specific XML markup used, which would vary from document to document.
It may make more sense then to have a LineNumber=[number] annotation at the start of a new line. That would avoid having to place LineBreakAfter=Yes at the end of some of the sentences. The stanza number is then logically part of the paragraph metadata comments after the newpar lines, or could have a specific Stanza=[number] annotation.
Have you considered using comments to encode this information instead? Elements like lines and verses in poetry are in some ways analogous to documents and paragraphs, which are normally encoded using comments in CoNLL-U.
Joakim
Comments are only allowed at the sentence level, right? If a "sentence" contains multiple lines then this won't work - something in MISC would make sense.
I agree that structure above the level of a sentence (presumably stanza/verse numbers) should be indicated in comments.
I just realized we also have some poetry in UD_English-GUM (inside a larger interview), and we did just that - stanza in a comment, line in XML= in MISC:
(See the 'lg' and 'l' annotations below, with 'n' used for stanza and line numbers)
# sent_id = GUM_interview_messina-36
# s_prominence = 2
# s_type = q
# speaker = FrankMessina
# transition = null
# text = Do you know what it's like to be chased by the Ghost of Failure while staring through Victory's door?
# newpar
# newpar_block = p rend:::"indent" (9 s) | hi rend:::"italic" (9 s) | lg type:::"stanza" n:::"1" (2 s)
1 Do do AUX VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin 3 aux 3:aux Discourse=attribution-positive:82->83:1|XML=<l n:::"1">
2 you you PRON PRP Case=Nom|Number=Sing|Person=2|PronType=Prs 3 nsubj 3:nsubj Entity=(82-person-new-cf2-1-ana)
3 know know VERB VB VerbForm=Inf 0 root 0:root _
4 what what PRON WP PronType=Int 3 ccomp 3:ccomp Discourse=topic-question:83->85:1
5-6 it's _ _ _ _ _ _ _ _
5 it it PRON PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs 4 expl 4:expl Entity=(83-event-new-cf1-1-cata)
6 's be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 cop 4:cop _
7 like like ADP IN _ 4 case 4:case XML=</l>
8 to to PART TO _ 10 mark 10:mark Entity=(83-event-acc:com-cf1-3,8-coref|XML=<l n:::"2">
9 be be AUX VB VerbForm=Inf 10 aux:pass 10:aux:pass _
10 chased chase VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 4 csubj 4:csubj _
11 by by ADP IN _ 13 case 13:case _
12 the the DET DT Definite=Def|PronType=Art 13 det 13:det Entity=(84-abstract-new-cf3-2,4-sgl
13 Ghost Ghost PROPN NNP Number=Sing 10 obl:agent 10:obl:agent _
14 of of ADP IN _ 15 case 15:case _
15 Failure Failure PROPN NNP Number=Sing 13 nmod 13:nmod:of Entity=(85-event-new-cf4-1-sgl)84)|XML=</l>
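As a rough illustration of how a reader could recover the poetry lines from the XML= values in the example above, here is a minimal Python sketch. The helper names (parse_misc, forms_by_line) are made up for this illustration, and it assumes that each MISC cell is a |-separated list of Key=Value items, that no XML value contains a | itself, and that the caller has already skipped multiword-token rows such as 5-6.

```python
import re

def parse_misc(misc):
    """Split a MISC cell into a dict; '_' means the cell is empty."""
    result = {}
    if misc != "_":
        for item in misc.split("|"):
            key, _, value = item.partition("=")  # split on the first '=' only
            result[key] = value
    return result

# Matches <l n:::"1">, <l n:::1> and the self-closing <l n:::1/>
LINE_OPEN = re.compile(r'<l n:::"?(\d+)"?/?>')

def forms_by_line(rows):
    """rows: iterable of (form, misc) pairs; returns {line_number: [forms]}."""
    lines, current = {}, None
    for form, misc in rows:
        xml = parse_misc(misc).get("XML", "")
        opened = LINE_OPEN.search(xml)
        if opened:
            current = int(opened.group(1))
        if current is not None:
            lines.setdefault(current, []).append(form)
        if "</l>" in xml:
            current = None  # the closing tag sits on the last token of the line
    return lines

rows = [
    ("Do", 'XML=<l n:::"1">'), ("you", "_"), ("know", "_"),
    ("like", "XML=</l>"),
    ("to", 'Entity=(83-event|XML=<l n:::"2">'), ("be", "_"),
]
print(forms_by_line(rows))
```

Note that the reader still has to know the specific markup ('l' with 'n'), so this only works for treebanks that follow the same XML conventions.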
No, comments can occur anywhere. We have a few cases of mid-sentence paragraph breaks in one of the Swedish treebanks.
Joakim
Oh really? I have definitely written code that processes .conllu files assuming comments only before sentences.
We have encountered the same issue while annotating the Divine Comedy. Currently, we are using MISC to indicate the verse and the Canto (we also made sure to signal the Canto in each # sent_id). We are aware that this might not be the perfect solution.
The format spec at the top states that lines starting with # are comment lines, and does not state where they may appear.
The "Sentence Boundaries and Comments" section states that comment lines occurring before a sentence can contain metadata for the sentence in addition to plain comments.
It does not state anything about token-level comments in the middle of a sentence, nor how to interpret document-level (after newdoc) or paragraph-level (after newpar) metadata. It would be useful to support these. It also does not define generic sentence metadata (only specific metadata fields): it is assumed that a metadata line is of the form # key = value, but that is not directly stated anywhere in the format spec. It would be nice to formalize these so that processors can handle and preserve the comments correctly.
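To make the assumed "# key = value" convention concrete, here is a small Python sketch of how a processor might classify comment lines. This is only one possible reading, since the spec does not define the general form; the function name parse_comment is hypothetical, and the rule "split on the first = sign" is an assumption.

```python
def parse_comment(line):
    """Return (key, value) for '# key = value', else (text, None).

    Assumes the first '=' separates key from value, so values may
    themselves contain '=' (e.g. in '# text = ...' lines).
    """
    body = line[1:].strip()  # drop the leading '#'
    key, sep, value = body.partition("=")
    if sep and key.strip():
        return key.strip(), value.strip()
    return body, None

print(parse_comment("# sent_id = GUM_interview_messina-36"))
print(parse_comment("# newpar"))
```

With this reading, a line like "# newpar id = p1" comes out with the key "newpar id", which shows why a formal definition would help: a tool cannot tell from the syntax alone that newpar is the field and id a sub-key.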
Like so many other things in UD, comment lines have developed to serve multiple purposes. As genuine comments, they have no pre-defined meaning, can occur anywhere, and can simply be skipped by linguistic processors. However, when used to encode specific phenomena like sentence ids and paragraph breaks, they need to be standardized with respect to form and (sometimes) position. It makes perfect sense that comments encoding sentence ids occur right before the sentence starts, because we expect a one-to-one correspondence between sentence ids and sentences. Paragraph boundaries, however, are different, because they do not always coincide with sentence boundaries (for example, in the case of certain types of lists or other displayed material).
comments can occur anywhere.
The official CoNLL-U specification does not mention the possibility of comment lines appearing elsewhere than before sentences, although it is not explicitly forbidden (as noted above by @rhdunn). I would thus consider files using such undocumented features a gray zone - they are not invalid, but they are not canonical. This means that there are no guarantees about what UD-compliant tools will do with such mid-sentence comments.
For example, Udapi accepts such comments in non-canonical positions without any warnings, but when storing the files, it moves them into the canonical order (all comments precede the first token).
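To illustrate what that kind of normalization means for mid-sentence comments, here is a minimal Python sketch. It is not Udapi's code, just a hypothetical function (canonicalize) that reproduces the described effect on the lines of a single sentence block.

```python
def canonicalize(block):
    """Move every '#' line of one sentence block before the token lines.

    Token lines start with an ID, so a leading '#' is enough to
    recognise a comment. Relative order within each group is kept.
    """
    comments = [line for line in block if line.startswith("#")]
    tokens = [line for line in block if not line.startswith("#")]
    return comments + tokens

block = ["# sent_id = s1", "1\tBuy", "# newpar", "2\tmilk"]
print(canonicalize(block))
```

After this step the "# newpar" line no longer says where inside the sentence the paragraph started, which is the information loss to expect from tools that only support comments before the first token.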
There are three options:
I am quite OK with options 1 and 2.
We have a few cases of mid-sentence paragraph breaks
According to the official guidelines on annotating paragraph boundaries in UD: "When a new paragraph starts between two tokens of a sentence, the first token of the new paragraph contains the attribute NewPar=Yes in the MISC column."
I would strongly prefer using this official way rather than undocumented mid-sentence comments for mid-sentence paragraph breaks. Similarly for encoding a mid-sentence newline, I would prefer the documented way SpacesAfter=\n. If there is a need for more elaborate annotation of poems (stanzas, rhymes,...), we can of course define other ways, but I would still stay with comment lines before a sentence and MISC attributes.
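For reference, here is a minimal Python sketch of how plain text with line breaks could be rebuilt from SpaceAfter=No and SpacesAfter=\n. The function names are hypothetical, and the escape table (\s, \t, \r, \n, \p, \\) is an assumption based on the scheme documented for UDPipe output, so check it against the treebank at hand.

```python
import re

# Assumed escape scheme for SpacesAfter values
ESCAPES = {"s": " ", "t": "\t", "r": "\r", "n": "\n", "p": "|", "\\": "\\"}

def unescape(value):
    """Decode backslash escapes; unknown escapes are left untouched."""
    return re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(0)), value)

def trailing_space(misc):
    """Whitespace that follows a token, derived from its MISC cell."""
    fields = {}
    if misc != "_":
        for item in misc.split("|"):
            key, _, value = item.partition("=")
            fields[key] = value
    if "SpacesAfter" in fields:
        return unescape(fields["SpacesAfter"])
    return "" if fields.get("SpaceAfter") == "No" else " "

def detokenize(rows):
    """rows: iterable of (form, misc) pairs."""
    return "".join(form + trailing_space(misc) for form, misc in rows)

rows = [("weak", "_"), ("and", "_"), ("weary", "SpaceAfter=No"),
        (",", r"SpacesAfter=\n"), ("Over", "_")]
print(repr(detokenize(rows)))
```

This recovers the physical layout only; logical line or stanza numbers would still need a separate attribute or comment.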
I've created #970 to discuss the comments/metadata line issue.
@martinpopel
I could add that, in annotational practice, it has come out that a MISC value like (New)Par or Verse is much more handy to have repeated at each token/word if it is known to change mid-sentence, while values valid for a whole sentence (like Canto for Dante's Comedy) are manageable as sentence-comments.
in annotational practice, it has come out that a MISC value like (New)Par or Verse is much more handy to have repeated at each token/word if it is known to change mid-sentence
I think paragraphs starting in the middle of a sentence are very rare, usually in bullet lists, e.g. Buy
and bring it home.
BTW: it is questionable whether each bullet should be considered a paragraph. (Some people insist that paragraph boundaries imply sentence boundaries by definition, so they don't consider these bullets paragraphs.) I don't want to delve into this now.
So for paragraphs, I like the current option of sentence-level comment line (newpar id =) for the typical case and NewPar=Yes in MISC for the rare case.
Verses are a different story. Verses starting in the middle of a sentence are quite usual. So here I agree we can consider both options: NewVerse=Yes, or VerseNumber=42 (for all the words in the 42nd verse). I would still prefer the former because of my experience with annotational practice: imagine an annotator missed one verse boundary, so if you don't have a tool that knows about VerseNumber, you will have to renumber all the following verses manually - a nightmare. I admit @Stormur's experience with annotational practice may be different.
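The renumbering argument can be sketched in a few lines of Python: if only boundaries are annotated, the verse numbers are derived by counting, so fixing a missed boundary means adding one flag and re-running the count. NewVerse and VerseNumber are the attribute names floated in this thread, not standardized ones, and number_verses is a hypothetical helper.

```python
def number_verses(miscs, start=1):
    """Derive a verse number for each token from NewVerse=Yes flags.

    miscs: list of MISC cells in document order. Tokens before the
    first flag get start - 1.
    """
    verse = start - 1
    numbers = []
    for misc in miscs:
        if "NewVerse=Yes" in misc.split("|"):
            verse += 1
        numbers.append(verse)
    return numbers

miscs = ["NewVerse=Yes", "_", "_", "NewVerse=Yes|SpaceAfter=No", "_"]
print(number_verses(miscs))
```

With explicit VerseNumber values on every token, the same correction would touch every token after the missed boundary.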
Note also that some documents may have mixed prose and poetry, so we may need to annotate also the start and end of poem/stanza/verse... If such detailed annotation is needed, I agree with @amir-zeldes and UD_English-GENTLE that we may consider reusing the TEI annotation of poetry using the XML attribute in MISC instead of reinventing the wheel (UD-specific way of annotation not compatible with TEI).
The thing I don't like about reusing the TEI poetry annotation is that it is creating a mixed-mode/format environment where a processor needs to handle both CoNLL-U and XML to read the metadata. With a Property=Value MISC annotation, the metadata is in a form that is already handled by the parser, so the tools can then read/write the values easily and generate whatever output they want (TEI XML, HTML, text, etc.).
I'd like this not to become something like the entity annotations which are complex to parse and read, and contain multiple types of information.
For sections it may be useful to have a general # newsec [type] / # newsec [type] id = metadata field, with type being one of chapter, part, book, volume, stanza, canto, verse, or any other treebank-defined string of the form [a-z]+. That should be flexible enough to encode different document structures. In the example of The Raven, that could then use newsec stanza instead of newpar, and a LineNumber, NewLine, LineBreakBefore, or LineBreakAfter to mark up the lines.
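As a sketch of how little parsing the proposed comment would need, here is one possible Python reading of it. Note that # newsec is only a proposal in this thread and not part of CoNLL-U; the regular expression and the parse_newsec name are assumptions made for illustration.

```python
import re

# '# newsec <type>' optionally followed by 'id = <value>'; type is [a-z]+
NEWSEC = re.compile(r"^#\s*newsec\s+([a-z]+)(?:\s+id\s*=\s*(\S.*?))?\s*$")

def parse_newsec(line):
    """Return (type, id_or_None) for a newsec comment, else None."""
    match = NEWSEC.match(line)
    return (match.group(1), match.group(2)) if match else None

print(parse_newsec("# newsec stanza"))
print(parse_newsec("# newsec canto id = inferno-1"))
```

The id value inferno-1 is an invented example; the proposal does not say what ids should look like.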
I like the idea behind using a NewLine=Yes annotation.
When marking up poetry (Edgar Allan Poe, Shakespeare, etc.) -- or similar passages such as those from The Bible -- it would be useful to indicate where line breaks occur to be able to reconstruct such formatting from the annotations only.
As such, I would like to propose a LineBreakAfter MISC annotation that can have values No (default) or Yes. This works similarly to SpaceAfter, but indicates the presence of a line break instead of whitespace.
As an example, part of Edgar Allan Poe's "The Raven" could then be annotated as:
This allows the formatted text to be reconstructed using the token data alone:
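A minimal Python sketch of that reconstruction, under the assumption that LineBreakAfter=Yes takes precedence over SpaceAfter when both are present on a token. The render function and the sample rows are illustrative only, and LineBreakAfter is the attribute proposed here, not an existing UD feature.

```python
def render(rows):
    """Rebuild text from (form, misc) pairs using the proposed attribute."""
    out = []
    for form, misc in rows:
        attrs = misc.split("|")
        out.append(form)
        if "LineBreakAfter=Yes" in attrs:
            out.append("\n")      # proposed attribute: break the line here
        elif "SpaceAfter=No" not in attrs:
            out.append(" ")       # default: a single space follows the token
    return "".join(out).rstrip(" ")

rows = [("weak", "_"), ("and", "_"), ("weary", "SpaceAfter=No"),
        (",", "LineBreakAfter=Yes"), ("Over", "_"), ("many", "_")]
print(render(rows))
```

Because the attribute is a plain Key=Value item, any tool that already reads MISC can apply it without knowing about XML or escape sequences.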