UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Proposal for a LineBreakAfter MISC annotation #967

Closed rhdunn closed 1 year ago

rhdunn commented 1 year ago

When marking up poetry (Edgar Allan Poe, Shakespeare, etc.) -- or similar passages such as those from The Bible -- it would be useful to indicate where line breaks occur to be able to reconstruct such formatting from the annotations only.

As such, I would like to propose a LineBreakAfter MISC annotation that can have the values No (default) or Yes. This works similarly to SpaceAfter, but indicates the presence of a line break instead of whitespace.

As an example, part of Edgar Allan Poe's "The Raven" could then be annotated as:

# text = Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore — While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door.
1   Once    _   _   _   _   _   _   _   _
2   upon    _   _   _   _   _   _   _   _
3   a   _   _   _   _   _   _   _   _
4   midnight    _   _   _   _   _   _   _   _
5   dreary  _   _   _   _   _   _   _   SpaceAfter=No
6   ,   _   _   _   _   _   _   _   _
7   while   _   _   _   _   _   _   _   _
8   I   _   _   _   _   _   _   _   _
9   pondered    _   _   _   _   _   _   _   SpaceAfter=No
10  ,   _   _   _   _   _   _   _   _
11  weak    _   _   _   _   _   _   _   _
12  and _   _   _   _   _   _   _   _
13  weary   _   _   _   _   _   _   _   SpaceAfter=No
14  ,   _   _   _   _   _   _   _   LineBreakAfter=Yes
15  Over    _   _   _   _   _   _   _   _
16  many    _   _   _   _   _   _   _   _
17  a   _   _   _   _   _   _   _   _
18  quaint  _   _   _   _   _   _   _   _
19  and _   _   _   _   _   _   _   _
20  curious _   _   _   _   _   _   _   _
21  volume  _   _   _   _   _   _   _   _
22  of  _   _   _   _   _   _   _   _
23  forgotten   _   _   _   _   _   _   _   _
24  lore    _   _   _   _   _   _   _   _
25  —   _   _   _   _   _   _   _   LineBreakAfter=Yes
26  While   _   _   _   _   _   _   _   _
27  I   _   _   _   _   _   _   _   _
28  nodded  _   _   _   _   _   _   _   SpaceAfter=No
29  ,   _   _   _   _   _   _   _   _
30  nearly  _   _   _   _   _   _   _   _
31  napping _   _   _   _   _   _   _   SpaceAfter=No
32  ,   _   _   _   _   _   _   _   _
33  suddenly    _   _   _   _   _   _   _   _
34  there   _   _   _   _   _   _   _   _
35  came    _   _   _   _   _   _   _   _
36  a   _   _   _   _   _   _   _   _
37  tapping _   _   _   _   _   _   _   SpaceAfter=No
38  ,   _   _   _   _   _   _   _   LineBreakAfter=Yes
39  As  _   _   _   _   _   _   _   _
40  of  _   _   _   _   _   _   _   _
41  some    _   _   _   _   _   _   _   _
42  one _   _   _   _   _   _   _   _
43  gently  _   _   _   _   _   _   _   _
44  rapping _   _   _   _   _   _   _   SpaceAfter=No
45  ,   _   _   _   _   _   _   _   _
46  rapping _   _   _   _   _   _   _   _
47  at  _   _   _   _   _   _   _   _
48  my  _   _   _   _   _   _   _   _
49  chamber _   _   _   _   _   _   _   _
50  door    _   _   _   _   _   _   _   SpaceAfter=No
51  .   _   _   _   _   _   _   _   _

This allows the formatted text to be reconstructed from the token data alone:

Once upon a midnight dreary, while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore —
While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door.
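
As a rough illustration of that reconstruction, here is a minimal Python sketch. It assumes the proposed LineBreakAfter=Yes attribute (not part of UD) alongside the existing SpaceAfter=No, and the helper names `parse_misc` and `reconstruct` are made up for this example. It splits columns on any whitespace to match the listing above, whereas real CoNLL-U files are tab-separated, and it ignores multiword token ranges and empty nodes.

```python
def parse_misc(misc: str) -> dict:
    """Parse a MISC column such as 'SpaceAfter=No|LineBreakAfter=Yes' into a dict."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def reconstruct(token_lines: list) -> str:
    """Rebuild the formatted text from token lines (ID, FORM, ..., MISC)."""
    out = []
    for line in token_lines:
        cols = line.split()  # whitespace split; assumes no spaces inside a column
        form, misc = cols[1], parse_misc(cols[-1])
        out.append(form)
        if misc.get("LineBreakAfter") == "Yes":  # proposed attribute (hypothetical)
            out.append("\n")
        elif misc.get("SpaceAfter") != "No":
            out.append(" ")
    return "".join(out).rstrip()


tokens = [
    "11 weak _ _ _ _ _ _ _ _",
    "12 and _ _ _ _ _ _ _ _",
    "13 weary _ _ _ _ _ _ _ SpaceAfter=No",
    "14 , _ _ _ _ _ _ _ LineBreakAfter=Yes",
    "15 Over _ _ _ _ _ _ _ _",
    "16 many _ _ _ _ _ _ _ _",
]
print(reconstruct(tokens))
```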

nschneid commented 1 year ago

What about SpacesAfter=\n? https://universaldependencies.org/misc#spacesafter
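
For comparison, a small sketch of how a reader might decode SpacesAfter. The escape table (`\s`, `\t`, `\n`, `\r`, `\p`, `\\`) is my understanding of the convention used for this attribute and should be checked against the linked MISC page; the function names are hypothetical.

```python
import re

# Assumed escape table for SpacesAfter values; verify against the UD MISC documentation.
_ESCAPES = {"s": " ", "t": "\t", "n": "\n", "r": "\r", "p": "|", "\\": "\\"}


def decode_spaces(value: str) -> str:
    """Turn an escaped value such as r'\s\s\n' into the literal whitespace it stands for."""
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(0)), value)


def trailing_whitespace(misc: str) -> str:
    """Return the whitespace that follows a token, given its MISC column."""
    fields = dict(item.split("=", 1) for item in misc.split("|") if "=" in item)
    if "SpacesAfter" in fields:
        return decode_spaces(fields["SpacesAfter"])
    # Without SpacesAfter, fall back to SpaceAfter (a single space unless SpaceAfter=No).
    return "" if fields.get("SpaceAfter") == "No" else " "


print(repr(trailing_whitespace(r"SpacesAfter=\n")))
```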

amir-zeldes commented 1 year ago

Funnily enough, we actually have the Raven in UD_English-GENTLE 😁

https://github.com/UniversalDependencies/UD_English-GENTLE/blob/master/not-to-release/sources/GENTLE_poetry_raven.conllu

But I'm realizing now that we didn't encode the line breaks in the conllu format. In the underlying XML format we had them as unary XML tags:

https://github.com/gucorpling/gentle/blob/main/xml/GENTLE_poetry_raven.xml#L11

And we have used an XML misc comment to encode underlying XML, as documented here. So I suppose this would be another option:

1   THE the DET DT  Definite=Def|PronType=Art   2   det 2:det   XML=<l n:::1/>
2   RAVEN   Raven   PROPN   NNP Number=Sing 0   root    0:root  Entity=1)|SpaceAfter=No
3   .   .   PUNCT   .   _   2   punct   2:punct _

It's not as convenient for recovering whitespace in the plain text, but it does allow you to encode logical line and stanza numbers.

rhdunn commented 1 year ago

Interesting, thanks.

The SpacesAfter and XML annotations encode specific markup: you would need to know and understand the particular XML markup used, which would vary from document to document.

It may make more sense then to have a LineNumber=[number] annotation at the start of a new line. That would avoid having to place LineBreakAfter=Yes at the end of some of the sentences. The stanza number is then logically part of the paragraph metadata comments after the newpar lines, or could have a specific Stanza=[number] annotation.

jnivre commented 1 year ago

Have you considered using comments to encode this information instead? Elements like lines and verses in poetry are in some ways analogous to documents and paragraphs, which are normally encoded using comments in CoNLL-U.

Joakim

nschneid commented 1 year ago

Have you considered using comments to encode this information instead? Elements like lines and verses in poetry are in some ways analogous to documents and paragraphs, which are normally encoded using comments in CoNLL-U.

Comments are only allowed at the sentence level, right? If a "sentence" contains multiple lines then this won't work - something in MISC would make sense.

I agree that structure above the level of a sentence (presumably stanza/verse numbers) should be indicated in comments.

amir-zeldes commented 1 year ago

I just realized we also have some poetry in UD_English-GUM (inside a larger interview), and we did just that - stanza in a comment, line in XML= in MISC:

https://github.com/UniversalDependencies/UD_English-GUM/blob/master/not-to-release/sources/GUM_interview_messina.conllu#L844-L845

(See the 'lg' and 'l' annotations below, with 'n' used for stanza and line numbers)

# sent_id = GUM_interview_messina-36
# s_prominence = 2
# s_type = q
# speaker = FrankMessina
# transition = null
# text = Do you know what it's like to be chased by the Ghost of Failure while staring through Victory's door?
# newpar
# newpar_block = p rend:::"indent" (9 s) | hi rend:::"italic" (9 s) | lg type:::"stanza" n:::"1" (2 s)
1   Do  do  AUX VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin   3   aux 3:aux   Discourse=attribution-positive:82->83:1|XML=<l n:::"1">
2   you you PRON    PRP Case=Nom|Number=Sing|Person=2|PronType=Prs  3   nsubj   3:nsubj Entity=(82-person-new-cf2-1-ana)
3   know    know    VERB    VB  VerbForm=Inf    0   root    0:root  _
4   what    what    PRON    WP  PronType=Int    3   ccomp   3:ccomp Discourse=topic-question:83->85:1
5-6 it's    _   _   _   _   _   _   _   _
5   it  it  PRON    PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  4   expl    4:expl  Entity=(83-event-new-cf1-1-cata)
6   's  be  AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4   cop 4:cop   _
7   like    like    ADP IN  _   4   case    4:case  XML=</l>
8   to  to  PART    TO  _   10  mark    10:mark Entity=(83-event-acc:com-cf1-3,8-coref|XML=<l n:::"2">
9   be  be  AUX VB  VerbForm=Inf    10  aux:pass    10:aux:pass _
10  chased  chase   VERB    VBN Tense=Past|VerbForm=Part|Voice=Pass 4   csubj   4:csubj _
11  by  by  ADP IN  _   13  case    13:case _
12  the the DET DT  Definite=Def|PronType=Art   13  det 13:det  Entity=(84-abstract-new-cf3-2,4-sgl
13  Ghost   Ghost   PROPN   NNP Number=Sing 10  obl:agent   10:obl:agent    _
14  of  of  ADP IN  _   15  case    15:case _
15  Failure Failure PROPN   NNP Number=Sing 13  nmod    13:nmod:of  Entity=(85-event-new-cf4-1-sgl)84)|XML=</l>
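
To show what reading this encoding involves, here is a hedged Python sketch that finds where verse lines start by looking for `<l ...>` openings in the XML= value of MISC. It assumes the `:::` convention seen above (standing in for `=` inside the MISC value) and only looks at `n` on `<l>` elements; `line_starts` and the `rows` input shape are invented for the example.

```python
import re

# Matches an opening <l ...> tag carrying n:::"N" or n:::N; does not match </l> or <lg ...>.
LINE_OPEN = re.compile(r'<l\b[^>]*?\bn:::"?(\d+)"?')


def line_starts(rows):
    """Yield (token_id, line_number) for tokens whose MISC opens an <l> element."""
    for token_id, misc in rows:
        for item in misc.split("|"):
            key, _, value = item.partition("=")
            if key == "XML":
                for match in LINE_OPEN.finditer(value):
                    yield token_id, int(match.group(1))


# (ID, MISC) pairs taken from the sentence above.
rows = [
    ("1", 'Discourse=attribution-positive:82->83:1|XML=<l n:::"1">'),
    ("7", "XML=</l>"),
    ("8", 'Entity=(83-event-acc:com-cf1-3,8-coref|XML=<l n:::"2">'),
]
print(list(line_starts(rows)))
```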

jnivre commented 1 year ago

No, comments can occur anywhere. We have a few cases of mid-sentence paragraph breaks in one of the Swedish treebanks.

Joakim

nschneid commented 1 year ago

Oh really? I have definitely written code that processes .conllu files assuming comments only before sentences.

ClaudiaCorbe commented 1 year ago

We have encountered the same issue while annotating the Divine Comedy. Currently, we are using MISC to indicate the verse and the Canto (we also made sure to signal the Canto in each # sent_id). We are aware that this might not be the perfect solution.

rhdunn commented 1 year ago

The format spec at the top states that lines starting with # are comment lines, and does not state where they appear.

The "Sentence Boundaries and Comments" section states that comment lines occurring before a sentence can contain metadata for the sentence in addition to ordinary comments.

It does not state anything about token-level comments in the middle of a sentence, nor how to interpret document-level (after newdoc) or paragraph-level (after newpar) metadata. It would be useful to support these. It also does not define generic sentence metadata (only specific metadata fields): it is assumed that a metadata line is of the form # key = value, but that is not directly stated anywhere in the format spec. It would be nice to formalize these so that processors can handle and preserve the comments correctly.
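
To make the assumed convention concrete, here is a minimal sketch of how a processor might split comment lines into key/value metadata. This reflects the informal `# key = value` reading only, not anything the spec defines; `parse_comment` is a made-up name, and a free-text comment that happens to contain `=` would be misread as metadata.

```python
def parse_comment(line: str):
    """Split a comment line into (key, value); value is None when there is no '='."""
    body = line.lstrip("#").strip()
    key, sep, value = body.partition("=")
    if sep:
        return key.strip(), value.strip()
    return body, None


print(parse_comment("# sent_id = GUM_interview_messina-36"))
print(parse_comment("# newpar"))
```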

jnivre commented 1 year ago

Like so many other things in UD, comment lines have developed to serve multiple purposes. As genuine comments, they have no pre-defined meaning, can occur anywhere, and can simply be skipped by linguistic processors. However, when used to encode specific phenomena like sentence ids and paragraph breaks, they need to be standardized with respect to form and (sometimes) position. It makes perfect sense that comments encoding sentence ids occur right before the sentence starts, because we expect a one-to-one correspondence between sentence ids and sentences. Paragraph boundaries, however, are different, because they do not always coincide with sentence boundaries (for example, in the case of certain types of lists or other displayed material).

martinpopel commented 1 year ago

comments can occur anywhere.

The official CoNLL-U specification does not mention the possibility of comment lines appearing elsewhere than before sentences, although it is not explicitly forbidden (as noted above by @rhdunn). I would thus consider files using such undocumented features a gray zone - they are not invalid, but they are not canonical. This means that there are no guarantees about what UD-compliant tools will do with such mid-sentence comments.

For example, Udapi accepts such comments in non-canonical positions without any warnings, but when storing the files, it rewrites them into the canonical order (all comments precede the first token).
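
The effect of such a normalization can be sketched as follows. This is an illustration only and not Udapi's actual code; `canonicalize` is a hypothetical helper operating on the lines of a single sentence block.

```python
def canonicalize(sentence_lines):
    """Move every comment line in a sentence block in front of the token lines."""
    comments = [line for line in sentence_lines if line.startswith("#")]
    tokens = [line for line in sentence_lines if not line.startswith("#")]
    return comments + tokens


block = [
    "# sent_id = s1",
    "1\tBuy\t_\t_\t_\t_\t_\t_\t_\t_",
    "# newpar",  # mid-sentence comment (non-canonical position)
    "2\tmilk\t_\t_\t_\t_\t_\t_\t_\t_",
]
for line in canonicalize(block):
    print(line)
```

After this step the mid-sentence comment sits before token 1, so the information about where it originally occurred is lost.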

There are three options:

  1. Keep the status quo, i.e. mid-sentence comments being undocumented (gray zone).
  2. Explicitly forbid mid-sentence (and end-sentence) comments in the documentation and in the validator.
  3. Explicitly allow mid-sentence comments in the documentation and require all UD-compliant tools to keep the position of such comments when editing CoNLL-U files (the tools still do not need to interpret the content of such comments, unless the documentation defines some official/compulsory mid-sentence comments). Note that this would add extra burden on authors of all UD-compliant tools (add data structures for storing the position of each comment, decide what to do if some words preceding the comment are deleted etc.).

I am quite OK with options 1 and 2.

We have a few cases of mid-sentence paragraph breaks

According to the official guidelines on annotating paragraph boundaries in UD: "When a new paragraph starts between two tokens of a sentence, the first token of the new paragraph contains the attribute NewPar=Yes in the MISC column."

I would strongly prefer using this official way rather than undocumented mid-sentence comments for mid-sentence paragraph breaks. Similarly for encoding a mid-sentence newline, I would prefer the documented way SpacesAfter=\n. If there is a need for more elaborate annotation of poems (stanzas, rhymes,...), we can of course define other ways, but I would still stay with comment lines before a sentence and MISC attributes.

rhdunn commented 1 year ago

I've created #970 to discuss the comments/metadata line issue.

Stormur commented 1 year ago

@martinpopel

I could add that, in annotational practice, it has come out that a MISC value like (New)Par or Verse is much more handy to have repeated at each token/word if it is known to change mid-sentence, while values valid for a whole sentence (like Canto for Dante's Comedy) are manageable as sentence-comments.

martinpopel commented 1 year ago

in annotational practice, it has come out that a MISC value like (New)Par or Verse is much more handy to have repeated at each token/word if it is known to change mid-sentence

I think paragraphs starting in the middle of a sentence are very rare, usually in bullet lists, e.g. Buy

and bring it home.

BTW: it is questionable whether each bullet should be considered a paragraph. (Some people insist that paragraph boundaries imply sentence boundaries by definition, so they don't consider these bullets paragraphs.) I don't want to delve into this now.

So for paragraphs, I like the current option of sentence-level comment line (newpar id =) for the typical case and NewPar=Yes in MISC for the rare case.

Verses are a different story. Verses starting in the middle of a sentence are quite usual. So here I agree we can consider both options

I would still prefer the former because of my experience with annotational practice: imagine an annotator missed one verse boundary, so if you don't have a tool that knows about VerseNumber, you will have to renumber all the following verses manually - a nightmare. I admit @Stormur's experience with annotational practice may be different.

Note also that some documents may have mixed prose and poetry, so we may need to annotate also the start and end of poem/stanza/verse... If such detailed annotation is needed, I agree with @amir-zeldes and UD_English-GENTLE that we may consider reusing the TEI annotation of poetry using the XML attribute in MISC instead of reinventing the wheel (UD-specific way of annotation not compatible with TEI).

rhdunn commented 1 year ago

The thing I don't like about reusing the TEI poetry annotation is that it is creating a mixed-mode/format environment where a processor needs to handle both CoNLL-U and XML to read the metadata. With a Property=Value MISC annotation, the metadata is in a form that is already handled by the parser, so the tools can then read/write the values easily and generate whatever output they want (TEI XML, HTML, text, etc.).

I'd like this not to become something like the entity annotations which are complex to parse and read, and contain multiple types of information.

For sections it may be useful to have a general # newsec [type]/# newsec [type] id = metadata field with type being one of chapter, part, book, volume, stanza, canto, verse, or any other treebank-defined string of the form [a-z]+. That should be flexible enough to encode different document structures. In the example of The Raven, that could then use newsec stanza instead of newpar, and a LineNumber, NewLine, LineBreakBefore, or LineBreakAfter to mark up the lines.
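
A sketch of how a tool could read such a field, purely to illustrate the idea: the `# newsec` syntax is only the proposal above, not an existing UD convention, and `parse_newsec` and the example ids are hypothetical.

```python
import re

# Proposed (hypothetical) form: "# newsec <type>" or "# newsec <type> id = <value>".
NEWSEC = re.compile(r"^#\s*newsec\s+([a-z]+)(?:\s+id\s*=\s*(\S+))?\s*$")


def parse_newsec(line: str):
    """Return (section_type, section_id) for a newsec comment, or None otherwise."""
    match = NEWSEC.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_newsec("# newsec stanza"))
print(parse_newsec("# newsec canto id = inferno-1"))
print(parse_newsec("# newpar"))
```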

I like the idea behind using a NewLine=Yes annotation.