Open martinpopel opened 2 years ago
Yes, this annotation type is new in GUM. I recently pinged @dan-zeman for feedback on this and I'm waiting to hear his thoughts.
I don't particularly mind renaming this annotation and/or changing its internal syntax in some way, but the root issue here is that the basic # newpar annotation works like a "page-break" indicator: it just says that there is some kind of boundary at that point. However, paragraphs and similar elements, such as headings or bulleted lists, are actually block elements, which can have an extent of multiple sentences, and can nest:
<p>This is some paragraph: (contains a total of 15 sentences)
  <list> (14 sentences remaining)
    <item>... (3 sentences)</item>
    <item><list> (the remaining 11 sentences in this huge paragraph)
      <item> (5 sentences) </item>
      <item> (6 sentences) </item>
    </list></item>
  </list>
</p>
So just saying # newpar is insufficient, because when the second "item" opens, if we say # newpar again, it looks like the first block has closed, when in fact it contains the second paragraph opening...
The current notation is not necessarily the most elegant solution and I'm happy to think about alternatives, but it has the advantage of allowing us to deterministically reconstruct the nested block structure of the original data.
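One way to picture the deterministic reconstruction: if every block opener carries its extent in sentences, a stack is enough to recover where each block closes. The following is a minimal Python sketch, not GUM's actual build code. The function name nest_blocks and the input format (a dict from 1-based sentence index to (name, extent) pairs) are assumptions made for illustration, as is rendering zero-sentence blocks as empty elements.

```python
from typing import Dict, List, Tuple


def nest_blocks(openers: Dict[int, List[Tuple[str, int]]], n_sentences: int) -> List[str]:
    """Turn per-sentence block openers with sentence extents into nested events."""
    events: List[str] = []
    stack: List[list] = []  # each entry: [block name, sentences still to come]
    for i in range(1, n_sentences + 1):
        for name, extent in openers.get(i, []):
            if extent == 0:
                # Assumption: a zero-sentence block is emitted as an empty element.
                events.append(f"<{name}/>")
                continue
            if stack and extent > stack[-1][1]:
                raise ValueError(f"block {name!r} would outlast its parent")
            stack.append([name, extent])
            events.append(f"<{name}>")
        events.append(f"s{i}")  # placeholder for the sentence itself
        for entry in stack:
            entry[1] -= 1
        # Close every block whose extent is used up, innermost first.
        while stack and stack[-1][1] == 0:
            events.append(f"</{stack.pop()[0]}>")
    if stack:
        raise ValueError("some blocks extend past the last sentence")
    return events


demo = nest_blocks({1: [("p", 3)], 2: [("list", 2), ("item", 1)], 3: [("item", 1)]}, 3)
print(" ".join(demo))
```

A bare boundary marker gives the stack nothing to count down, which is why the second "item" opening cannot be told apart from the first block closing.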
Adding @lauren-lizzy-levine who implemented the current solution, in case you have any thoughts on this!
My 2 cents: List items aren't necessarily paragraphs. They can be construed as within a paragraph, or as containing paragraphs. Given the markup in your example, why not have separate directives for list items, e.g. # newlistitem and # endlist to express nesting structure? As long as # newpar has a standard format, will it be a problem to have these extra directives?
The primary issue here is that the current GUM newpar annotation is not valid (based on my reading of the CoNLL-U specification). I understand the motivation for adding annotations from which the XML tags can be deterministically reconstructed and I understand that lists and paragraphs can be nested (both ways, as @nschneid pointed out).
The secondary issue is that XML tags spanning whole sentences are required to be annotated using newpar. So what if there is a single sentence enclosed in an XML tag in the middle of a paragraph?
A simple solution for both of these issues would be to use a different attribute than newpar for the XML annotations, e.g. xml. The annotation could remain exactly the same and newpar would be used only when a new paragraph starts. Another solution would be to use the token-level XML annotations also for XML tags spanning whole sentences (which could make the reconstruction easier for the users, as only one mechanism would need to be implemented).
There are several other issues with the current GUM XML annotation, which I did not plan to discuss here, but in the end why not.
* ... colo<b>u</b>r. I admit this is rare and can be ignored.
* ... <hi rend="italic">Always Keep Fighting</hi>, it seems that opening XML tags should go before the token and closing tags after the token, but what about self-closing tags, such as <br/> or <img/>? (I guess the XML annotation should work for any XML, not only TEI.) And what if there is an empty non-self-closing element, i.e. the closing tag immediately following the opening tag (e.g. <img></img>)?
I think I can answer these questions:
* ... # endlist, but these would then have to appear at the end of the list (i.e. after the last sentence), which would be problematic if the list ends at the end of the document and there is no subsequent sentence (unless we allow trailing comment annotations, which we don't use so far)
* ... XML, which is documented here. The annotation we are talking about here only applies to block elements which neatly nest sentences
* ... XML open before the token of the line where an opener appears, and close after the token of the line a closer appears on. See the explanation here
* ... # newpar tag, as seen here with an extent of (0 s) (so they encompass zero sentences)
* ... SpaceAfter annotation represents the source text disregarding XML, and we assume that XML is just added around the text with zero additional spaces.
Thanks for the explanations, Amir. <b>I still don't understand how you would represent a tag around a single sentence inside a paragraph.</b>
Another question is about a list of <list><item>one</item><item>or more</item></list> items inside a sentence.
For the first question we have many cases in the corpus - but the answer depends on whether it is a block tag which never breaks sentence hierarchy, or a coincidental one spanning a sentence, such as <b>. In the first case, it would be added to the pipe notation of the GUM rendition of # newpar:
# newpar = p (1 s) | someblocktag (1 s)
In the second case we would use the XML misc annotation, under the understanding that although in this case it coincidentally spans a whole sentence, this is not really a block tag, and therefore should not be included in newpar:
# newpar = p (1 s)
1 ... XML=<b>
...
23 XML=</b>
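To make the token-level mechanism concrete, here is a hedged Python sketch of restoring inline tags from an XML= value in MISC. It rests on assumptions not confirmed in this thread: that the value holds one or more literal tags, that MISC fields are separated by | and never contain a literal | themselves, and that anything not starting with </ (including self-closing tags) goes before the token. The name render_inline is hypothetical.

```python
import re
from typing import List, Tuple

TAG_RE = re.compile(r"<[^>]+>")


def render_inline(tokens: List[Tuple[str, str]]) -> str:
    """tokens: (FORM, MISC) pairs; returns the text with inline tags restored."""
    out = []
    for form, misc in tokens:
        # Split MISC into key=value fields; only the first "=" separates the key.
        fields = dict(f.partition("=")[::2] for f in misc.split("|") if "=" in f)
        tags = TAG_RE.findall(fields.get("XML", ""))
        openers = [t for t in tags if not t.startswith("</")]
        closers = [t for t in tags if t.startswith("</")]
        # Openers go before the token, closers after it, with no extra spaces.
        out.append("".join(openers) + form + "".join(closers))
        if fields.get("SpaceAfter") != "No":
            out.append(" ")
    return "".join(out).rstrip()


sentence = [
    ("Always", 'XML=<hi rend="italic">'),
    ("Keep", "_"),
    ("Fighting", "XML=</hi>|SpaceAfter=No"),
    (".", "_"),
]
print(render_inline(sentence))
```

Because SpaceAfter is taken to describe the text without markup, the tags are concatenated directly onto the token and spacing is decided afterwards.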
For the second question, the answer is perhaps less satisfying: if a tag has been designated as a 'block' by the scheme, then sentence boundaries may not cross it. In other words, if we decide that
* You will need:
* 200 g flour
* 1 cup water
* two forks
* a clean surface
It is likely that this decision was influenced by the early inclusion of how-to guides in the corpus, which sometimes have very long nested lists without overt coordination between the bullets, but you can also get weird paradoxes in textbooks, things like:
LEARNING OBJECTIVES
By the end of this section you will be able to:
* explain the principles of theory X. What are the main reasons why scholars assumed Y?
* locate the main factor in a Z diagram. How can we distinguish ABC?
* find ...
In this example, both bullets 1 and 2 include infinitives that look like complements of "able to", but there are intervening sentences that mess this up. So to avoid all these kinds of contortions, every bullet in GUM is always an independent sentence. However, this does not apply to non-block tags, so other types of XML markup can occur sentence-medially. It does apply to headings, paragraphs, captions etc., so a sentence can never begin in a heading and carry on into the paragraph, even if that seems syntactically right (though I have no such example - I think <item> is the only case I can recall, though most of them are truly separate syntagms anyway).
Thanks again for the explanations.
XML block tags spanning whole sentences ...
I missed the meaning of block when reading this for the first time. So if the XML tags are derived from HTML, we can distinguish block and inline elements. (GUM uses only TEI-derived XML tags, which can also be divided into block and inline.) Block elements can appear only at paragraph boundaries, or in other words paragraph boundaries are defined by the presence of one or more block-level tags. Inline elements are always annotated using XML= in MISC, even if they span a whole sentence. OK, this makes sense.
List items (<li>), definition lists (<dt>, <dd>) and <address> may not necessarily always mark a new sentence (and a new paragraph), but I understand it is not always easy/possible to decide (esp. with such "weird paradoxes"). If the annotators decide that "You will need: * 200 g flour..." should be a single sentence, they could use the XML= way of annotating the list structure.
So the only remaining question is the original one: what should we do?
* ... newpar.
* ... newpar only for the original purpose and move the XML markup elsewhere:
# xml = list type:::"ordered" (4 s) | item n:::"1" (1 s)
* ... XML=).
Yes, it's exactly as you described!
I don't feel passionately about which path to take, but I should perhaps explain why we didn't choose the second and third options you proposed: My initial instinct was to take the last one - represent everything in a single way using MISC and not bother with newpar at all.
But the main reason we went with newpar for the block elements in the end was out of respect for the CoNLL-U format - it already has a place intended for expressing paragraph transitions, so not including this information as intended and putting it somewhere else seemed to be going against the standard for no good reason. Why should a uniform CoNLL-U reader have a hard time figuring out where paragraphs are in GUM?
The other question of whether to simplify (only list newpar) or add all of the information also seemed to point towards adding it, because in reality newpars actually represent nestable blocks, and this is something we want to push CoNLL-U to allow us to represent in the future.
Finally, in terms of doing both (plain newpar AND represent blocks in more detail in XML), I would point out that double inclusion of information is never a good idea, since we can have corrupt, conflicting information (if there is XML <p> but no newpar, is there a paragraph break or not?). I'm also not sure how to represent zero-sentence blocks, such as images, which work well with the pipe notation (# newpar = figure (0 s) | p (10 s)) but would require some other trick for the XML in MISC. I'm sure we would figure it out either way, but I for one would like a standard CoNLL-U way to express potentially nested newpar block extents.
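For readers who want to consume the pipe notation, a minimal Python parser could look like the sketch below. The grammar is inferred only from the examples quoted in this thread (name, optional key:::"value" attributes, then an extent in the form (N s), with blocks separated by |); the README may define more than that. The names Block and parse_block_comment are hypothetical, and attribute values containing | or a double quote are not handled.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed shape of one block: NAME [KEY:::"VALUE" ...] (N s)
BLOCK_RE = re.compile(r'^(?P<name>\S+)(?P<attrs>(?:\s+\S+?:::"[^"]*")*)\s+\((?P<n>\d+) s\)$')
ATTR_RE = re.compile(r'(\S+?):::"([^"]*)"')


@dataclass
class Block:
    name: str
    sentences: int
    attrs: Dict[str, str] = field(default_factory=dict)


def parse_block_comment(line: str) -> List[Block]:
    """Parse a GUM-style '# newpar = ...' comment into a list of blocks."""
    key, sep, value = line.partition("=")
    if not sep or key.strip() != "# newpar":
        raise ValueError(f"not an extended newpar comment: {line!r}")
    blocks = []
    for part in value.split("|"):
        m = BLOCK_RE.match(part.strip())
        if m is None:
            raise ValueError(f"unrecognised block: {part!r}")
        attrs = dict(ATTR_RE.findall(m.group("attrs")))
        blocks.append(Block(m.group("name"), int(m.group("n")), attrs))
    return blocks


blocks = parse_block_comment('# newpar = list type:::"ordered" (4 s) | item n:::"1" (1 s)')
print(blocks)
```

The zero-sentence case from the example above, figure (0 s), parses like any other block and simply yields an extent of 0.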
From the name newpar I would not expect the field to contain all information about nested block structure such as lists. I don't see a problem with using newpar just for paragraph delimiters (for some document processing purposes, this may be sufficient), and other fields for richer block structure for those users who need it.
From my perspective the way to avoid corruption is with validation scripts, which GUM has anyway. :) If most UD treebanks had this rich XML structure it might be a different story, but assuming that it's just a few treebanks and they may have different kinds of XML, I would leave the standard fields like newpar as simple/uniform as possible.
I proposed the newdoc and newpar sentence-level comments before releasing the data for the CoNLL 2017 parsing shared task (it was a bit of a rush, as I realized at the last minute that some corpora have this info and it can improve plain text generation from the data). The original idea was that a document is a sequence of paragraphs, and a paragraph is a sequence of sentences, no recursion. Then it became a bit more complicated when several people said they had "paragraph boundaries" inside sentences; in this perspective, you could read "paragraph boundary" as an obligatory line break.
Admittedly, the standard might look a bit different if it were a part of the original CoNLL-U specification and were discussed more thoroughly. But this is what we have, and it has been implemented in a number of treebanks in the meantime.
I think I prefer to leave newdoc and newpar as they are now. The need to encode a complex document structure does not seem to be a thing for most UD treebanks, so I would keep it separate. Remember, the UD standard is not meant to encode everything (although it tries to be flexible enough to allow arbitrary treebank-specific annotations where desired).
OK, it sounds like the consensus is that everyone would like to keep newpar as a flat, milestone-style annotation that just indicates a split, so I can implement that and rename our more complex annotation to something else.
That said, I have seen numerous occasions where people have used UD data to reconstruct plain text representations of datasets, so I think that UD should make a recommendation about how nested blocks like headings and item lists should be represented. That way people who can and want to preserve this information will be able to do so in a consistent way, and ultimately having this information is useful for parsing, tagging and more (e.g. being inside a heading totally alters probabilities for POS tags and trees, not to mention sentence splitting and tokenization issues).
For the solution that is adopted, however it looks, I would continue to argue that redundancies are potentially dangerous. Nathan is right that GUM has a build bot and validations which would prevent conflicts, but not everyone does, so for a future-proof recommendation, building it the right way is still advisable.
The CoNLL-U specification says
However, GUM exploits the newpar lines for other kinds of markup such as
# newpar = p (1 s)
or
# newpar = list type:::"ordered" (4 s) | item n:::"1" (1 s)
as described in the README.
Udapi strictly follows the CoNLL-U specification and allows only
^# (newpar|newdoc)(?:\s+id\s*=\s*(.+))?
(So udapy -s < in.conllu > out.conllu results in deleting the extra markup and keeping just # newpar.)
So what should we do?
* ... newpar.
* ... newpar for the XML annotations and keep newpar just for the original purpose. And possibly improve validate.py to check newpar using the above-mentioned regex.
I would prefer the latter because there may be other toolkits (not only Udapi) or one-liners which expect that newpar contains just the paragraph id (or nothing). Also, explaining the semantics of the XML-enhanced newpar would make the CoNLL-U specification too long/complicated (and allowing it without explaining the semantics seems strange, although I admit there could be a link "see GUM docs for details").
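A check along the lines suggested for validate.py could be sketched in Python as follows. This is not the actual validate.py or Udapi code. The regex is the one quoted above, but the helper name is_plain_boundary and the choice of re.fullmatch are assumptions: the quoted pattern has no end anchor, so a prefix match alone would still accept the GUM-style lines.

```python
import re

# Pattern quoted in this thread; the full-line requirement is added via fullmatch().
BOUNDARY_RE = re.compile(r"^# (newpar|newdoc)(?:\s+id\s*=\s*(.+))?")


def is_plain_boundary(line: str) -> bool:
    """True if the comment is a bare newpar/newdoc, optionally with an id."""
    return BOUNDARY_RE.fullmatch(line) is not None


for comment in ["# newpar", "# newpar id = p1", "# newdoc id = doc1", "# newpar = p (1 s)"]:
    print(comment, "->", is_plain_boundary(comment))
```

Only the last comment is rejected, since after "# newpar" the pattern allows nothing but an optional id.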