newpar annotation does not follow the CoNLL-U specification

UniversalDependencies / UD_English-GUM

newpar annotation does not follow the CoNLL-U specification #42

Open martinpopel opened 2 years ago

martinpopel commented 2 years ago

When a paragraph starts at a sentence boundary, the first sentence of the paragraph contains a comment that says # newpar, which can optionally be followed by a paragraph id (# newpar id = wsj2012-01-05-p1).

However, GUM exploits the newpar lines for other kinds of markup such as # newpar = p (1 s) or # newpar = list type:::"ordered" (4 s) | item n:::"1" (1 s) as described in the README.

Udapi strictly follows the CoNLL-U specification and allows only ^# (newpar|newdoc)(?:\s+id\s*=\s*(.+))?. (So running udapy -s < in.conllu > out.conllu deletes the extra markup and keeps just # newpar.)

So what should we do?

- Edit the CoNLL-U specification (and Udapi implementation) to allow XML markup in newpar.
- In future GUM releases, choose a different attribute than newpar for the XML annotations and keep newpar just for its original purpose. Possibly also improve validate.py to check newpar using the above-mentioned regex.

I would prefer the latter because there may be other toolkits (not only Udapi) or one-liners which expect newpar to contain just the paragraph id (or nothing). Also, explaining the semantics of the XML-enhanced newpar would make the CoNLL-U specification too long/complicated (and allowing it without explaining the semantics seems strange, although I admit there could be a link "see GUM docs for details").
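The kind of check proposed above could be sketched as follows. This is only an illustration, not the actual validate.py or Udapi code: the function name and the sample lines (including the newdoc id) are made up, and a trailing `\s*$` is added to the quoted pattern as an assumption, because without an end anchor a prefix match would still accept lines with extra material after the keyword.

```python
import re

# Pattern quoted in the thread, with a trailing `\s*$` added here (an assumption)
# so that extra material after the keyword is rejected rather than ignored.
SPEC_COMMENT = re.compile(r"^# (newpar|newdoc)(?:\s+id\s*=\s*(.+?))?\s*$")
# Any comment line that starts with the newpar/newdoc keyword.
KEYWORD = re.compile(r"^# (newpar|newdoc)\b")


def nonconforming_comments(lines):
    """Yield (line_number, line) for newpar/newdoc comments that fail the pattern."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if KEYWORD.match(line) and not SPEC_COMMENT.match(line):
            yield number, line


# Hypothetical sample input; the last two lines use the GUM-style markup.
sample = [
    "# newdoc id = GUM_example",
    "# newpar",
    "# newpar id = wsj2012-01-05-p1",
    "# newpar = p (1 s)",
    '# newpar = list type:::"ordered" (4 s) | item n:::"1" (1 s)',
]
for number, line in nonconforming_comments(sample):
    print(f"line {number}: {line}")
```

With this sample, only the two GUM-style lines are reported; the plain # newpar and the id-bearing variants pass.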

amir-zeldes commented 2 years ago

Yes, this annotation type is new in GUM. I recently pinged @dan-zeman for feedback on this and I'm waiting to hear back about his thoughts.

I don't particularly mind renaming this annotation and/or changing its internal syntax in some way, but the root issue here is that the basic # newpar annotation works like a "page-break" indicator: it just says that there is some kind of boundary at that point. However, paragraphs and similar elements, such as headings or bulleted lists, are actually block elements, which can have an extent of multiple sentences, and can nest:

<p>This is some paragraph: (contains a total of 15 sentences)
  <list> (14 sentences remaining)
    <item>... (3 sentences)</item>
    <item><list> (the remaining 11 sentences in this huge paragraph)
      <item> (5 sentences) </item>
      <item> (6 sentences) </item>
    </list></item>
  </list>
</p>

So just saying # newpar is insufficient, because when the second "item" opens, if we say # newpar again, it looks like the first block has closed, when in fact it contains the second paragraph opening...

The current notation is not necessarily the most elegant solution and I'm happy to think about alternatives, but it has the advantage of allowing us to deterministically reconstruct the nested block structure of the original data.
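To illustrate how sentence extents allow the nesting to be rebuilt deterministically, here is a rough sketch. It is not GUM's actual conversion code: the parsing rules are inferred only from the examples quoted in this thread (tag name, optional name:::"value" attributes, extent in parentheses, blocks separated by " | "), and the function names and sample sentences are hypothetical.

```python
import re

# One block in the pipe notation: tag name, optional attr:::"value" pairs, extent.
BLOCK = re.compile(r'^(\w+)((?:\s+\w+:::"[^"]*")*)\s+\((\d+) s\)$')
ATTR = re.compile(r'(\w+):::"([^"]*)"')


def parse_blocks(comment):
    """Parse '# newpar = a (2 s) | b (1 s)' into (tag, attrs, extent) tuples."""
    value = comment.split("=", 1)[1]
    blocks = []
    for part in value.split(" | "):
        match = BLOCK.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognized block: {part!r}")
        tag, attr_text, extent = match.groups()
        blocks.append((tag, dict(ATTR.findall(attr_text)), int(extent)))
    return blocks


def rebuild(sentences):
    """sentences: list of (newpar_comment_or_None, text). Returns lines of XML."""
    out, stack = [], []  # each stack entry is [tag, sentences_remaining]
    for comment, text in sentences:
        for tag, attrs, extent in (parse_blocks(comment) if comment else []):
            attr_text = "".join(f' {k}="{v}"' for k, v in attrs.items())
            indent = "  " * len(stack)
            if extent == 0:
                # Zero-sentence block: emitted as a self-closing milestone.
                out.append(f"{indent}<{tag}{attr_text}/>")
            else:
                out.append(f"{indent}<{tag}{attr_text}>")
                stack.append([tag, extent])
        out.append("  " * len(stack) + text)
        # The sentence counts against every block that is currently open.
        for entry in stack:
            entry[1] -= 1
        # Close the innermost blocks whose extent is used up.
        while stack and stack[-1][1] == 0:
            tag, _ = stack.pop()
            out.append("  " * len(stack) + f"</{tag}>")
    return out


# Hypothetical three-sentence document: a paragraph containing a two-item list.
doc = [
    ("# newpar = p (3 s)", "Sentence 1"),
    ('# newpar = list type:::"unordered" (2 s) | item (1 s)', "Sentence 2"),
    ("# newpar = item (1 s)", "Sentence 3"),
]
print("\n".join(rebuild(doc)))
```

The stack is what a plain # newpar cannot replace: the second item opens while the list and the paragraph are still counting down, so nothing is closed prematurely.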

Adding @lauren-lizzy-levine who implemented the current solution, in case you have any thoughts on this!

nschneid commented 2 years ago

My 2 cents: List items aren't necessarily paragraphs. They can be construed as within a paragraph, or as containing paragraphs. Given the markup in your example, why not have separate directives for list items, e.g. # newlistitem and # endlist to express nesting structure? As long as # newpar has a standard format, will it be a problem to have these extra directives?

martinpopel commented 2 years ago

The primary issue here is that the current GUM newpar annotation is not valid (based on my reading of the CoNLL-U specification). I understand the motivation for adding annotations from which the XML tags can be deterministically reconstructed and I understand that lists and paragraphs can be nested (both ways, as @nschneid pointed out).

The secondary issue is that XML tags spanning whole sentences are required to be annotated using newpar. So what if there is a single sentence enclosed in an XML tag in the middle of a paragraph?

A simple solution for both of these issues would be to use a different attribute than newpar for the XML annotations, e.g. xml. The annotation could remain exactly the same and newpar would be used only when a new paragraph starts. Another solution would be to use the token-level XML annotations also for XML tags spanning whole sentences (which could make the reconstruction easier for the users, as only one mechanism would need to be implemented).

There are several other issues with the current GUM XML annotation, which I did not plan to discuss here, but in the end why not.

- It does not allow reconstructing XML tags inside tokens, e.g. colour. I admit this is rare and can be ignored.
- I don't understand how the current token-level XML annotation distinguishes whether a given XML tag is before or after a given token. Based on the example with <hi rend="italic">Always Keep Fighting</hi>, it seems that opening XML tags should go before the token and closing tags after the token, but what about self-closing tags, such as   or <img/>? (I guess the XML annotation should work for any XML, not only TEI.) And what if there is an empty non-self-closing element, i.e. the closing tag immediately following the opening tag (e.g. <img></img>)?
- What if there is a space after the opening tag or before the closing tag? Can this be represented and reconstructed?

amir-zeldes commented 2 years ago

I think I can answer these questions:

- Yes, it is absolutely possible for items to nest multiple paragraphs, which nest other lists, and so on. Here is an example, and as you can see the current format is adequate for expressing this.
- We thought about things like # endlist, but these would then have to appear at the end of the list (i.e. after the last sentence), which would be problematic if the list ends at the end of the document and there is no subsequent sentence (unless we allow trailing comment annotations, which we don't use so far).
- For non-block XML tags, which can appear in the middle of a sentence as @martinpopel pointed out, we use a MISC annotation called XML, which is documented here. The annotation we are talking about here only applies to block elements which neatly nest sentences.
- @martinpopel is correct that we have no way of representing token-internal markup, which is fairly rare; there are maybe 5-10 cases in the corpus, and they are represented semi-informally like this.
- What you say is correct: opening tags in MISC XML open before the token of the line where an opener appears, and closing tags close after the token of the line a closer appears on. See the explanation here.
- Empty or milestone tags, such as an empty image without a caption (otherwise it surrounds the caption in GUM), do occur, but are always before or after some sentence, so they are encoded in the current # newpar tag, as seen here, with an extent of (0 s) (so they encompass zero sentences).
- While I agree that it would be nice to be able to express any XML and not only TEI, the notation we use does not have that aspiration yet, since XML is truly vast (thinking about xpointers, processing instructions, XML comments...). Right now we are only concerned with representing XML as used in GUM, which does probably cover most manually edited use cases of XML to encode document structure (but certainly not any random XML fragment from the Internet). GUM's XML vocabulary is closed (see here) and validated via .xsd, and we do a round-trip conversion test to validate the conllu representation, so we can be reasonably confident that it is adequate for now.
- We do not encode XML whitespace before/after tags on top of linguistic whitespace. In other words, the SpaceAfter annotation represents the source text disregarding XML, and we assume that XML is just added around the text with zero additional spaces.
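The token-level convention described above (openers placed before the token, closers after it, whitespace taken only from SpaceAfter) can be illustrated with a small sketch. This is not GUM's own tooling: the function names and token rows are invented for illustration, and it assumes that an XML value may hold one or more complete tags and that MISC attributes are separated by "|". Self-closing tags are simply treated like openers here, which is a guess rather than documented behavior.

```python
import re

TAG = re.compile(r"<[^>]+>")


def misc_to_dict(misc):
    """Split a MISC cell such as 'XML=<b>|SpaceAfter=No' into a dict."""
    if misc == "_":
        return {}
    # Split on the first '=' only, since XML attribute values contain '=' too.
    return dict(item.split("=", 1) for item in misc.split("|"))


def render(tokens):
    """tokens: list of (form, misc) pairs. Returns text with inline tags restored."""
    pieces = []
    for form, misc in tokens:
        fields = misc_to_dict(misc)
        tags = TAG.findall(fields.get("XML", ""))
        openers = [t for t in tags if not t.startswith("</")]
        closers = [t for t in tags if t.startswith("</")]
        pieces.extend(openers)  # opening tags go before the token
        pieces.append(form)
        pieces.extend(closers)  # closing tags go after the token
        if fields.get("SpaceAfter") != "No":
            pieces.append(" ")  # XML adds no whitespace of its own
    return "".join(pieces).rstrip()


# Hypothetical token rows (FORM and MISC columns only).
tokens = [
    ("Always", 'XML=<hi rend="italic">'),
    ("Keep", "_"),
    ("Fighting", "XML=</hi>|SpaceAfter=No"),
    (".", "_"),
]
print(render(tokens))
```

Because closers are emitted before the SpaceAfter decision, the closing tag attaches directly to its token, matching the "zero additional spaces" assumption.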

martinpopel commented 2 years ago

Thanks for the explanations, Amir. I still don't understand how you would represent a tag around a single sentence inside a paragraph. Another question is about a list of <list><item>one</item><item>or more</item></list> items inside a sentence.

amir-zeldes commented 2 years ago

For the first question we have many cases in the corpus, but the answer depends on whether it is a block tag which never breaks sentence hierarchy, or a coincidental one spanning a sentence, such as <b>. In the first case, it would be added to the pipe notation of the GUM rendition of # newpar:

# newpar = p (1 s) | someblocktag (1 s)

In the second case we would use the XML misc annotation, under the understanding that although in this case it coincidentally spans a whole sentence, this is not really a block tag, and therefore should not be included in newpar:

# newpar = p (1 s)
1 ... XML=<b>
...
23 ... XML=</b>

For the second question, the answer is perhaps less satisfying: if a tag has been designated as a 'block' by the scheme, then sentence boundaries may not cross it. In other words, if we decide that is a block (and in GUM it is), then even if there is a syntactic structure that can be interpreted as a sentence spanning it, we split it into multiple sentences. This was done in the interest of consistency, since it is often hard to decide if list items forms one sentence or an enumeration of fragments - GUM consistently takes the latter view, e.g. this would be 5 sentences in GUM, even though we could consider the sublist to be an object of the verb 'need':

  * You will need:
    * 200 g flour
    * 1 cup water
    * two forks
    * a clean surface

It is likely that this decision was influenced by the early inclusion of how-to guides in the corpus, which sometimes have very long nested lists without overt coordination between the bullets, but you can also get weird paradoxes in textbooks, things like:

LEARNING OBJECTIVES 

By the end of this section you will be able to:

  * explain the principles of theory X. What are the main reasons why scholars assumed Y?
  * locate the main factor in a Z diagram. How can we distinguish ABC?
  * find ...

In this example, both bullets 1 and 2 include infinitives that look like complements of "able to", but there are intervening sentences that mess this up. So to avoid all these kinds of contortions, each bullet in GUM always starts an independent sentence. However, this does not apply to non-block tags, so other types of XML markup can occur sentence-medially. It does apply to headings, paragraphs, captions etc., so a sentence can never begin in a heading and carry on into the paragraph, even if that seems syntactically right (though I have no such example - I think <item> is the only case I can recall, though most of them are truly separate syntagms anyway).

martinpopel commented 2 years ago

Thanks again for the explanations.

XML block tags spanning whole sentences ...

I missed the meaning of block when reading this for the first time. So if the XML tags are derived from HTML, we can distinguish block and inline elements. (GUM uses only TEI-derived XML tags, which can also be divided into block and inline.) Block elements can appear only at paragraph boundaries, or in other words, paragraph boundaries are defined by the presence of one or more block-level tags. Inline elements are always annotated using XML= in MISC, even if they span a whole sentence. OK, this makes sense.

List items (<li>), definition lists (<dt>, <dd>) and <address> may not necessarily always mark a new sentence (and a new paragraph), but I understand it is not always easy/possible to decide (esp. with such "weird paradoxes"). If the annotators decide that "You will need: * 200 g flour..." should be a single sentence, they could use the XML= way for annotating the list structure.

So the only remaining question is the original one: what should we do?

- Edit the CoNLL-U specification to allow XML markup in newpar.
- Keep newpar only for the original purpose and move the XML markup elsewhere:
  - another sentence-level CoNLL-U comment for block-level tags, e.g. # xml = list type:::"ordered" (4 s) | item n:::"1" (1 s)
  - don't distinguish block-level and inline tags, keep both in MISC (XML=).
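As a rough illustration of the "another sentence-level comment" variant, the sketch below rewrites GUM-style comment lines into a plain # newpar plus a separate comment. The attribute name xml is only the example name floated in this thread, not an adopted convention, and the function name is hypothetical. The sketch also emits # newpar for every block opening; whether each block (for instance a zero-sentence figure) should count as a paragraph start is a corpus decision it does not settle.

```python
import re

# GUM-style line: '# newpar = <block markup>'. A spec-style '# newpar id = ...'
# line does not match, because 'id' sits between the keyword and the '='.
GUM_NEWPAR = re.compile(r"^# newpar = (.+)$")


def split_newpar(lines, name="xml"):
    """Replace '# newpar = X' with '# newpar' followed by '# <name> = X'."""
    out = []
    for line in lines:
        match = GUM_NEWPAR.match(line)
        if match:
            out.append("# newpar")
            out.append(f"# {name} = {match.group(1)}")
        else:
            out.append(line)
    return out


# Hypothetical comment lines from the start of a sentence block.
before = [
    '# newpar = list type:::"ordered" (4 s) | item n:::"1" (1 s)',
    "# text = Preheat the oven.",
]
print("\n".join(split_newpar(before)))
```

After this rewrite the # newpar line matches the strict pattern quoted at the top of the thread, while the block extents remain available to tools that want them.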

amir-zeldes commented 2 years ago

Yes, it's exactly as you described!

I don't feel passionately about which path to take, but I should perhaps explain why we didn't choose the second and third options you proposed: My initial instinct was to take the last one - represent everything in a single way using MISC and not bother with newpar at all.

But the main reason we went with newpar for the block elements in the end was out of respect for the CoNLL-U format: it already has a place intended for expressing paragraph transitions, so not including this information as intended and putting it somewhere else seemed to be going against the standard for no good reason. Why should a uniform CoNLL-U reader have a hard time figuring out where paragraphs are in GUM?

The other question of whether to simplify (only list newpar) or add all of the information also seemed to point towards adding it, because in reality newpars actually represent nestable blocks, and this is something we want to push CoNLL-U to allow us to represent in the future.

Finally in terms of doing both (plain newpar AND represent blocks in more detail in XML), I would point out that double inclusion of information is never a good idea, since we can have corrupt, conflicting information (if there is XML  but no newpar, is there a paragraph break or not?). I'm also not sure how to represent zero-sentence blocks, such as images, which work well with the pipe notation (# newpar = figure (0 s) | p (10 s)) but would require some other trick for the XML in MISC. I'm sure we would figure it out either way, but I for one would like a standard CoNLL-U way to express potentially nested newpar block extents.

nschneid commented 2 years ago

From the name newpar I would not expect the field to contain all information about nested block structure such as lists. I don't see a problem with using newpar just for paragraph delimiters (for some document processing purposes, this may be sufficient), and other fields for richer block structure for those users who need it.

nschneid commented 2 years ago

From my perspective the way to avoid corruption is with validation scripts, which GUM has anyway. :) If most UD treebanks had this rich XML structure it might be a different story, but assuming that it's just a few treebanks and they may have different kinds of XML, I would leave the standard fields like newpar as simple/uniform as possible.

dan-zeman commented 2 years ago

I proposed the newdoc and newpar sentence-level comments before releasing the data for the CoNLL 2017 parsing shared task (it was done in a bit of a rush, as I realized at the last minute that some corpora have this info and it can improve plain text generation from the data). The original idea was that a document is a sequence of paragraphs, and a paragraph is a sequence of sentences, no recursion. Then it became a bit more complicated when several people said they had "paragraph boundaries" inside sentences; from this perspective, you could read "paragraph boundary" as an obligatory line break.

Admittedly, the standard might look a bit different if it were a part of the original CoNLL-U specification and were discussed more thoroughly. But this is what we have, and it has been implemented in a number of treebanks in the meantime.

I think I prefer to leave newdoc and newpar as they are now. The need to encode a complex document structure does not seem to be a thing for most UD treebanks, so I would keep it separate. Remember, the UD standard is not meant to encode everything (although it tries to be flexible enough to allow arbitrary treebank-specific annotations where desired).

amir-zeldes commented 2 years ago

OK, it sounds like the consensus is that everyone would like to keep newpar as a flat, milestone-style annotation that just indicates a split, so I can implement that and rename our more complex annotation to something else.

That said, I have seen numerous occasions where people have used UD data to reconstruct plain text representations of datasets, so I think that UD should make a recommendation about how nested blocks like headings and item lists should be represented. That way people who can and want to preserve this information will be able to do so in a consistent way, and ultimately having this information is useful for parsing, tagging and more (e.g. being inside a heading totally alters probabilities for POS tags and trees, not to mention sentence splitting and tokenization issues).

Whatever solution is adopted, I would continue to argue that redundancies are potentially dangerous. Nathan is right that GUM has a build bot and validations which would prevent conflicts, but not everyone does, so for a future-proof recommendation, building it the right way is still advisable.