Paragraph and document boundaries

dan-zeman commented 7 years ago

UD guidelines currently do not specify how to mark document and paragraph boundaries, and for many treebanks such information is not available (original text gone, sentences shuffled, etc.). But where it is available, it can be useful for applications, including but not limited to sentence segmentation.

I am going to acquire this sort of annotation from data providers who have it, and make it available in UD release 2.0 in a unified way. This issue is to propose the way it is encoded in the data, and see if there are comments/suggestions (hopefully quickly solvable ones; the data freeze deadline is coming soon).

It turns out that paragraphs are not necessarily supersets of sentences. In some cases (bulleted list items), a new paragraph may start in the middle of a sentence. (That of course depends on how the two units are defined, but I am not looking for any standardized definition on such short notice. If you have paragraphs, you define what they are.)

As a result, paragraph boundaries should be marked at the token level. Document boundaries will be marked at the sentence level. My proposal:

The first sentence of a new document contains a comment that says # newdoc. (Nothing more. You can have a separate comment with a document id if you like, but I want to be able to recognize documents even if they don't have ids.) It is not necessary that the first sentence of a CoNLL-U file has the newdoc comment (e.g. if the document is split between dev and test data).
The first token of a new paragraph contains the attribute NewPar=Yes in the MISC column. Usually this will also be the first token of a sentence. If it is a multi-word token, the attribute will appear in the line of the multi-word token, not in the line of its first syntactic word.

Specifically seeking feedback from those who indicated they have doc or par info available: @lauma @liljao @jnivre @arademaker @Kira-D @kajad @TomazErjavec @LarsAhrenberg @natko5 and from those who may have to deal with the data :-) @foxik @martinpopel @fginter @spyysalo

(If you already have the info in the data in some form, e.g. inferrable from sentence ids, and prefer me to extract it and convert it to the unified annotation, let me know.)

martinpopel commented 7 years ago

I am not sure if there are more treebanks where sentences span multiple paragraphs and the paragraph boundaries are annotated. In the original UD_Czech-CLTT data, the boundaries were detected (and sub-sentence segments annotated independently, see this presentation) and encoded in node IDs, but this was lost when converting to UD (where a node ID is just an integer). I am not sure if it is worth the work to re-convert the treebank.

jnivre commented 7 years ago

I guess if people are unhappy about having paragraph information at the token level, we could make a distinction between intersentential paragraph boundaries, which would be annotated at the sentence level just like the document boundaries, and intrasentential paragraph boundaries, which (where they exist) would be annotated at the token level in the MISC field.

msklvsk commented 7 years ago

Confirming: we have multiple places in Ukrainian where sentences span across paragraphs: dialogs, lists in legal texts, verse. Those pars were originally present in the source. Your proposal looks good.

foxik commented 7 years ago

I like Nivre's suggestion -- maybe we could have a # newpar comment for paragraph boundaries at sentence level, and NewPar=Yes for in-sentence paragraph boundaries (i.e., used only on a non-first token). That way processing can be symmetrical to documents if people care only about paragraph boundaries at sentence level (which I assume would be most usages for automatic processing, judging from how I plan to use these markups).

dan-zeman commented 7 years ago

Fair enough. So the new proposal would be:

The first sentence of a new document contains a comment that says # newdoc. (Nothing more. You can have a separate comment with a document id if you like, but I want to be able to recognize documents even if they don't have ids.) It is not necessary that the first sentence of a CoNLL-U file has the newdoc comment (e.g. if the document is split between dev and test data).
When a paragraph starts at sentence boundary, the first sentence of the paragraph contains a comment that says # newpar. (Again, this comment is not used for paragraph ids.)
When a new paragraph starts between two tokens of a sentence, the first token of the new paragraph contains the attribute NewPar=Yes in the MISC column. If it is a multi-word token, the attribute will appear in the line of the multi-word token, not in the line of its first syntactic word.
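To illustrate the two paragraph markers in this proposal, here is a minimal Python sketch that lists paragraph starts in a CoNLL-U string. The function name `paragraph_starts` and the sample data are hypothetical and are not part of any UD tool. The sketch assumes tab-separated 10-column token lines, with blank lines between sentences.

```python
import re

# Hypothetical sample: one sentence that opens a paragraph ("# newpar")
# and contains a second paragraph start at token 3 (NewPar=Yes in MISC).
SAMPLE = """\
# newdoc id = doc1
# newpar
# sent_id = 1
# text = First item second item
1\tFirst\t_\t_\t_\t_\t_\t_\t_\t_
2\titem\t_\t_\t_\t_\t_\t_\t_\t_
3\tsecond\t_\t_\t_\t_\t_\t_\t_\tNewPar=Yes
4\titem\t_\t_\t_\t_\t_\t_\t_\t_

"""


def paragraph_starts(conllu_text):
    """Return (sentence_index, token_id) pairs, one per paragraph start.

    token_id is None for a sentence-level "# newpar" comment and the ID
    column value (e.g. "3" or "1-2") for a token-level NewPar=Yes.
    """
    starts = []
    sent_index = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if in_sentence:
                sent_index += 1
                in_sentence = False
            continue
        in_sentence = True
        if line.startswith("#"):
            if re.match(r"#\s*newpar(\s|$)", line):
                starts.append((sent_index, None))
            continue
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a well-formed token line; skipped in this sketch
        if "NewPar=Yes" in columns[9].split("|"):
            starts.append((sent_index, columns[0]))
    return starts


print(paragraph_starts(SAMPLE))
```

Because the attribute is read from whichever line carries it, a multi-word token line (ID such as `1-2`) is reported with that range as its token ID.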

Any objections?

(Alternatively, if people prefer so, we could even allow the doc/par ids, i.e. "# newdoc id = xyz". When generating raw text, I would just ignore them and assume anything that matches either /^#\s*newdoc$/ or /^#\s*newdoc\s/ marks a new document. Similarly for paragraphs, but there is no defined way for ids of intra-sentence paragraphs. Any thoughts on that?)
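The keyword matching described above can be sketched in Python as follows. The two regular expressions from the comment are folded into one pattern (`newdoc` followed by whitespace or end of line); the function name `boundary_kind` is made up for this illustration.

```python
import re

# "# newdoc" alone, or "# newdoc" followed by whitespace and anything else.
NEWDOC = re.compile(r"^#\s*newdoc(\s|$)")
NEWPAR = re.compile(r"^#\s*newpar(\s|$)")


def boundary_kind(comment_line):
    """Return "newdoc", "newpar", or None for a CoNLL-U comment line."""
    if NEWDOC.match(comment_line):
        return "newdoc"
    if NEWPAR.match(comment_line):
        return "newpar"
    return None


for line in ["# newdoc", "# newdoc id = xyz", "# newdocument", "# newpar"]:
    print(repr(line), boundary_kind(line))
```

With this pattern anything after the keyword is ignored, so an id can be present or absent without affecting boundary detection.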

jnivre commented 7 years ago

Looks good to me. Working on it.

jnivre commented 7 years ago

I implemented the version with "# newdoc id = xyz". I can easily change this if people don't like it.

martinpopel commented 7 years ago

I also agree. Note that this is related to how the test data for the CoNLL 2017 shared task will look. (I know there is a special repo, but let's discuss it here for convenience.) Participants' code at http://tira.io will get an input plain-text file with the test set sentences, and my suggestion is:

Word-wrap all lines to 80 characters.
Sentences are separated by a single space (sentence segmentation is part of the task).
Paragraph boundaries are marked by an empty line (double newline).

The question is whether intra-sentence paragraph boundaries should be marked (by the empty line) as well. I guess the answer is yes.
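A minimal Python sketch of the suggested plain-text layout, assuming the input is already grouped as a list of paragraphs, each a list of sentence strings, and assuming a language that is written with spaces between sentences. The function name `to_raw_text` is hypothetical; this is not the shared-task tooling.

```python
import textwrap


def to_raw_text(paragraphs, width=80):
    """Join sentences with one space, wrap to `width`, and separate
    paragraphs with an empty line."""
    blocks = []
    for sentences in paragraphs:
        blocks.append(textwrap.fill(" ".join(sentences), width=width))
    return "\n\n".join(blocks) + "\n"


print(to_raw_text([["First sentence.", "Second sentence."], ["New paragraph."]]))
```

Since wrapping happens per paragraph, a single newline in the output never signals a sentence or paragraph boundary; only the empty line does.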

foxik commented 7 years ago

Personally I like the # newdoc( id = .*)? version with optional id. We could allow this also for paragraphs (we needed sentence ids in V2, so maybe paragraphs ids could come up later). In case of paragraphs, in addition to # newpar( id = .*)? we would need a feature for MISC column -- something like NewPar=Yes|NewParId=...?
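As a rough sketch of the optional-id form, the following Python pattern accepts `# newdoc` or `# newpar` with or without ` id = ...` and returns the id when present. The name `parse_boundary` is invented for this example, and the tolerance for extra whitespace is an assumption. Note that this stricter pattern does not match a line such as `# newdoc = true`.

```python
import re

# Keyword, then optionally "id = <value>"; trailing whitespace is ignored.
BOUNDARY = re.compile(r"^#\s*(newdoc|newpar)(?:\s+id\s*=\s*(.*?))?\s*$")


def parse_boundary(line):
    """Return (kind, id_or_None), or None if the line is not a boundary comment."""
    match = BOUNDARY.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_boundary("# newdoc id = xyz"))
print(parse_boundary("# newpar"))
```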

@martinpopel Personally I would not mark intra-sentence paragraph boundaries in the raw data. If you do so, then paragraph boundaries have little meaning (you cannot use them for detecting an end of sentence which was induced by the paragraph boundary, which is common and problematic in the English data, for example). But note that I say this more as a participant than a task organizer :-)

@martinpopel I am not sure whether Chinese and/or Japanese data use spaces between sentences. If they do not, then adding a space between sentences would make sentence segmentation trivial.

dan-zeman commented 7 years ago

OK, I will have ids in Czech as well.

@foxik : The task is about parsing raw text as in real world, thus I would not hide intra-sentence paragraphs from the systems.

Sentences in Chinese are not separated by spaces (purely visually they are, because the period glyph leans towards the left edge of the square it is assigned), so we should make sure not to add any spaces in these languages. Consider this example from Chinese Wikipedia:

英格利希非常年轻时就从政，成为印第安纳州杰西·布莱特保守派系民主党的一分子。1845年，他开始在首都的联邦官僚机构任职，于1850年返回印第安纳州，还参加了该州的制宪大会。1851年，年仅29岁的英格利希当选州众议员并成为议长。两年后，他又当选为联邦参议员并3度连任，从1853年开始一直任职到1861年，这一期间他最大的成就便是达成妥协法案，接纳堪萨斯成为美国联邦的一个新州。1861年，...

amir-zeldes commented 7 years ago

I've been using:

# begin document NAMEOFDOCUMENT

At the first sentence, based on the CoNLL coreference shared task format. But I'm happy to switch to

# newdoc id = NAMEOFDOCUMENT

I would only vote against:

Separate comment for new document beginning and the ID (this could lead to having a doc ID without the newdoc comment, which I think is an error)
Marking anything with double blank lines - we have comments and we have 1 blank line between sentences. Introducing a third mechanism is highly undesirable from my point of view, and a double new line can happen much more easily by accident (think of concatenating multiple conllu files, where one file has a long tail of trailing new lines)

martinpopel commented 7 years ago

> we needed sentence ids in V2, so maybe paragraphs ids could come up later

sent_id can encode the document ID and paragraph ID. This internal structure of sent_id is already used in several treebanks, although it is not standardized yet (e.g. cmpr9406-009-p3s1 means doc=cmpr9406-009, paragraph=3, sentence=1). On the one hand, I would prefer simplicity and just one attribute sent_id instead of sent_id + newdoc + id + newpar + id (not to mention the token-level parts). On the other hand, I would prefer simplicity:-) and would like to keep sent_id short, as it may appear in cross-sentence links (coreference, bridging, discourse...) in the future. This mainly concerns the document IDs (paragraph and sentence IDs will probably be just integers). So for example, cmpr9406-009 seems like a usable document ID, but I would like to prevent IDs such as w-005.SCG*LB1.CP--++7.N.-3.7-2.8-2, which are long and difficult to grep and to include in URLs or XML ids without extensive escaping. So if such document names need to be stored, I would keep them in a separate CoNLL-U comment, not in sent_id. Of course, this is mostly up to the treebank maintainers, but there should be UD recommendations.
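For illustration only, a sent_id with the internal structure of the example above could be split like this in Python. The `<doc>-p<par>s<sent>` pattern is an assumption generalized from that single example, not a UD standard, and `split_sent_id` is a hypothetical name.

```python
import re

# Assumed layout: <document id>-p<paragraph number>s<sentence number>
SENT_ID = re.compile(r"^(?P<doc>.+)-p(?P<par>\d+)s(?P<sent>\d+)$")


def split_sent_id(sent_id):
    """Return (doc, paragraph, sentence), or None if the id has another shape."""
    match = SENT_ID.match(sent_id)
    if match is None:
        return None
    return match.group("doc"), int(match.group("par")), int(match.group("sent"))


print(split_sent_id("cmpr9406-009-p3s1"))
```

Treebanks whose sentence ids are plain integers or use a different scheme simply return None here, which is why an explicit boundary marker is still useful.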

> I would not mark intra-sentence paragraph boundaries in the raw data

I agree it's more difficult for the shared-task participants, but it's more real world. In legal texts, you can easily recognize paragraphs (bullets, numbered lists), but you cannot be sure if there is a sentence boundary as well. That said, if there are just one or two treebanks with this complexity, I would prefer to drop it to make the shared task simpler (no need to document NewPar=Yes).

> whether Chinese and/or Japanese data use spaces between sentences

Good catch. I meant the raw data should be as it is usually written.

martinpopel commented 7 years ago

> Marking anything with double blank lines - we have comments and we have 1 blank line between sentences.

This is a misunderstanding. The double newline I mentioned was for plain-text files as an input for the parsers in the CoNLL 2017 shared task, not for the CoNLL-U format. So don't worry, no one suggests double blank lines in CoNLL-U.

amir-zeldes commented 7 years ago

Oh, OK, my apologies, I misunderstood.

BTW I also embed the doc id into the sentence id, but I also use a newdoc comment. It's easy to ignore stuff you don't need...

foxik commented 7 years ago

I understand why @martinpopel and @dan-zeman want the intra-sentence paragraph boundaries in the raw data, so let them be there.

BTW, note that this just means that in the raw data there are multiple types of paragraph boundaries, because in some languages (at least English and Czech) some paragraph boundaries are unconditionally sentence breaks (I guess at least the heading / text break), while some are not (in the treebanks containing intra-sentence paragraph breaks). What it means for the participants is probably that they will create a list of languages which allow intra-sentence paragraph breaks (which is obvious from the data) and allow them only for those.

arademaker commented 7 years ago

@dan-zeman In UD_Portuguese we have the original documents and their sentences. We don't have the paragraphs explicitly marked, but we may be able to recover them in the future from the original Linguateca files. About the suggested annotation, I would prefer to have all metadata in the comment lines in a more uniform KEY = VALUE manner; this would facilitate the parsing of CoNLL-U files. That said, why not:

# new_document = true (or id)

@martinpopel said:

> I am not sure if there are more treebanks where sentences span over more paragraphs and the paragraph boundaries are annotated.

We have many issues with discourse structure in Portuguese. Yesterday I had a conversation with @claudiafreitas about it: sentences terminated with unusual punctuation such as :, quotes that start and end in different sentences, etc. This is definitely something to be investigated further.

@msklvsk said

> Confirming: we have multiple places in Ukrainian where sentences span across paragraphs: dialogs, lists in legal texts, verse. Those pars were originally present in the source. Your proposal looks good.

Do you think that the two proposals from @dan-zeman (sentence level and MISC field) are sufficient to annotate all these cases?

msklvsk commented 7 years ago

@arademaker Actually, the most precise solution from an information-preserving perspective, I believe, would be to mark not just paragraph boundaries, but paragraphs’ start/end separately. Sometimes we grab texts from the middle of a paragraph. If we enclosed them in "<p></p>", that would not actually be true. That is, as if we copy-pasted from HTML: where we meet </p> we put ParClose=Yes, where we meet <p> we put ParOpen=Yes. That would tell AI the truth. For the sake of parsing simplicity, I would not introduce newpar if annotating the first token in a sentence is equivalent.

dan-zeman commented 7 years ago

@arademaker :

> I would prefer to have all metadata in the comment lines in a more uniform manner KEY = VALUE

That sounds like it would unnecessarily limit the sentence-level comments, where matching whole lines against regular expressions can pick out the attribute you are looking for. I already have a few treebanks with the # newdoc id = xyz pattern. But as I said, all I need now is to grep the initial keyword, so even if you have # newdoc = true in your data, it should be recognized when raw text is generated (provided the keyword is newdoc and not new_document as in your example).

The first version of the script that generates the raw text from a CoNLL-U file is here: https://github.com/UniversalDependencies/tools/blob/master/conllu_to_text.pl. It can take the --lang option followed by language code. Codes zh and ja will trigger the algorithm that is more suitable for Chinese and Japanese.

dan-zeman commented 7 years ago

I have put the specification in the documentation of the CoNLL-U format:

http://universaldependencies.org/format.html#paragraph-and-document-boundaries

Ansa211 commented 6 years ago

Some remarks on the current state of newdoc and newpar comments:

the specification says that "It is not necessary that the first sentence of a CoNLL-U file has the newdoc comment (e.g. if the document is split between development and test data)." But in the following corpora, none of the files starts with a newdoc, which I believe should not be the case (the first newdoc in the corpus is listed, assuming concatenation in the train-dev-test order):
UD_Belarusian-HSE/be_hse-ud-train.conllu:4576
UD_English-LinES/en_lines-ud-train.conllu:7403
UD_Portuguese-Bosque/pt_bosque-ud-train.conllu:37314
UD_Russian-Taiga/ru_taiga-ud-test.conllu:2009
UD_Swedish-LinES/sv_lines-ud-train.conllu:6458
some corpora distinguish between document id (marked on the newdoc comment and having the form of a whitespace-free string) and document title (# doc_title = ...); but there are things like
```
UD_Romanian-Nonstandard/ro_nonstandard-ud-train.conllu:2:# newdoc id = New Testament 1648 Alba Iulia Gospel-1801-5173
UD_Lithuanian-HSE/lt_hse-ud-train.conllu:259:# newdoc id = lt-ru-4-Venclova-AŠ_DŪSTU-Я задыхаюсь
UD_Belarusian-HSE/be_hse-ud-train.conllu:4576:# newdoc = Звязда. Эканоміка, Сельская гаспадарка
```
(the "id" keyword is missing in this last case; a separate title comment immediately follows, so maybe these 4 words are really some kind of id?) So doc_title should also be mentioned in the specification, as it would be more suitable in these cases.

Listing the sentence text before the newdoc comment is counter-intuitive:

$ head -6 UD_Old_French-SRCMF/fro_srcmf-ud-train.conllu
# text = E l' arcevesque lor ocist Siglorel L' encanteür ki ja fut en enfer
# newdoc id = Roland_1100_verse
# sent_id = 1344
1       E       _       CCONJ   CONcoo  _       5       cc:nc   _       _
2       l'      _       DET     DETdef  Definite=Def|PronType=Art       3       det     _       _
3       arcevesque      _       NOUN    NOMcom  _       5       nsubj   _       _

I would suggest that document-level metadata should always precede paragraph-level metadata, which in turn should always precede sentence-level metadata.
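The suggested ordering could be checked mechanically. Below is a minimal Python sketch that reports comment lines appearing after a more specific comment within the same sentence. The ranking (newdoc, then newpar, then everything else) is my reading of the suggestion, and the function names are hypothetical.

```python
import re


def comment_rank(line):
    """0 for document-level, 1 for paragraph-level, 2 for any other comment."""
    if re.match(r"#\s*newdoc(\s|$)", line):
        return 0
    if re.match(r"#\s*newpar(\s|$)", line):
        return 1
    return 2


def misordered_lines(conllu_text):
    """Return 1-based line numbers of comments that follow a more specific one."""
    problems = []
    highest = -1
    for number, line in enumerate(conllu_text.splitlines(), start=1):
        if not line.strip():
            highest = -1  # blank line: the next sentence starts fresh
            continue
        if not line.startswith("#"):
            continue  # token lines do not affect comment ordering
        rank = comment_rank(line)
        if rank < highest:
            problems.append(number)
        highest = max(highest, rank)
    return problems


example = (
    "# text = E l' arcevesque lor ocist Siglorel L' encanteür ki ja fut en enfer\n"
    "# newdoc id = Roland_1100_verse\n"
    "# sent_id = 1344\n"
)
print(misordered_lines(example))
```

On the header quoted above, the newdoc line (line 2) is flagged because it follows the sentence-level text comment.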

I think newdoc comments with an id and newdoc comments without an id should not be mixed in the same corpus/file (as they are in UD_Belarusian-HSE/be_hse-ud-train.conllu, UD_Yoruba-YTB/yo_ytb-ud-test.conllu)
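A quick way to spot such mixing is to count both forms per file. The Python sketch below treats anything other than `# newdoc id = <something>` as "without id", which is an assumption of this illustration; `newdoc_id_usage` is a made-up name.

```python
import re


def newdoc_id_usage(conllu_text):
    """Return (with_id, without_id) counts of newdoc comments."""
    with_id = 0
    without_id = 0
    for line in conllu_text.splitlines():
        if not re.match(r"#\s*newdoc(\s|$)", line):
            continue
        if re.match(r"#\s*newdoc\s+id\s*=\s*\S", line):
            with_id += 1
        else:
            without_id += 1
    return with_id, without_id


sample = "# newdoc id = a\n# sent_id = 1\n\n# newdoc\n# sent_id = 2\n"
with_id, without_id = newdoc_id_usage(sample)
print("mixed" if with_id and without_id else "consistent")
```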

UniversalDependencies / docs

Paragraph and document boundaries #412