Closed dan-zeman closed 4 years ago
I am not sure if there are more treebanks where sentences span several paragraphs and the paragraph boundaries are annotated. In the UD_Czech-CLTT original data, the boundaries were detected (and sub-sentence segments annotated independently, see this presentation) and encoded in node IDs, but this has been lost when converting to UD (where a node ID is just an integer). I am not sure if it is worth the work to re-convert the treebank.
I guess if people are unhappy about having paragraph information at the token level, we could make a distinction between intersentential paragraph boundaries, which would be annotated at the sentence level just like the document boundaries, and intrasentential paragraph boundaries, which (where they exist) would be annotated at the token level in the MISC field.
Confirming: we have multiple places in Ukrainian where sentences span across paragraphs: dialogs, lists in legal texts, verse. Those pars were originally present in the source. Your proposal looks good.
I like Nivre's suggestion -- maybe we could have a # newpar comment for paragraph boundaries at sentence level, and NewPar=Yes for in-sentence paragraph boundaries (i.e., used only on a non-first token). That way processing can be symmetrical to documents if people care only for paragraph boundaries at sentence level (which I assume would be most usages for automatic processing, judging from how I plan to use these markups).
Fair enough. So the new proposal would be:
A sentence that starts a new document carries the sentence-level comment # newdoc. (Nothing more. You can have a separate comment with a document id if you like, but I want to be able to recognize documents even if they don't have ids.) It is not necessary that the first sentence of a CoNLL-U file has the newdoc comment (e.g. if the document is split between dev and test data).
A sentence that starts a new paragraph carries the sentence-level comment # newpar. (Again, this comment is not used for paragraph ids.)
A token that starts a new paragraph in the middle of a sentence carries NewPar=Yes in the MISC column. If it is a multi-word token, the attribute will appear in the line of the multi-word token, not in the line of its first syntactic word.
Any objections?
(Alternatively, if people prefer so, we could even allow the doc/par ids, i.e. "# newdoc id = xyz". When generating raw text, I would just ignore them and assume anything that matches either /^#\s*newdoc$/ or /^#\s*newdoc\s/ marks a new document. Similarly for paragraphs, but there is no defined way for ids of intra-sentence paragraphs. Any thoughts on that?)
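The two patterns quoted in the comment above can be merged into one expression per keyword. The following is only a minimal Python sketch of that matching rule, not the code used by any UD tool; the names NEWDOC_RE, NEWPAR_RE, starts_document and starts_paragraph are made up for illustration.

```python
import re

# The keyword alone, or the keyword followed by whitespace and anything else
# (e.g. an optional "id = xyz" part). Names here are illustrative only.
NEWDOC_RE = re.compile(r"^#\s*newdoc(\s|$)")
NEWPAR_RE = re.compile(r"^#\s*newpar(\s|$)")


def starts_document(line: str) -> bool:
    """True if the comment line marks the start of a new document."""
    return NEWDOC_RE.match(line.rstrip("\n")) is not None


def starts_paragraph(line: str) -> bool:
    """True if the comment line marks a sentence-level paragraph start."""
    return NEWPAR_RE.match(line.rstrip("\n")) is not None
```

With this rule, "# newdoc", "# newdoc id = xyz" and "# newdoc = true" are all recognized, while a different keyword such as "# new_document = true" is not.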
Looks good to me. Working on it.
I implemented the version with "# newdoc id = xyz". I can easily change this if people don't like it.
I also agree. Note that this is related to what the test data for the CoNLL2017 shared task will look like. (I know there is a special repo, but let's discuss it here for convenience.) Participants' code at http://tira.io will get an input plain-text file with the test set sentences and my suggestion is
The question is whether intra-sentence paragraph boundaries should be marked (by the empty line) as well. I guess the answer is yes.
Personally I like the # newdoc( id = .*)? version with optional id. We could allow this also for paragraphs (we needed sentence ids in V2, so maybe paragraph ids could come up later). In the case of paragraphs, in addition to # newpar( id = .*)? we would need a feature for the MISC column -- something like NewPar=Yes|NewParId=... ?
@martinpopel Personally I would not mark intra-sentence paragraph boundaries in the raw data. If you do so, then paragraph boundaries have little meaning (you cannot use them for detecting an end of sentence which was induced by the paragraph boundary, which is common and problematic in the English data for example). But note that I say this more as a participant than a task organizer :-)
@martinpopel I am not sure whether Chinese and/or Japanese data use spaces between sentences. If they do not, then adding a space between sentences would make sentence segmentation trivial.
OK, I will have ids in Czech as well.
@foxik : The task is about parsing raw text as in real world, thus I would not hide intra-sentence paragraphs from the systems.
Sentences in Chinese are not separated by spaces (purely visually they are, because the period glyph leans towards the left edge of the square it is assigned), so we should make sure not to add any spaces in these languages. Consider this example from Chinese Wikipedia:
英格利希非常年轻时就从政,成为印第安纳州杰西·布莱特保守派系民主党的一分子。1845年,他开始在首都的联邦官僚机构任职,于1850年返回印第安纳州,还参加了该州的制宪大会。1851年,年仅29岁的英格利希当选州众议员并成为议长。两年后,他又当选为联邦参议员并3度连任,从1853年开始一直任职到1861年,这一期间他最大的成就便是达成妥协法案,接纳堪萨斯成为美国联邦的一个新州。1861年,...
I've been using:
# begin document NAMEOFDOCUMENT
At the first sentence, based on the CoNLL coreference shared task format. But I'm happy to switch to
# newdoc id = NAMEOFDOCUMENT
I would only vote against:
we needed sentence ids in V2, so maybe paragraphs ids could come up later
sent_id can encode the document ID and paragraph ID. This internal structure of sent_id is already used in several treebanks, although it is not standardized yet (e.g. cmpr9406-009-p3s1 means doc=cmpr9406-009, paragraph=3, sentence=1).
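To illustrate the kind of internal structure meant here, below is a small Python sketch that splits an id of the form doc-pNsM. The pattern is an assumption derived only from the single example cmpr9406-009-p3s1; as said above, this is not standardized, and other treebanks may structure their ids differently. SentRef and parse_sent_id are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class SentRef(NamedTuple):
    doc: str
    paragraph: int
    sentence: int


# Assumed shape: <doc>-p<paragraph>s<sentence>, modelled on cmpr9406-009-p3s1.
SENT_ID_RE = re.compile(r"^(?P<doc>.+)-p(?P<par>\d+)s(?P<sent>\d+)$")


def parse_sent_id(sent_id: str) -> Optional[SentRef]:
    """Split a structured sent_id; return None if it does not follow the assumed shape."""
    match = SENT_ID_RE.match(sent_id)
    if match is None:
        return None
    return SentRef(match.group("doc"), int(match.group("par")), int(match.group("sent")))
```

For the example above, parse_sent_id("cmpr9406-009-p3s1") gives doc "cmpr9406-009", paragraph 3, sentence 1; a plain integer id such as "1344" yields None.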
On the one hand, I would prefer simplicity and just one attribute sent_id instead of sent_id + newdoc + id + newpar + id (not to mention the token-level parts).
On the other hand, I would prefer simplicity:-) and like to keep sent_id short, as it may appear in cross-sentence links (coreference, bridging, discourse...) in future. This mainly concerns the document IDs (paragraph and sentence IDs will probably be just integers). So for example, cmpr9406-009 seems to be a usable document ID, but I would like to prevent IDs such as w-005.SCG*LB1.CP--++7.N.-3.7-2.8-2, which are long and difficult to grep and to include in URLs or XML ids without extensive escaping. So if such document names need to be stored, I would keep them in a separate CoNLL-U comment, not in sent_id. Of course, this is mostly up to the treebank maintainers, but there should be UD recommendations.
I would not mark intra-sentence paragraph boundaries in the raw data
I agree it's more difficult for the shared-task participants, but it's more real world. In legal texts, you can easily recognize paragraphs (bullets, numbered lists), but you cannot be sure if there is a sentence boundary as well. That said, if there are just one or two treebanks with this complexity, I would prefer to drop it to make the shared task simpler (no need to document NewPar=Yes).
whether Chinese and/or Japanese data use spaces between sentences
Good catch. I meant the raw data should be as it is usually written.
Marking anything with double blank lines - we have comments and we have 1 blank line between sentences.
This is a misunderstanding. The double newline I mentioned was for plain-text files as an input for the parsers in the CoNLL2017 shared task, not for the CoNLL-U format. So don't worry, no one suggests double blank lines in CoNLL-U.
Oh, OK, my apologies I misunderstood.
BTW I also embed the doc id into the sentence id, but I also use a newdoc comment. It's easy to ignore stuff you don't need...
I understand why @martinpopel and @dan-zeman want the intra-sentence paragraph boundaries in the raw data, so let them be there.
BTW, note that this just means that the raw data contain multiple types of paragraph boundaries: in some languages (at least English and Czech) some paragraph boundaries always act as sentence breaks (I guess at least the heading / text break), while others do not (in the treebanks containing intra-sentence paragraph breaks). For the participants this probably means that they will create a list of languages which allow intra-sentence paragraph breaks (which is obvious from the data) and allow them only for those.
@dan-zeman In UD_Portuguese we have the original documents and their sentences. We don't have the paragraphs explicitly marked, but we may be able to recover them in the future from the original Linguateca files. About the suggested annotation, I would prefer to have all metadata in the comment lines in a more uniform manner, KEY = VALUE; this would facilitate the parsing of CoNLL-U files. That said, why not
# new_document = true (or id)
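A rough Python sketch of what such uniform KEY = VALUE parsing could look like follows. It is only an illustration of the idea, not an existing UD tool; parse_comment is a hypothetical name, and treating a comment without "=" as a bare keyword with no value is an assumption added here so that lines like "# newdoc" still parse.

```python
from typing import Optional, Tuple


def parse_comment(line: str) -> Tuple[str, Optional[str]]:
    """Split a comment line into (key, value); value is None for a bare keyword."""
    if not line.startswith("#"):
        raise ValueError("not a comment line: %r" % line)
    body = line[1:].strip()
    # Split at the first "=" only, so that values may themselves contain "=".
    key, separator, value = body.partition("=")
    if not separator:
        return body, None
    return key.strip(), value.strip()
```

Note that under this scheme a line such as "# newdoc id = xyz" ends up with the key "newdoc id", so a consumer would still have to look at the first word of the key to recognize the document boundary.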
@martinpopel said:
I am not sure if there are more treebanks where sentences span over more paragraphs and the paragraph boundaries are annotated.
In Portuguese we have many issues with discourse structures. Yesterday I had a conversation with @claudiafreitas about it: sentences terminated with unusual punctuation such as ":", quotes that start and end in different sentences, etc. This is definitely something to be investigated further.
@msklvsk said
Confirming: we have multiple places in Ukrainian where sentences span across paragraphs: dialogs, lists in legal texts, verse. Those pars were originally present in the source. Your proposal looks good.
Do you think that the two proposals from @dan-zeman (sentence level and MISC field) are sufficient to annotate all these cases?
@arademaker Actually, the most precise solution from an information-preserving perspective, I believe, would be not just to mark paragraph boundaries, but to mark paragraph starts and ends separately. Sometimes we grab texts from the middle of a paragraph. If we enclose them in "<p></p>", that would not actually be the truth. That is, as if we copy-pasted from HTML: where we meet </p> we put ParClose=Yes, where we meet <p> we put ParOpen=Yes. That would tell AI the truth.
For the sake of parsing simplicity, I would not introduce newpar if annotating the first token in a sentence is equivalent.
@arademaker :
I would prefer to have all metadata in the comment lines in a more uniform manner KEY = VALUE
That sounds like an unnecessary limitation of the sentence-level comments, where matching whole lines against regular expressions can pick the attribute you are looking for. I already have a few treebanks with the # newdoc id = xyz pattern. But as I said, all I need now is to grep the initial keyword, so even if you have # newdoc = true in your data, it should be recognized when raw text is generated (provided the keyword is newdoc and not new_document as in your example).
The first version of the script that generates the raw text from a CoNLL-U file is here:
https://github.com/UniversalDependencies/tools/blob/master/conllu_to_text.pl. It can take the --lang option followed by a language code. The codes zh and ja will trigger the algorithm that is more suitable for Chinese and Japanese.
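For readers who want to see the general idea without reading the Perl, here is a simplified Python sketch. It is not a port of conllu_to_text.pl and makes several assumptions: it relies on a "# text = ..." comment being present for every sentence, it starts a new paragraph at every newdoc or newpar comment, it joins sentences with a space except for zh and ja, and it ignores intra-sentence NewPar=Yes entirely. The function name conllu_to_paragraphs is made up.

```python
import re

NO_SPACE_LANGS = {"zh", "ja"}  # sentences are joined without a space for these codes
BOUNDARY_RE = re.compile(r"^#\s*new(doc|par)(\s|$)")
TEXT_RE = re.compile(r"^#\s*text\s*=\s*(.*)$")


def conllu_to_paragraphs(lines, lang=None):
    """Return a list of paragraph strings built from the sentence texts."""
    joiner = "" if lang in NO_SPACE_LANGS else " "
    paragraphs = []  # finished paragraphs
    current = []     # sentence texts of the paragraph being built
    boundary, text = False, None
    # A trailing empty string acts as a sentinel that flushes the last sentence.
    for line in list(lines) + [""]:
        line = line.rstrip("\n")
        if not line.strip():
            # End of a sentence block: comments may come in any order inside it.
            if text is not None:
                if boundary and current:
                    paragraphs.append(joiner.join(current))
                    current = []
                current.append(text)
            boundary, text = False, None
        elif BOUNDARY_RE.match(line):
            boundary = True
        else:
            match = TEXT_RE.match(line)
            if match:
                text = match.group(1)
    if current:
        paragraphs.append(joiner.join(current))
    return paragraphs
```

Writing the result as plain text would then be a matter of joining the returned paragraphs with an empty line between them.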
I have put the specification in the documentation of the CoNLL-U format:
http://universaldependencies.org/format.html#paragraph-and-document-boundaries
Some remarks on the current state of newdoc and newpar comments:
the specification says that "It is not necessary that the first sentence of a CoNLL-U file has the newdoc comment (e.g. if the document is split between development and test data)." But in the following corpora, none of the files starts with a newdoc, which I believe should not be the case (the first newdoc in the corpus is listed, assuming concatenation in the train-dev-test order):
UD_Belarusian-HSE/be_hse-ud-train.conllu:4576
UD_English-LinES/en_lines-ud-train.conllu:7403
UD_Portuguese-Bosque/pt_bosque-ud-train.conllu:37314
UD_Russian-Taiga/ru_taiga-ud-test.conllu:2009
UD_Swedish-LinES/sv_lines-ud-train.conllu:6458
some corpora distinguish between document id (marked on the newdoc comment and having the form of a whitespace-free string) and document title (# doc_title = ...); but there are things like
UD_Romanian-Nonstandard/ro_nonstandard-ud-train.conllu:2:# newdoc id = New Testament 1648 Alba Iulia Gospel-1801-5173
UD_Lithuanian-HSE/lt_hse-ud-train.conllu:259:# newdoc id = lt-ru-4-Venclova-AŠ_DŪSTU-Я задыхаюсь
UD_Belarusian-HSE/be_hse-ud-train.conllu:4576:# newdoc = Звязда. Эканоміка, Сельская гаспадарка
(the "id" keyword is missing in this last case; a separate title comment immediately follows, so maybe these 4 words are really some kind of id?)
--> doc_title should also be mentioned in the specification as it would be more suitable in these cases
Listing the sentence text before the newdoc comment is counter-intuitive:
$ head -6 UD_Old_French-SRCMF/fro_srcmf-ud-train.conllu
# text = E l' arcevesque lor ocist Siglorel L' encanteür ki ja fut en enfer
# newdoc id = Roland_1100_verse
# sent_id = 1344
1 E _ CCONJ CONcoo _ 5 cc:nc _ _
2 l' _ DET DETdef Definite=Def|PronType=Art 3 det _ _
3 arcevesque _ NOUN NOMcom _ 5 nsubj _ _
I would suggest that document-level metadata should always precede paragraph-level metadata, which in turn should always precede sentence-level metadata.
I think newdoc comments with an id and newdoc comments without an id should not be mixed in the same corpus/file (as they are in UD_Belarusian-HSE/be_hse-ud-train.conllu, UD_Yoruba-YTB/yo_ytb-ud-test.conllu)
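The last two remarks could be checked mechanically. The Python sketch below is only an illustration of such a check, not part of the UD validator; the function names and the three-level ranking (document, paragraph, everything else) are assumptions made here.

```python
import re

NEWDOC_RE = re.compile(r"^#\s*newdoc(\s|$)")
NEWPAR_RE = re.compile(r"^#\s*newpar(\s|$)")
NEWDOC_ID_RE = re.compile(r"^#\s*newdoc\s+id\s*=")


def comment_level(line):
    """0 = document level, 1 = paragraph level, 2 = sentence level, None = not a comment."""
    if not line.startswith("#"):
        return None
    if NEWDOC_RE.match(line):
        return 0
    if NEWPAR_RE.match(line):
        return 1
    return 2


def find_order_violations(lines):
    """Return 1-based numbers of comment lines that follow a lower-level comment of the same sentence."""
    violations = []
    deepest = -1  # deepest level seen so far in the current sentence block
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            deepest = -1  # a blank line ends the sentence block
            continue
        level = comment_level(line)
        if level is None:
            continue
        if level < deepest:
            violations.append(number)
        deepest = max(deepest, level)
    return violations


def mixes_newdoc_styles(lines):
    """True if the file has both newdoc comments with an id and newdoc comments without one."""
    with_id = without_id = False
    for line in lines:
        line = line.rstrip("\n")
        if NEWDOC_RE.match(line):
            if NEWDOC_ID_RE.match(line):
                with_id = True
            else:
                without_id = True
    return with_id and without_id
```

Run on the Old French excerpt above, find_order_violations reports the newdoc line (line 2), because it comes after the sentence-level text comment.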
UD guidelines currently do not specify how to mark document and paragraph boundaries and for many treebanks such information is not available (original text gone, sentences shuffled etc.) But where it is available, it can be potentially useful for applications, including but not limited to sentence segmentation.
I am going to acquire this sort of annotation from data providers who have it, and make it available in UD release 2.0 in a unified way. This issue is to propose the way it is encoded in the data, and see if there are comments/suggestions (hopefully quickly solvable ones—the data freeze deadline is coming soon).
It turns out that paragraphs are not always necessarily supersets of sentences. In some cases (bulleted list items), a new paragraph may start in the middle of a sentence. (That of course depends on how the two units are defined but I am not looking for any standardized definition on such a short notice. If you have paragraphs, you define what they are.)
As a result, paragraph boundaries should be marked at the token level. Document boundaries will be marked at the sentence level. My proposal:
A sentence that starts a new document carries the sentence-level comment # newdoc. (Nothing more. You can have a separate comment with a document id if you like, but I want to be able to recognize documents even if they don't have ids.) It is not necessary that the first sentence of a CoNLL-U file has the newdoc comment (e.g. if the document is split between dev and test data).
A token that starts a new paragraph carries NewPar=Yes in the MISC column. Usually this will also be the first token of a sentence. If it is a multi-word token, the attribute will appear in the line of the multi-word token, not in the line of its first syntactic word.
Specifically seeking feedback from those who indicated they have doc or par info available: @lauma @liljao @jnivre @arademaker @Kira-D @kajad @TomazErjavec @LarsAhrenberg @natko5 and from those who may have to deal with the data :-) @foxik @martinpopel @fginter @spyysalo
(If you already have the info in the data in some form, e.g. inferrable from sentence ids, and prefer me to extract it and convert it to the unified annotation, let me know.)
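As an illustration of the token-level part of the proposal, here is a short Python sketch that collects the tokens carrying NewPar=Yes in MISC. It assumes tab-separated CoNLL-U lines with ten columns; the function name paragraph_start_ids is hypothetical.

```python
def paragraph_start_ids(sentence_lines):
    """Return the ID column of every token line whose MISC column contains NewPar=Yes."""
    starts = []
    for line in sentence_lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a well-formed token line; ignored in this sketch
        misc = columns[9]
        if misc != "_" and "NewPar=Yes" in misc.split("|"):
            starts.append(columns[0])
    return starts
```

Because the attribute sits on the multi-word token line, a paragraph that starts at a multi-word token is reported with the range ID (for instance 2-3), not with the ID of its first syntactic word.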