UniversalDependencies / UD_Swedish-Talbanken

Swedish data

Non-segmented sentence #5

Closed: erickrf closed this issue 5 years ago

erickrf commented 5 years ago

I found that sentence sv-ud-train-3727, with 296 tokens (!), was not segmented -- it is actually several sentences concatenated.
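
For anyone who wants to look for similar cases, here is a minimal sketch of how unusually long sentences could be flagged in a CoNLL-U file. It assumes the standard layout (blank line between sentences, `# sent_id = ...` comment lines, tab-separated token lines with the ID in the first column). The function names and the threshold are illustrative, not part of any UD tool, and only plain integer IDs are counted, so the numbers may differ slightly from other ways of counting tokens.

```python
from typing import Iterator, List, Tuple


def sentence_lengths(conllu_text: str) -> Iterator[Tuple[str, int]]:
    """Yield (sent_id, word_count) for each sentence in CoNLL-U text.

    Hypothetical helper: counts only lines whose ID is a plain integer,
    skipping multiword-token ranges (e.g. "1-2") and empty nodes ("1.1").
    """
    sent_id, count, seen = "", 0, False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if seen:
                yield sent_id, count
            sent_id, count, seen = "", 0, False
            continue
        seen = True
        if line.startswith("#"):
            if line.startswith("# sent_id"):
                sent_id = line.partition("=")[2].strip()
            continue
        token_id = line.split("\t", 1)[0]
        if token_id.isdigit():
            count += 1
    # Handle a final sentence that is not followed by a blank line.
    if seen:
        yield sent_id, count


def long_sentences(conllu_text: str, threshold: int = 100) -> List[Tuple[str, int]]:
    """Return (sent_id, word_count) pairs with more than `threshold` words."""
    return [(sid, n) for sid, n in sentence_lengths(conllu_text) if n > threshold]
```

To use it, read the training file into a string (the path depends on the local checkout) and call `long_sentences(text, threshold=200)`; each returned pair gives a sentence ID and its word count for manual inspection.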

jnivre commented 5 years ago

Thanks for reporting this. However, it is in fact a single, very long sentence with quite a complex internal structure. First of all, it contains a very long list of appositions, each with its own complex internal structure. In addition, other sentences are inserted parenthetically in the form of footnotes. As a result, there is no trivial way to segment it into several sentences without breaking real syntactic relations between segments. We have therefore chosen to preserve the segmentation inherited from the original treebank, released in the 1970s. A possible alternative would be to extract the parenthetical insertions and annotate them as separate sentences, but then the sentences in the treebank would no longer occur in the same order as in the original text.