bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

How to deal with paragraphs? #15

Closed: mustaszewski closed this issue 6 years ago

mustaszewski commented 6 years ago

The annotation output contains a column for identifying the paragraph. In what format should paragraph boundaries be encoded in the input character vectors? I have not found any information on this in the documentation. As a result, every multi-sentence character vector I pass is parsed as a single paragraph.

jwijffels commented 6 years ago

That is explained in the paper describing how the models learn to do the annotation. The paper is referenced in the README as well as in the package description at https://cran.r-project.org/web/packages/udpipe/index.html. See http://dx.doi.org/10.18653/v1/K17-3009. I'm quoting from the paper's section 'Documents & Paragraphs':

We use an improved sentence segmenter in UDPipe 1.1 Baseline System. The segmenter learns sentence boundaries in the text in a standard way as in UDPipe 1.1 Baseline System, but it omits the sentence breaks at the end of a paragraph or a document. The reason for excluding these boundaries from the training data is that the ends of paragraphs and documents are frequently recognized by layout (e.g. newspaper headlines) and if the recognizer is trained to recognize these sentence breaks, it tends to erroneously split regular sentences. Additionally, we now also mark paragraph boundaries (recognized by empty lines) and document boundaries (corresponding to files being processed, storing file names as document ids) when running the segmenter.

So the short answer is: an empty line (\n\n) in the input text identifies a paragraph boundary.
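For illustration, here is a minimal sketch of how paragraph-separated input could be built and annotated in R. The helper names join_paragraphs and annotate_text are hypothetical (they are not part of udpipe), the example sentences are made up, and the "english" model is only an assumption; the annotation step needs the udpipe package and downloads a model, so it is left as a commented-out call.

```r
# Hypothetical helper: join paragraphs with an empty line ("\n\n"),
# which is what the segmenter treats as a paragraph boundary.
join_paragraphs <- function(paragraphs) {
  paste(paragraphs, collapse = "\n\n")
}

# Made-up example text: two paragraphs, three sentences in total.
paragraphs <- c(
  "This is the first paragraph. It has two sentences.",
  "This is the second paragraph."
)
txt <- join_paragraphs(paragraphs)

# Hypothetical wrapper around the udpipe calls; the language is an assumption.
# Requires the udpipe package and network access to download the model.
annotate_text <- function(txt, language = "english") {
  model_info <- udpipe::udpipe_download_model(language = language)
  model <- udpipe::udpipe_load_model(file = model_info$file_model)
  as.data.frame(udpipe::udpipe_annotate(model, x = txt))
}

# Not run here:
# anno <- annotate_text(txt)
# unique(anno[, c("paragraph_id", "sentence_id", "sentence")])
```

If the boundary is picked up as described in the paper, the paragraph_id column should take more than one value for this input.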

mustaszewski commented 6 years ago

Great, thank you very much. Makes perfect sense!