sentence demarcation in POS tagging fails with a period at times and not always.

sanjmeh commented 6 years ago

I have been trying to POS tag legal documents but at many places the udpipe R package breaks a sentence into 2 sentences when it encounters a period that is not actually the not the end of sentence marker. For example, if a sentence is

In Moti Laminates Pvt. Ltd. v. Collector of Central Excise, Ahmedabad 1995(76) E.L.T.241(SC) we get a clue of an important principle, namely, principle of equivalence .

we get the sentence broken after the 4th PUNCT token of "." Not the first. Which means there is some logic to take care of false sentence end of mark detection. It detected that PUNCT after token=18 is not an end of the sentence. Also after token = 'Pvt' it did not end the sentence. Then why did it end after token= 'v' ?

I suspected udpipe English model checks capitalisation of next token to decide sentence ending but that doesn't happen here.

Here is the document term matrix subsetted with above text of two sentences (which is actually one).

Could you suggest a way to avoid false detections like these?

Thank you very much.

xx.txt

jwijffels commented 6 years ago

I think you are misunderstanding how udpipe does tokenisation. The logic of the tokenisation and identification of sentences is explained in the paper http://ufal.mff.cuni.cz/%7Estraka/papers/2017-conll_udpipe.pdf. Let me copy the relevant parts here

Sentence segmentation and tokenization is performed jointly (as it was in UDPipe 1.0) using
a single-layer bidirectional GRU network which predicts for each character whether it is the last
one in a sentence, the last one in a token, or not the last one in a token. Spaces are usually not allowed
in tokens and therefore the network does not need to predict end-of-token before a space (it only
learns to separate adjacent tokens, like for example Hi! or cannot).

So sentence detection is done by a probabilistic logic, not some kind of deterministic logic, the GRU learner which is used to train tokenisation & sentence segmentation for the English model on conllu data has the following parameters https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-train.html#settings_for_the_tokenizer So the answer to your question might be one of the following ones

Use a udpipe model which was trained on data which is similar as your data (data sources used for UDPipe models are explained at http://universaldependencies.org for your language of choice)
Train a model on your own tagged data.
Preprocess your text with your own tokeniser to feed it into udpipe_annotate or postprocess the output with your extra business logic
Change the setting of the tokeniser (e.g. the dimension parameter) which was used to build the udpipe model. Training your own model is explained at https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-train.html

sanjmeh commented 6 years ago

Thanks Jan for the detailed response. Appreciate it. Am reading the links you pointed me to.

jwijffels commented 6 years ago

Closing this. If you want to reopen, feel free.

bnosac / udpipe

sentence demarcation in POS tagging fails with a period at times and not always. #19