bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

sentence demarcation in POS tagging fails with a period at times and not always. #19

Closed sanjmeh closed 6 years ago

sanjmeh commented 6 years ago

I have been trying to POS tag legal documents, but in many places the udpipe R package breaks a sentence into two sentences when it encounters a period that is not actually an end-of-sentence marker. For example, if a sentence is

  1. In Moti Laminates Pvt. Ltd. v. Collector of Central Excise, Ahmedabad 1995(76) E.L.T.241(SC) we get a clue of an important principle, namely, principle of equivalence .

we get the sentence broken after the 4th PUNCT token of ".", not the first. This means there is some logic to handle false detection of sentence-end marks: it detected that the PUNCT after token 18 is not the end of the sentence, and after the token 'Pvt' it also did not end the sentence. Then why did it end the sentence after the token 'v'?

I suspected the udpipe English model checks the capitalisation of the next token to decide where a sentence ends, but that does not seem to be what happens here.

Here is the document-term matrix subsetted to the text above, which is split into two sentences (although it is actually one).

Could you suggest a way to avoid false detections like these?
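One possible workaround (not part of udpipe itself; the abbreviation list and the `<DOT>` placeholder below are my own assumptions) is to protect known abbreviations before annotation and restore the periods afterwards:

```r
# Sketch of a pre-processing workaround: temporarily replace the period in
# known abbreviations so the tokenizer does not see them as sentence-ending
# punctuation. The abbreviation list here is illustrative only.
protect_abbreviations <- function(x, abbrevs = c("Pvt", "Ltd", "v")) {
  for (a in abbrevs) {
    # e.g. "Pvt." -> "Pvt<DOT>"
    x <- gsub(paste0("\\b", a, "\\."), paste0(a, "<DOT>"), x)
  }
  x
}

restore_abbreviations <- function(x) {
  gsub("<DOT>", ".", x, fixed = TRUE)
}

txt <- "In Moti Laminates Pvt. Ltd. v. Collector of Central Excise"
protected <- protect_abbreviations(txt)
# -> "In Moti Laminates Pvt<DOT> Ltd<DOT> v<DOT> Collector of Central Excise"
# Annotate the protected text (e.g. with udpipe_annotate), then restore the
# periods in the resulting token column with restore_abbreviations().
```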

Thank you very much.

xx.txt

jwijffels commented 6 years ago

I think you are misunderstanding how udpipe does tokenisation. The logic of tokenisation and sentence identification is explained in the paper http://ufal.mff.cuni.cz/%7Estraka/papers/2017-conll_udpipe.pdf. Let me copy the relevant part here:

Sentence segmentation and tokenization is performed jointly (as it was in UDPipe 1.0) using
a single-layer bidirectional GRU network which predicts for each character whether it is the last
one in a sentence, the last one in a token, or not the last one in a token. Spaces are usually not allowed
in tokens and therefore the network does not need to predict end-of-token before a space (it only
learns to separate adjacent tokens, like for example Hi! or cannot).

So sentence detection is done by probabilistic logic, not some kind of deterministic logic. The GRU learner used to train tokenisation and sentence segmentation for the English model on CoNLL-U data has the parameters documented at https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-train.html#settings_for_the_tokenizer. So the answer to your question might be one of the following ones
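If the pretrained English model segments your legal-domain text poorly, one option is to retrain the tokenizer on your own annotated CoNLL-U data with `udpipe_train`. A minimal sketch, assuming placeholder file names and tokenizer settings in the style of the vignette linked above (this is not a recommendation of specific hyperparameter values):

```r
library(udpipe)
## Sketch only: the .conllu file paths are placeholders for your own
## gold-annotated training data; the tokenizer settings mirror the kind of
## parameters described in the udpipe-train vignette.
m <- udpipe_train(
  file                  = "custom-english.udpipe",
  files_conllu_training = "train.conllu",    # placeholder
  files_conllu_holdout  = "holdout.conllu",  # placeholder
  annotation_tokenizer  = list(dimension = 24, epochs = 100, batch_size = 50,
                               learning_rate = 0.005, dropout = 0.1),
  annotation_tagger     = "none",            # train only the tokenizer here
  annotation_parser     = "none")
```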

sanjmeh commented 6 years ago

Thanks Jan for the detailed response, I appreciate it. I am reading the links you pointed me to.

jwijffels commented 6 years ago

Closing this. If you want to reopen, feel free.