Sentence Splitting issue

bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit

Mozilla Public License 2.0


Hi! I would like to ask you something about the splitting of text into sentences during the annotation phase.

I assumed that sentences were split at the dot that ends them, but that is not always the case. Sometimes the split happens at a ":" or at a term written in uppercase.

I would like to ask:

What is the rule for sentence splitting?
Is it possible to set the separator? For instance, split a sentence only when a dot is found.
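
To make the second question concrete, here is a minimal base R sketch of the dot-only splitting I have in mind. Note that `split_on_dots` is a hypothetical helper written for illustration, not a udpipe function, and the sample text is made up:

```r
# Hypothetical helper (not part of udpipe): split a single string into
# sentences only at whitespace that directly follows a dot.
split_on_dots <- function(x) {
  # The lookbehind keeps the dot attached to the sentence it ends.
  pieces <- strsplit(x, "(?<=\\.)\\s+", perl = TRUE)[[1]]
  trimws(pieces)
}

# Made-up sample text: the ":" and the uppercase term "AHCs" should
# not trigger a split, only the dots should.
sample_txt <- "Results: we compared AHCs across regions. No previous study has done this."
sentences <- split_on_dots(sample_txt)
print(sentences)
```

A rule this naive would also split after abbreviations such as "e.g.", so it only illustrates the behaviour I am asking about. Whether udpipe can be configured to split this way is exactly my question.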

I’m using the udpipe package in R. Below is an example text where I find that sentences are separated by an uppercase term:

library(udpipe)
model <- udpipe_download_model(language = "english")
txt <- c("No previous study has investigated the influence of governance and organizational AHCs configurations on the productivity and scientific impact of AHCs.")
df <- udpipe(txt, object = udpipe_load_model(model$file_model))

Thank you!!


Sentence Splitting issue #123