bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Sentence Splitting issue #123

Open lucadaniello opened 5 months ago

lucadaniello commented 5 months ago

Hi! I would like to ask you something about the splitting of text into sentences during the annotation phase.

I assumed that sentences were split at the dots that end them, but that is not always the case. Sometimes the split happens at a ":" or at a term in uppercase.

I would like to ask:

  1. What is the rule for sentence splitting?
  2. Is it possible to set the separator? For instance, split a sentence only when a dot is found.

I’m using the udpipe package in R. Below is an example text where I find that sentences are separated by an uppercase term:

library(udpipe)
model <- udpipe_download_model(language = "english")
txt <- c("No previous study has investigated the influence of governance and organizational AHCs configurations on the productivity and scientific impact of AHCs.")
df <- udpipe(txt, object = udpipe_load_model(model$file_model))

Thank you!!

jwijffels commented 5 months ago

Sentence splitting is based on a statistical classification model trained on CoNLL-U data from Universal Dependencies. For each character in the text, it predicts whether a new sentence starts there, given the surrounding context. There is no separator setting to change. If you want to split differently, you could use udpipe::strsplit.data.frame or strsplit from base R to define your own hardcoded sentence splitting criteria.
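
As a rough illustration of the base R route, here is a minimal sketch that splits only where a dot is followed by whitespace. The helper name split_on_dots and the second sentence in the example text are made up for illustration; they are not part of udpipe or of the original report.

```r
# Hypothetical helper (not part of udpipe): split text into sentences only
# where a dot is followed by whitespace. Colons and uppercase terms never
# trigger a split.
split_on_dots <- function(x) {
  # The lookbehind keeps the dot attached to the sentence it ends;
  # only the whitespace after the dot is consumed by the split.
  parts <- strsplit(x, "(?<=\\.)\\s+", perl = TRUE)
  data.frame(
    doc_id      = rep(seq_along(x), lengths(parts)),
    sentence_id = sequence(lengths(parts)),
    sentence    = unlist(parts),
    stringsAsFactors = FALSE
  )
}

# First sentence is from the report above; the second is an invented example.
txt <- c(paste(
  "No previous study has investigated the influence of governance and",
  "organizational AHCs configurations on the productivity and scientific",
  "impact of AHCs. A second sentence follows: it contains a colon."
))

sentences <- split_on_dots(txt)
print(sentences$sentence)
```

A rule like this will also split after abbreviations such as "e.g." or "Dr.", so the regular expression probably needs tuning for real text. Also note that if each resulting sentence is then passed to udpipe() as a separate document, the model's tokenizer may still segment further inside it.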