cran / udpipe

This is a read-only mirror of the CRAN R package repository. udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit. Homepage: https://bnosac.github.io/udpipe/en/index.html, https://github.com/bnosac/udpipe

keywords_phrases has broken in 0.5 #1

Closed sanjmeh closed 6 years ago

sanjmeh commented 6 years ago

I am running the same code on the same data, side by side, on two machines.

One machine has udpipe 0.4 and the other has udpipe 0.5.

The keywords_phrases() function is broken on 0.5 when is_regex=T is used.

Consider the example from your help documentation:

data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language %in% "fr")
np <- keywords_phrases(x$xpos, pattern = c("DT", "NN", "VB", "RB", "JJ"), sep = "-")
head(np)

The above should work in both 0.4 & 0.5.

Now consider the same example, but with the function called with is_regex=T:

np <- keywords_phrases(x$xpos, pattern = c("DTNNVBRBJJ"), term = x$token, is_regex = TRUE)
head(np)
# [1] keyword ngram   pattern start   end    
# <0 rows> (or 0-length row.names)

I tried many regular expressions, even one as simple as pattern = "DTJJ", but none of them returns a match. It seems the regex option does not work.

I have also confirmed that regex works on the machine (an Ubuntu server) by testing the grep family of functions in R. So regex fails only in the udpipe function.
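
For illustration, a sanity check of this kind could look like the minimal sketch below. It uses only base R; the tag vector is invented for the example and is not taken from brussels_reviews_anno, and the variable names are hypothetical.

```r
# Hypothetical POS tag sequence, invented for illustration only.
tags <- c("DT", "JJ", "NN", "VB", "RB")

# Collapse the tags into one string so a regex can scan across tag boundaries.
collapsed <- paste(tags, collapse = "")

# Base R regex functions from the grep family.
has_match <- grepl("DTJJ", collapsed)    # is the pattern present at all?
match_pos <- regexpr("DTJJ", collapsed)  # where does the first match start?

print(has_match)
print(as.integer(match_pos))
print(attr(match_pos, "match.length"))
```

If this reports a match while keywords_phrases() with is_regex=T returns zero rows, the problem is more likely in how the function handles the pattern than in the regex engine on the machine.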

gaborcsardi commented 6 years ago

Hi, this is a read-only mirror of CRAN. Please contact the package authors listed in the DESCRIPTION file.