bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

In keywords_phrases() function the is_regex=T option has broken in 0.5 #20

Closed by sanjmeh 6 years ago

sanjmeh commented 6 years ago

I am running side by side the same code, same data on two machines.

One is on udpipe 0.4 and the other on udpipe 0.5 version.

The keywords_phrases() function is broken on 0.5 if we use is_regex=T

Consider the sample example in your help document.

data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language %in% "fr")
np <- keywords_phrases(x$xpos, pattern = c("DT", "NN", "VB", "RB", "JJ"), sep = "-")
head(np)

The above should work in both 0.4 & 0.5.

Now consider the same example but with the function executed with is_regex=T

np <- keywords_phrases(x$xpos, pattern = c("DTNNVBRBJJ"), term = x$token,is_regex=T)
head(np)
# [1] keyword ngram   pattern start   end    
# <0 rows> (or 0-length row.names)

I tried with many regex, even as simple as just pattern = "DTJJ" but none works. It seems the regex option does not work.

I have also tested that regex works on the machine (an Ubuntu server) by trying out the grep family of commands in R. So regex fails only in the udpipe function.

jwijffels commented 6 years ago

Thanks for reporting. I tested this on Windows and np <- keywords_phrases(x$xpos, pattern = c("DTNNVBRBJJ"), term = x$token,is_regex=T) gave 72 results on version 0.4 as well as on version 0.5. Also, the function has not changed between version 0.4 and version 0.5 of this R package.

So this seems to be Linux specific. The regular expression matching uses <regex> from C++11, and a working implementation of it was only released in gcc 4.9.0. Which version of gcc do you have on your machine (what does gcc --version show)?

jwijffels commented 6 years ago

Update. I checked this on Ubuntu 14.04 with gcc 4.8.4 and indeed np <- keywords_phrases(x$xpos, pattern = c("DTNNVBRBJJ"), term = x$token,is_regex=T) did not return anything, while on Ubuntu 16.04 with gcc 5.4.0 everything works fine and the same call returns 72 rows. Solution: make sure you have gcc 4.9.0 or later (see also: https://stackoverflow.com/questions/12530406/is-gcc-4-8-or-earlier-buggy-about-regular-expressions)
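For anyone wanting to check this from inside R, here is a minimal sketch of the version comparison described above. The helper gcc_supports_regex() is hypothetical (it is not part of udpipe) and it assumes the first line of gcc --version looks like the outputs quoted in this thread. Note that R's own grep family uses R's bundled regex engines rather than the C++ <regex> header, so a working grep() in R says nothing about the compiler.

```r
# Hypothetical helper (not part of udpipe): pull the compiler version out of
# the first line of `gcc --version` and compare it with 4.9.0, the first gcc
# release mentioned in this thread as having a working <regex>.
gcc_supports_regex <- function(version_line, minimum = "4.9.0") {
  # Drop the parenthesised distribution tag, e.g. "(Ubuntu 4.9.4-2ubuntu1~14.04.1)"
  cleaned <- sub("\\([^)]*\\)", "", version_line)
  # Take the first dotted number that remains, e.g. "4.9.4"
  found <- regmatches(cleaned, regexpr("[0-9]+(\\.[0-9]+)+", cleaned))
  if (length(found) == 0) return(NA)  # no version number recognised
  numeric_version(found) >= numeric_version(minimum)
}

# Version line reported later in this thread
gcc_supports_regex("gcc (Ubuntu 4.9.4-2ubuntu1~14.04.1) 4.9.4")
# Illustrative line for an older compiler
gcc_supports_regex("gcc (GCC) 4.8.4")

# Check the gcc found on the PATH, if there is one
if (nzchar(Sys.which("gcc"))) {
  print(gcc_supports_regex(system2("gcc", "--version", stdout = TRUE)[1]))
}
```

This only inspects the gcc on the PATH. It does not tell you which compiler an already installed package was built with.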

sanjmeh commented 6 years ago

Upgraded to gcc 4.9 (earlier it was 4.8.5).

Current version:

gcc --version
gcc (Ubuntu 4.9.4-2ubuntu1~14.04.1) 4.9.4

But the problem persists.

> library(udpipe)
....
> np <- keywords_phrases(x$xpos, pattern = c("DT"), term = x$token,is_regex = T)
> np
# [1] keyword ngram   pattern start   end    
# <0 rows> (or 0-length row.names)

As for how I upgraded gcc, here are the steps, in case you can point out a mistake somewhere. I appreciate that this is beyond the scope of the udpipe package, but it may save other udpipe users on Ubuntu 14.04 or with gcc 4.8 or lower from getting frustrated.

I have an Ubuntu 14.04 machine. I followed these instructions to update and it completed successfully. As a last step, I found it necessary to change the symbolic link /usr/bin/g++ from a target of /usr/bin/g++-4.8 to a target of /usr/bin/g++-4.9

I also checked the gcc version: it shows 4.9, but the regex option still returns no results.

jwijffels commented 6 years ago

Have you re-installed the udpipe package after you upgraded gcc? Please do.
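The reinstall matters because on Linux the package's C++ code is compiled when the package is installed, so a copy installed before the gcc upgrade was still built with the old compiler. A rough sketch of the reinstall-and-verify step follows. The helper regex_phrases_work() is hypothetical (not part of udpipe) and simply reruns the example from this thread.

```r
# Step 1 (run once, then restart R): rebuild udpipe from source so that its
# C++ code is compiled with the upgraded compiler.
# install.packages("udpipe", type = "source")

# Step 2: hypothetical helper (not part of udpipe) that reruns the example
# from this thread and reports whether the regex pattern matched anything.
regex_phrases_work <- function() {
  if (!requireNamespace("udpipe", quietly = TRUE)) {
    return(NA)  # udpipe is not installed, nothing to check
  }
  # Load the example data into this function's environment
  data("brussels_reviews_anno", package = "udpipe", envir = environment())
  x <- subset(brussels_reviews_anno, language %in% "fr")
  np <- udpipe::keywords_phrases(x$xpos, pattern = "DTNNVBRBJJ",
                                 term = x$token, is_regex = TRUE)
  nrow(np) > 0
}

regex_phrases_work()
```

If this returns FALSE after a reinstall, R may still be building packages with the old compiler.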

sanjmeh commented 6 years ago

Yes indeed, I had not reinstalled udpipe. Finally, it works. Thank you so much. Now I can hope to load my NLP packages online on Ubuntu and share with people for annotating manually and displaying processed text using shinydashboards or flexdashboards. I am closing this issue now. Thanks a lot.

jwijffels commented 6 years ago

Feel free to share shinydashboards & flexdashboards. That would be interesting!

sanjmeh commented 6 years ago

@jwijffels : could you pls share your email id? Don't know how to communicate with you when there's no issue I have to report.

jwijffels commented 6 years ago

You can find my email here: https://github.com/bnosac/udpipe/blob/master/DESCRIPTION