Tokens dropped with quoted text

fmerhout commented 6 years ago

Thanks for writing this great package!

I am trying to parse tweets and came across an issue with the tagger when passing it quoted text. The tokens after and before the quotation marks are deleted in the tagging process.

Here is an example:

rdr_pos(rdr_model(language = "English", annotation = "UniversalPOS"), "Some guy asked -\"what is the issue\"")

The returned object is missing "what" and "issue".

For the time being, I am simply gsub'ing the \" but this would obviously be better addressed internal to the function.
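A minimal sketch of that workaround in base R. The helper name `strip_double_quotes` is hypothetical and not part of RDRPOSTagger; the cleaned string would then be passed to rdr_pos as in the example above.

```r
# Hypothetical helper (not part of RDRPOSTagger): remove double quotes
# from the text before handing it to the tagger.
strip_double_quotes <- function(x) {
  gsub("\"", "", x, fixed = TRUE)
}

tweet <- "Some guy asked -\"what is the issue\""
cleaned <- strip_double_quotes(tweet)
print(cleaned)
```

Note that this throws the quotation marks away entirely, so it only hides the tokenisation problem rather than solving it.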

jwijffels commented 6 years ago

Thanks for your interest in this package. May I ask why you prefer this package over the udpipe R package (https://cran.r-project.org/web/packages/udpipe/index.html), especially as you seem to work with English text? Note that rdr_pos requires input that is already tokenised. By default this tokenisation is done internally by the function rdr_add_space_around_punctuations, which might not be ideal for your case.

fmerhout commented 6 years ago

I am contributing to this repo and we are currently trying to extend the original functionality to include sentiment analysis, in order to identify speakers' positions on the things they discuss.

We did use cleanNLP (which has a udpipe backend option) in a previous iteration but have found it to be slow and very memory intensive, to the point that we experienced crashes when the data to be parsed would get too big, i.e. 100k+ tweets.

Another consideration is that we are hoping to reduce the required setup as much as possible to have a low entry barrier for those who just want to use the package but don't have a lot of background knowledge, and our own experience with setting up cleanNLP/udpipe was rather rocky.

jwijffels commented 6 years ago

Short answer to your problem: make sure the text is correctly tokenised before passing to the rdr_pos function and indicate add_space_around_punctuations FALSE.
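A rough sketch of what that could look like, assuming "correctly tokenised" means tokens separated by single spaces with punctuation split off. The helper `pretokenise` is hypothetical, and its regular expression is a simplification that would also split URLs, hashtags and mentions, so it is not a real tweet tokenizer. The rdr_pos call is shown as a comment only, with the argument name taken from the comment above.

```r
# Hypothetical helper: put spaces around every punctuation character,
# then collapse repeated whitespace so tokens are separated by single spaces.
pretokenise <- function(x) {
  x <- gsub("([[:punct:]])", " \\1 ", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

tweet <- "Some guy asked -\"what is the issue\""
tokenised <- pretokenise(tweet)
print(tokenised)

# The tokenised text could then be tagged along these lines
# (not run here; needs the RDRPOSTagger package and Java):
# tagger <- rdr_model(language = "English", annotation = "UniversalPOS")
# rdr_pos(tagger, tokenised, add_space_around_punctuations = FALSE)
```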

Long answer: I just checked and the reported issue really comes from the Java side, which requires correctly pretokenised text. This R package is just an R wrapper around https://github.com/datquocnguyen/RDRPOSTagger and it requires you to provide tokenised text. If the text is not correctly tokenised, as in this case, you can expect issues like the one you just encountered. If you are looking at this package for speed, it will probably not be the best option. Also keep in mind that the package is built on top of data from Universal Dependencies. There is no Twitter corpus for that, only regular sentences. Regarding udpipe, just use udpipe directly instead of using the cleanNLP wrapper.

My 5 cents of advice: use the tokenize_tweets function from the tokenizers package and next look at the parts 'doing only part of the annotation' and 'my text is already tokenised' at https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html#annotate_your_text

Interesting project, this textnets, by the way. I was thinking of doing something similar.
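As a sketch of the udpipe route suggested above: as far as I recall, the 'my text is already tokenised' part of the vignette expects one token per line for each document. The base R part below builds that format from a hand-written token list, which stands in for tokenizer output. The udpipe calls are left as comments and are an assumption to check against the vignette; the model path is a placeholder.

```r
# Hand-written tokens standing in for the output of a tweet tokenizer
# (illustrative only; nothing here calls the tokenizers package).
tokens <- list(
  doc1 = c("Some", "guy", "asked", "-", "\"", "what", "is", "the", "issue", "\"")
)

# One token per line, one string per document.
vertical <- vapply(tokens, paste, FUN.VALUE = character(1), collapse = "\n")
cat(vertical[["doc1"]], "\n")

# Not run: assumed udpipe usage, to be verified against the vignette.
# library(udpipe)
# model <- udpipe_load_model("path/to/english-model.udpipe")  # placeholder path
# anno <- udpipe_annotate(model, x = vertical, tokenizer = "vertical")
# as.data.frame(anno)
```

Keeping the tokens as they come from the tokenizer means the tagger output can be lined up with other token-level results, such as the tidytext output.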

fmerhout commented 6 years ago

Thank you for this advice! Given your previous question, I already suspected that I would need to look into the tokenization, also because I want to combine the rdr_pos output with tidytext output, which uses the tokenizers package and was also causing all sorts of trouble.

In any case, this is really great advice and much appreciated. Thank you!

fmerhout commented 6 years ago

Thanks again for this excellent advice! We just updated the key function in the repo in accordance with your suggestion and so far it looks very promising. Would be interesting to hear your thoughts.

jwijffels commented 6 years ago

Good that the advice helped!

bnosac / RDRPOSTagger