cran / RSentiment

:exclamation: This is a read-only mirror of the CRAN R package repository. RSentiment — Analyse Sentiment of English Sentences

Tagger is very slow #1

Open · trinker opened this issue 7 years ago

trinker commented 7 years ago

I am the author of the sentimentr package and have been benchmarking various sentiment packages against each other for accuracy and speed for about a year now. Your package intrigued me when it came out last year because you were attempting things beyond a simple lookup like my own sentimentr package, but with different approaches. Last year I benchmarked RSentiment and it had decent accuracy and speed (these two things are a trade-off). However, I am running some tests now and see that RSentiment is too slow to be a viable solution for medium-to-large text. It is even slower than using R to run Stanford's coreNLP, which I consider the most accurate of the non-machine-learning approaches.

I looked at the code and see you're using parts of speech. Smart; I've thought about this myself. The problem is that your implementation runs the tagger over every sentence individually instead of over all the text at once, as the tagger is intended to be used. Could the POS tagger be rewritten to operate on all the text at once rather than iterating over every element of the text vector?

Unit: milliseconds
                   expr         min          lq        mean      median          uq         max neval
    sentimentr_hu_liu()    193.3937    196.0196    199.0491    198.6454    201.8769    205.1083     3
 sentimentr_sentiword()    775.2589    779.8688    877.8642    784.4786    929.1668   1073.8550     3
           RSentiment() 126609.2093 126888.5834 127266.5046 127167.9574 127595.1522 128022.3469     3
    SentimentAnalysis()   2481.5213   2515.4932   2563.7821   2549.4652   2604.9125   2660.3598     3
      syuzhet_syuzhet()    529.3977    533.6333    538.4884    537.8689    543.0338    548.1987     3
         syuzhet_binn()    370.2313    370.5579    378.7850    370.8845    383.0619    395.2394     3
          syuzhet_nrc()    702.2139    819.3012    878.6898    936.3885    966.9277    997.4669     3
        syuzhet_afinn()    128.5408    137.1374    160.7802    145.7341    176.9000    208.0659     3
             stanford()  25243.6973  25759.9738  25994.3613  26276.2504  26369.6933  26463.1361     3
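
For context, here is a minimal sketch of the batch-tagging idea described above, assuming the NLP/openNLP Maxent annotators commonly used for POS tagging in R (the helper `pos_tag_all()` below is illustrative, not RSentiment's actual internals). The point is that the annotators are built once and the whole text is annotated in a single pass, instead of paying the annotator set-up cost for every sentence.

```r
# Illustrative sketch only: batch POS tagging with NLP/openNLP,
# not RSentiment's actual implementation.
library(NLP)
library(openNLP)

pos_tag_all <- function(sentences) {
  # Concatenate the whole character vector into one String object.
  txt <- as.String(paste(sentences, collapse = " "))

  # Build each annotator once, not once per sentence.
  sent_ann <- Maxent_Sent_Token_Annotator()
  word_ann <- Maxent_Word_Token_Annotator()
  pos_ann  <- Maxent_POS_Tag_Annotator()

  # Annotate the full text in one pass: sentences, then words, then POS tags.
  a <- annotate(txt, list(sent_ann, word_ann))
  a <- annotate(txt, pos_ann, a)

  # Keep the word-level annotations and pull out their POS features.
  words <- subset(a, type == "word")
  data.frame(
    token = as.character(txt[words]),
    pos   = vapply(words$features, `[[`, character(1), "POS"),
    stringsAsFactors = FALSE
  )
}
```

Tokens can still be mapped back to the original vector elements via the sentence annotations' spans; the win is that the Maxent models are loaded and invoked once for the whole input rather than once per sentence.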
gaborcsardi commented 7 years ago

Hi, this is a read-only mirror of CRAN; please send your comments to the package authors. Thanks!

sefabey commented 7 years ago

Came here to say the same thing. I just tried the package on very short sentences and can confirm it is very slow: sentences of 7 words take around 2 seconds to score. I cannot imagine applying the sentiment-scoring function to a dataset with 100K rows. The package needs attention, I presume.
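
For anyone wanting to reproduce this kind of per-sentence timing, a quick sketch, assuming RSentiment's exported calculate_score() (the example sentence is arbitrary; adjust if the API differs):

```r
# Rough timing check on a single short sentence
# (assumes RSentiment::calculate_score(); adjust if the API differs).
library(RSentiment)
library(microbenchmark)

sentence <- "I really like this package but it is quite slow"

microbenchmark(
  RSentiment = calculate_score(sentence),
  times = 5
)
```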

SubhasreeBose commented 7 years ago

Hi! I only noticed today the issues mentioned on the read-only mirror of CRAN. I wrote my package inspired by "sentimentr" itself, and decided to experiment with parts of speech in this package. However, the running time of the package is not very good, and I am working on improving it. I am completely new to R and to writing packages, so it is taking me time to improve the algorithm. I will try to rewrite the POS tagger as suggested by @trinker.

trinker commented 7 years ago

@SubhasreeBose Per @gaborcsardi's request (I posted here in error), could we move this conversation to your own repo? Do you have a GitHub repository of your own?

SubhasreeBose commented 7 years ago

@trinker Here is my GitHub repository, which I created just now: https://github.com/SubhasreeBose/RSentiment