cjhutto / vaderSentiment

VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.
MIT License
4.38k stars 1k forks

Special case huge performance regression in 3.3.1+ #110

Open dandelionred opened 4 years ago

dandelionred commented 4 years ago

I've been processing random comments from social media and noticed some strange spikes in the processing-time logs. A chunk of data generally takes less than a second to process, but on the plot here you can see a dot approaching 10 minutes!

Screenshot from 2020-08-18 23:53:48

I traced the slowdown to vader 3.3.1+: it sits at 100% CPU usage on texts with a huge number of emoticons.

Test script vader.py

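#!/usr/bin/env python3

import json
import sys

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

si = SentimentIntensityAnalyzer()

# Each stdin line is parsed as JSON and the decoded value is passed to VADER as the text.
for line in sys.stdin:
    text = json.loads(line)
    # Print the scores as one JSON object per line, flushing so output appears immediately.
    print(json.dumps(si.polarity_scores(text), sort_keys=True), flush=True)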

Sample input slow.json https://pastebin.com/nxjLSTMQ

vader 3.2.1:

$ time ./vader.py < slow.json 
{"compound": 0.8955, "neg": 0.027, "neu": 0.913, "pos": 0.06}

real    0m0.182s
user    0m0.168s
sys     0m0.008s

vader 3.3.1+:

$ time ./vader.py < slow.json 
{"compound": 1.0, "neg": 0.218, "neu": 0.345, "pos": 0.437}

real    0m50.914s
user    0m48.588s
sys     0m2.328s
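
For anyone who wants to look for similar inputs in their own data, here is a minimal sketch of a per-text timing check. It is not part of vader: `find_slow_inputs`, its `score` argument and the `threshold_s` default are hypothetical names chosen for illustration, and `score` can be any callable that takes a string.

```python
import time


def find_slow_inputs(score, texts, threshold_s=1.0):
    """Return (index, elapsed_seconds) for every text that takes longer than threshold_s to score.

    `score` is any callable accepting one string (hypothetical helper, not a vader API).
    """
    slow = []
    for i, text in enumerate(texts):
        start = time.perf_counter()
        score(text)  # the result is discarded; only the duration matters here
        elapsed = time.perf_counter() - start
        if elapsed > threshold_s:
            slow.append((i, elapsed))
    return slow
```

With vader installed, passing `SentimentIntensityAnalyzer().polarity_scores` as `score` should flag texts like the sample above, assuming the threshold is set well above the usual per-text time.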

The input sample is not an artificial joke, by the way. Here are samples of what supposedly real people post on Reddit, and this is the kind of stuff I feed to vader: