cjhutto / vaderSentiment

VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.
MIT License

Compound score diverges for long sequences #151

Open VincentGurgul opened 2 months ago

VincentGurgul commented 2 months ago

The compound score has a serious flaw – it diverges for long sequences. Example:

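from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
polarity_scores = SentimentIntensityAnalyzer().polarity_scores
polarity_scores('bad good')
>>> {'neg': 0.547, 'neu': 0.0, 'pos': 0.453, 'compound': -0.1531}
polarity_scores('bad good bad good bad good bad good bad good bad good bad good bad good bad good bad')
>>> {'neg': 0.547, 'neu': 0.0, 'pos': 0.453, 'compound': -0.8979}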

It seems that the 'neg' and 'pos' scores are averages, whereas the 'compound' score is some sort of sum. As a result, the compound score tends toward extreme values for long sequences, such as Reddit posts or news articles.

This is particularly unfortunate, since many beginners will use the compound score blindly, without noticing this behavior, and get discouraged by the poor results. I suggest either replacing the current implementation of the compound score with compound = pos - neg or removing it completely.
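As a rough sketch of the suggested replacement, the score can be computed from the dictionary that `polarity_scores` already returns, without changing the library. The function name `balanced_score` is hypothetical and only used here for illustration:

```python
def balanced_score(scores):
    """Length-independent alternative: difference of the pos and neg shares.

    `scores` is a dict with 'pos' and 'neg' keys, shaped like the output
    of polarity_scores shown above. The result lies in [-1, 1].
    """
    return round(scores['pos'] - scores['neg'], 4)


# Output dict for 'bad good', copied from the example above.
example = {'neg': 0.547, 'neu': 0.0, 'pos': 0.453, 'compound': -0.1531}
print(balanced_score(example))
```

Since 'pos' and 'neg' are proportions, this value depends on the mix of positive and negative words rather than on how many words there are.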