cjhutto / vaderSentiment

VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.
MIT License

Duplicates in dictionary: double entries with different sentiment values #122

Open chris31415926535 opened 3 years ago

chris31415926535 commented 3 years ago

Thanks for making VADER. I'm working on another port and am having a blast.

In the most recent version of vader_lexicon.txt, several words/emojis have two entries with different sentiment values. This is a potential source of bugs and of inconsistencies between ports. I've listed them below with their line numbers in vader_lexicon.txt, the word, and the sentiment values.

It looks like the Python version of VADER takes the last value it finds. For example, "lol" has two sentiment values: +2.9 at line 305 and +1.8 at line 4406. To reproduce the output for test sentence 13 from the main Readme (copied below), I need to assign "lol" a sentiment of 1.8.

Today only kinda sux! But I'll get by, lol----------------------- {'pos': 0.317, 'compound': 0.5249, 'neu': 0.556, 'neg': 0.127}
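
For context, here is a minimal sketch of loading a tab-separated lexicon like vader_lexicon.txt into a Python dict (a simplification, not necessarily the library's exact code): because assigning to the same dict key twice keeps the last assignment, the last value in the file silently wins, which matches the behaviour above.

```python
# Minimal sketch (not the library's actual code) of loading a tab-separated
# lexicon into a dict. Assigning to the same key twice keeps the last value,
# which is why "lol" ends up at 1.8 rather than 2.9.
def make_lex_dict(path="vader_lexicon.txt"):
    lex_dict = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            word, measure = line.strip().split("\t")[0:2]
            lex_dict[word] = float(measure)
    return lex_dict
```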

I see three main options:

  1. Leave it as-is. This seems least desirable, since it leads to unpredictable and potentially inconsistent behaviour across instantiations.
  2. Update the dictionary to match the current behaviour by removing the second instance of each of the 14 words below. This would be easy, but the potential downside is that some of the differences are big: e.g. "d:" has one positive and one negative instance, and the larger value for "sob" is more than double the smaller one.
  3. Update the dictionary to match your intuition. A case-by-case approach wouldn't take long since there are only 14 instances, and a standard approach (e.g. averaging the two values, as in the sketch below) would also be simple.
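
If averaging is the preferred route, a one-pass cleanup script would be enough. A rough sketch, assuming the usual four-column layout of vader_lexicon.txt (token, mean sentiment, std. dev., raw ratings); the function name and output path are placeholders:

```python
from collections import defaultdict

# Rough sketch of the averaging approach from option 3: collapse duplicate
# tokens into a single line whose sentiment is the mean of the duplicates,
# keeping the remaining columns (std. dev., raw ratings) from the first
# occurrence.
def dedupe_lexicon(src="vader_lexicon.txt", dst="vader_lexicon_deduped.txt"):
    values = defaultdict(list)
    extras = {}
    order = []
    with open(src, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            parts = line.rstrip("\n").split("\t")
            word = parts[0]
            if word not in extras:
                order.append(word)
                extras[word] = parts[2:]
            values[word].append(float(parts[1]))
    with open(dst, "w", encoding="utf-8") as f:
        for word in order:
            mean = sum(values[word]) / len(values[word])
            f.write("\t".join([word, str(round(mean, 2))] + extras[word]) + "\n")
```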

Obviously it's your call, but I didn't see this in any other Issues or Pull Requests so I wanted to surface it. I'm happy to chat or help in any way I can.

| line number | word | sentiment |
| --- | --- | --- |
| 120 | :-p | 1.2 |
| 124 | :-p | 1.5 |
| 227 | d: | -2.9 |
| 1740 | d: | 1.2 |
| 230 | d= | -3 |
| 1741 | d= | 1.5 |
| 234 | fav | 2.4 |
| 2831 | fav | 2 |
| 301 | lmao | 2 |
| 4399 | lmao | 2.9 |
| 305 | lol | 2.9 |
| 4406 | lol | 1.8 |
| 320 | muah | 2.8 |
| 4730 | muah | 2.3 |
| 342 | o.o | -0.6 |
| 4853 | o.o | -0.8 |
| 352 | ok | 1.6 |
| 4895 | ok | 1.2 |
| 385 | sob | -2.8 |
| 6188 | sob | -1 |
| 411 | x-d | 2.7 |
| 7489 | x-d | 2.6 |
| 412 | x-p | 1.8 |
| 7490 | x-p | 1.7 |
| 413 | xd | 2.7 |
| 7491 | xd | 2.8 |
| 417 | xp | 1.2 |
| 7492 | xp | 1.6 |
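
For anyone who wants to re-check the list, a short script along these lines should enumerate the conflicting entries (the path and function name are just placeholders; printed line numbers are 1-based):

```python
from collections import defaultdict

# Sketch of a check that lists every token appearing more than once in
# vader_lexicon.txt with differing sentiment values.
def find_conflicts(path="vader_lexicon.txt"):
    seen = defaultdict(list)  # token -> [(line_no, sentiment), ...]
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            token, measure = line.rstrip("\n").split("\t")[0:2]
            seen[token].append((line_no, float(measure)))
    for token, entries in seen.items():
        if len({value for _, value in entries}) > 1:
            for line_no, value in entries:
                print(line_no, token, value)

find_conflicts()
```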
TjallingO commented 3 years ago

This issue stumped me as well during the development of my own port. There are even more duplicates, like

| line no. | element | sentiment |
| --- | --- | --- |
| 342 | o.o | -0.6 |
| 4853 | o.o | -0.8 |

I worked around the issue by letting subsequent entries replace existing mappings, thus keeping the original lexicon file intact. However, as you mentioned, this does not seem like a sustainable solution. I would really appreciate a follow-up from @cjhutto or any of the other co-authors on what the most appropriate permanent option would be.