clips / pattern

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
https://github.com/clips/pattern/wiki
BSD 3-Clause "New" or "Revised" License

sentiment does not return value between -1 and 1 #52

Closed jwijffels closed 10 years ago

jwijffels commented 10 years ago

Hello,

The documentation states that sentiment() returns a polarity value between -1 and 1, but this does not appear to be the case. For example, the code below returns a value lower than -1. Why is this?

from pattern.nl import sentiment
sentiment("ik vind het heel vervelend als dat gebeurt")
(-1.0133333333333332, -1.0133333333333332)

tom-de-smedt commented 10 years ago

Adverbs act as intensifiers, so "heel vervelend" gets the score of "vervelend" x1.6. Of course, the eventual output should be between -1.0 and +1.0. This was fixed a while ago. Try pulling the latest revision from GitHub, it should make the problem go away.
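
The arithmetic behind the reported value can be sketched as follows. This is a minimal illustration, not Pattern's actual implementation: the helper names `clamp` and `intensify` are hypothetical, and the base score for "vervelend" is an assumption obtained by dividing the reported output by the 1.6 multiplier.

```python
def clamp(value, lower=-1.0, upper=1.0):
    """Restrict value to the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


def intensify(adjective_score, adverb_multiplier):
    """Scale an adjective's polarity by an adverb's intensity, then cap it."""
    return clamp(adjective_score * adverb_multiplier)


# Assumed base score: the reported output (-1.0133...) divided by 1.6.
VERVELEND = -1.0133333333333332 / 1.6
HEEL = 1.6  # multiplier for the adverb "heel", as stated in the comment above

raw = VERVELEND * HEEL               # uncapped product, falls below -1.0
capped = intensify(VERVELEND, HEEL)  # capped to the documented range

print("raw: %.4f, capped: %.4f" % (raw, capped))
```

Without the cap, the product of a strongly negative adjective and an intensifying adverb leaves the documented range, which matches the value reported in the issue.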

jwijffels commented 10 years ago

Hi,

I find it hard to believe that this is a floating-point issue. When I check the Dutch sentiment of the following:

from pattern.nl import sentiment
sentiment("niet zozeer over de callcenter medewerker, maar over de enorm slechte service die de aanleiding vormde")

I get a sentiment of -1.33. This doesn't look like a floating-point issue to me. The fix now only caps the values to -1/1. Is it correct that the algorithm itself can return values well below -1?

Jan

2013/11/12 Tom De Smedt notifications@github.com

Closed #52 https://github.com/clips/pattern/issues/52.


groeten/kind regards, Jan

Jan Wijffels Statistical Data Miner www.bnosac.be | +32 486 611708

tom-de-smedt commented 10 years ago

Hi Jan,

The algorithm always returns values between -1.0 and +1.0. This was fixed a few weeks ago. I thought I might have missed something, so I merged the recent proposed solution to review it, but it turns out it is not necessary.

The rationale is as follows:

from pattern.nl import sentiment

# The comments show the polarity value for each phrase.
print(sentiment("enorm goed"))     # +1.00
print(sentiment("echt goed"))      # +0.88
print(sentiment("goed"))           # +0.55
print(sentiment("niet echt goed")) # -0.17
print(sentiment("niet goed"))      # -0.28
print(sentiment("echt niet goed")) # -0.04 XXX should handle this case better
print(sentiment("slecht"))         # -0.70
print(sentiment("echt slecht"))    # -1.00
print(sentiment("enorm slecht"))   # -1.00

To get the best predictive accuracy, you should just check if a value is >= 0.1 (positive) or < 0.1 (negative).
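
The threshold rule can be written as a small helper. This is only a sketch: the function name `classify` and its `threshold` parameter are hypothetical and not part of Pattern's API. It expects the polarity value, which is assumed to be the first element of the tuple that `sentiment()` returns.

```python
def classify(polarity, threshold=0.1):
    """Label a polarity score as positive or negative using a fixed cutoff."""
    return "positive" if polarity >= threshold else "negative"


# Polarity values copied from the examples above.
examples = [("enorm goed", 1.00), ("niet echt goed", -0.17), ("slecht", -0.70)]

for phrase, polarity in examples:
    print("%-15s %+.2f -> %s" % (phrase, polarity, classify(polarity)))
```

Note that a score between 0.0 and 0.1 counts as negative under this rule, so weakly positive text is not labelled positive.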

Grz, Tom

jwijffels commented 10 years ago

Hi Tom,

Thanks for the feedback. I was just worried because the fix you merged into pattern mentioned 'floating point inaccuracies', which clearly was not the cause. The real reason is that internally the values can go over the bounds. Anyhow, thanks again for the feedback.

groeten Jan

2013/11/14 Tom De Smedt notifications@github.com

Hi Jan,

The algorithm always returns values between -1.0 and +1.0. This was fixed a few weeks ago. I thought I might have missed something, so I merged the recent proposed solution to review it, but it turns out it is not necessary.

The rationale is as follows:

  • Sentiment can range from entirely negative (-1.0) to somewhat negative (-0.5) to neutral (0.0) to somewhat positive (+0.5) to entirely positive (+1.0).
  • Adverbs can push a somewhat negative or positive adjective to entirely negative or positive.

print(sentiment("enorm goed"))     # +1.00
print(sentiment("echt goed"))      # +0.88
print(sentiment("goed"))           # +0.55
print(sentiment("niet echt goed")) # -0.17
print(sentiment("niet goed"))      # -0.28
print(sentiment("echt niet goed")) # -0.04 XXX should handle this case better
print(sentiment("slecht"))         # -0.70
print(sentiment("echt slecht"))    # -1.00
print(sentiment("enorm slecht"))   # -1.00

  • To accomplish this, internally the values sometimes go over the bounds.
  • Yes, there used to be a bug that returned the raw values. But this makes no sense: a statement cannot be more than entirely negative or more than entirely positive.
  • So when the score of an adjective is multiplied by the score of an adverb, the raw value is immediately micro-capped, see for example https://github.com/clips/pattern/blob/master/pattern/text/__init__.py#L1905

To get the best predictive accuracy, you should just check if a value is >= 0.1 (positive) or < 0.1 (negative).

Grz, Tom

