cadmiumcr / sentiment

MIT License

Sentiment Analysis - Error on composed words #3

Open alex-lairan opened 4 years ago

alex-lairan commented 4 years ago

Hi,

I'm using the sentiment analysis for testing purposes, and I found an issue with composed words (contractions such as "don't").

I have this code:

require "cadmium"

sentiment = Cadmium.sentiment
pp sentiment.analyze "I realy don't like mosquitoes"
pp "I realy don't like mosquitoes".is_negative?

The result is:

{score: 2,
 comparative: 0,
 tokens: ["I", "realy", "do", "n't", "like", "mosquitoes"],
 words: ["like"],
 positive: ["like"],
 negative: []}
false

Here, the negation in "don't" is not taken into account: "like" is still counted as positive. I know this is bad English, but it's something you can find on Twitter.

I don't know if I'm using it the wrong way.
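
To make the expected behaviour concrete, here is a minimal sketch of negation-aware scoring, run on the tokens from the result above. It is not how Cadmium works internally: `SCORES`, `NEGATORS` and `negation_aware_score` are hypothetical names, and the word list is a made-up stand-in for a real sentiment lexicon.

```crystal
# Hypothetical word scores and negators; NOT Cadmium's real lexicon.
SCORES   = {"like" => 2, "love" => 3, "hate" => -3}
NEGATORS = ["not", "n't", "don't", "never", "no"]

# Sums the word scores, flipping the sign of a scored word when the
# token immediately before it is a negator.
def negation_aware_score(tokens : Array(String)) : Int32
  total = 0
  tokens.each_with_index do |token, i|
    value = SCORES.fetch(token.downcase, 0)
    next if value == 0
    negated = i > 0 && NEGATORS.includes?(tokens[i - 1].downcase)
    total += negated ? -value : value
  end
  total
end

puts negation_aware_score(["I", "realy", "do", "n't", "like", "mosquitoes"]) # prints -2
puts negation_aware_score(["I", "like", "mosquitoes"])                       # prints 2
```

With a rule like this the sentence would score below zero, which is the result I expected from `is_negative?`. Looking only one token back is a deliberate simplification; it would miss cases such as "don't really like".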

watzon commented 4 years ago

Seems like a problem with the tokenizer. I'll look into it.

hugoabonizio commented 4 years ago

With the pragmatic tokenizer, the token "don't" is recognized, but I think there's also a problem with the negation identification, which I addressed in cadmiumcr/cadmium#27.

sentiment.tokenizer = Cadmium.pragmatic_tokenizer.new

{score: 2,
 comparative: 0.4,
 tokens: ["i", "realy", "don't", "like", "mosquitoes"],
 words: ["like"],
 positive: ["like"],
 negative: []}
false

watzon commented 4 years ago

The problem with the Pragmatic Tokenizer is that it's much, much slower than the other ones. I do not recommend using it internally for anything.

hugoabonizio commented 4 years ago

@watzon it also works with aggressive_tokenizer, but the behavior varies a lot depending on the tokenizer.
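
As a rough illustration of why the tokenizer matters, the sketch below checks for a negation across different shapes of "don't". The first two shapes are the ones shown earlier in this thread; the third (apostrophe stripped) is only an assumption about what some tokenizer might produce. `negation_at?` is a hypothetical helper, not part of Cadmium.

```crystal
# Hypothetical helper: true when the token at `index` expresses a negation,
# whichever way the tokenizer split or stripped the contraction.
def negation_at?(tokens : Array(String), index : Int32) : Bool
  token = tokens[index].downcase
  return true if ["not", "never", "no", "n't"].includes?(token)
  # Contraction kept whole ("don't"), or with the apostrophe stripped ("dont").
  token.ends_with?("n't") || ["dont", "doesnt", "didnt", "cant", "wont"].includes?(token)
end

variants = [
  ["do", "n't", "like"], # contraction split into two tokens
  ["don't", "like"],     # contraction kept whole
  ["dont", "like"],      # apostrophe stripped (assumed shape, not confirmed)
]

variants.each do |tokens|
  # "like" is the last token, so the candidate negator sits just before it.
  puts "#{tokens} -> negated: #{negation_at?(tokens, tokens.size - 2)}"
end
```

A check that only matches one of these shapes would silently miss the negation when the tokenizer is swapped, which would be consistent with the varying behavior.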

watzon commented 4 years ago

Yeah, the aggressive_tokenizer would probably be the one to use.

rmarronnier commented 4 years ago

@watzon: Can we move this issue to the cadmiumcr/sentiment repo? It makes more sense :-)

watzon commented 4 years ago

Yes, it should definitely be moved.