bwbaugh / infertweet

Infer information from Tweets. Useful for human-centered computing tasks, such as sentiment analysis, location prediction, authorship profiling and more!
http://infertweet.bwbaugh.com/

Hierarchical sentiment classifier, single feature classification, erroneous probabilities? #27

Open bwbaugh opened 11 years ago

bwbaugh commented 11 years ago

Part of the web interface is supposed to show how each feature would be classified if it were a document of length one. Why does the hierarchical sentiment classifier label these individual features only as neutral or positive, even when the confidence value is below 0.5?

As an example:

<span style="color: #808080" title="neutral: 48.01%">('__start__', u'This')</span> 
<span style="color: #98c000" title="positive: 60.35%">(u'This',)</span> 
<span style="color: #808080" title="neutral: 45.32%">(u'This', u'is')</span> 
<span style="color: #b3c000" title="positive: 53.17%">(u'is',)</span> 
<span style="color: #808080" title="neutral: 38.86%">(u'is', u'only')</span> 
<span style="color: #c07e00" title="positive: 32.82%">(u'only',)</span> 
<span style="color: #808080" title="neutral: 67.93%">(u'only', u'a')</span> 
<span style="color: #9bc000" title="positive: 59.42%">(u'a',)</span> 
<span style="color: #808080" title="neutral: 51.44%">(u'a', u'test')</span> 
<span style="color: #c0a100" title="positive: 42.09%">(u'test',)</span> 
<span style="color: #808080" title="neutral: 34.62%">(u'test', '__end__')</span> <br>

Current hash: 5fd9baa3551fc1c0af4692cbae7a589ff1ea21e4
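
For reference, here is a minimal sketch of how a two-stage classifier could report a winning label with a confidence below 0.5. It assumes the hierarchy is "neutral vs. subjective" followed by "positive vs. negative", and that the reported confidence is the product of the two stages. Both are assumptions, and `hierarchical_confidence` and its inputs are hypothetical, not taken from the repository.

```python
def hierarchical_confidence(p_neutral, p_positive_given_subjective):
    """Return (label, confidence) for a hypothetical two-stage classifier."""
    p_subjective = 1.0 - p_neutral
    # Stage 1: neutral vs. subjective.
    if p_neutral >= p_subjective:
        return 'neutral', p_neutral
    # Stage 2: polarity, reached only when stage 1 picks "subjective".
    # The joint probability multiplies the confidence of both stages.
    p_positive = p_subjective * p_positive_given_subjective
    p_negative = p_subjective * (1.0 - p_positive_given_subjective)
    if p_positive >= p_negative:
        return 'positive', p_positive
    return 'negative', p_negative


# Made-up inputs: stage 1 leans subjective, stage 2 leans positive.
label, confidence = hierarchical_confidence(0.45, 0.60)
print(label, round(confidence, 4))
```

Under these assumptions a "positive" label below 50% is possible, because 0.55 * 0.60 is about 0.33. The same scheme would not produce a "neutral" label below 50%, so it would not account for values such as `neutral: 38.86%`.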

bwbaugh commented 11 years ago

Now, using conditional probabilities only (instead of trying to classify each feature as its own document):

<span style="color: #808080" title="neutral: 51.99%">('__start__', u'This')</span> 
<span style="color: #808080" title="neutral: 52.04%">(u'This',)</span> 
<span style="color: #808080" title="neutral: 54.68%">(u'This', u'is')</span> 
<span style="color: #808080" title="neutral: 56.23%">(u'is',)</span> 
<span style="color: #808080" title="neutral: 61.14%">(u'is', u'only')</span> 
<span style="color: #808080" title="neutral: 56.40%">(u'only',)</span> 
<span style="color: #c0ad00" title="negative: 54.75%">(u'only', u'a')</span> 
<span style="color: #808080" title="neutral: 54.63%">(u'a',)</span> 
<span style="color: #c06500" title="negative: 73.62%">(u'a', u'test')</span> 
<span style="color: #808080" title="neutral: 52.74%">(u'test',)</span> 
<span style="color: #808080" title="neutral: 65.38%">(u'test', '__end__')</span> <br>

Perhaps the prior probabilities skew the overall classification so much that a single feature isn't capable of overcoming them. Now that I think about it, why are we throwing away the confidence value from the classification process and recalculating it from the conditionals? Which is the correct approach?
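
To illustrate the suspicion about priors, here is a small sketch that scores one feature twice: once with class priors and once with the conditionals alone. It assumes a naive-Bayes-style model, which the thread does not confirm. The `posterior` helper and every number in it are hypothetical and not taken from the trained model.

```python
def posterior(likelihoods, priors=None):
    """Normalize P(feature | label) * P(label) over all labels.

    With priors=None every label gets the same weight, so the result
    depends on the conditional probabilities only.
    """
    labels = sorted(likelihoods)
    if priors is None:
        priors = {label: 1.0 for label in labels}
    scores = {label: likelihoods[label] * priors[label] for label in labels}
    total = sum(scores.values())
    return {label: scores[label] / total for label in labels}


# Hypothetical values for a single feature.
likelihoods = {'negative': 0.003, 'neutral': 0.002, 'positive': 0.002}
priors = {'negative': 0.1, 'neutral': 0.3, 'positive': 0.6}

with_priors = posterior(likelihoods, priors)
without_priors = posterior(likelihoods)

print(max(with_priors, key=with_priors.get))
print(max(without_priors, key=without_priors.get))
```

With these numbers the feature's conditionals favor "negative", but the skewed priors flip the decision to "positive". A document of length one offers only one likelihood term to push against the prior, which is the effect suspected above.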

bwbaugh commented 11 years ago

When we use the original confidence value from the classification process, we get:

<span style="color: #808080" title="neutral: 50.56%">('__start__', u'This')</span> 
<span style="color: #a5c000" title="positive: 56.90%">(u'This',)</span> 
<span style="color: #808080" title="neutral: 50.56%">(u'This', u'is')</span> 
<span style="color: #a5c000" title="positive: 56.90%">(u'is',)</span> 
<span style="color: #808080" title="neutral: 50.56%">(u'is', u'only')</span> 
<span style="color: #a5c000" title="positive: 56.90%">(u'only',)</span> 
<span style="color: #808080" title="neutral: 50.56%">(u'only', u'a')</span> 
<span style="color: #a5c000" title="positive: 56.90%">(u'a',)</span> 
<span style="color: #808080" title="neutral: 50.56%">(u'a', u'test')</span> 
<span style="color: #a5c000" title="positive: 56.90%">(u'test',)</span> 
<span style="color: #808080" title="neutral: 50.56%">(u'test', '__end__')</span> <br>

Why are there only two unique confidence values across all features? Shouldn't the individual conditional probabilities cause at least some variation?
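
A quick way to make the pattern explicit is to group the results by feature length. The sketch below copies the labels and confidences from the listing above. The grouping itself is just a diagnostic idea, not something the project provides.

```python
from collections import defaultdict

# (feature, label, confidence in percent), copied from the listing above.
results = [
    (('__start__', 'This'), 'neutral', 50.56),
    (('This',), 'positive', 56.90),
    (('This', 'is'), 'neutral', 50.56),
    (('is',), 'positive', 56.90),
    (('is', 'only'), 'neutral', 50.56),
    (('only',), 'positive', 56.90),
    (('only', 'a'), 'neutral', 50.56),
    (('a',), 'positive', 56.90),
    (('a', 'test'), 'neutral', 50.56),
    (('test',), 'positive', 56.90),
    (('test', '__end__'), 'neutral', 50.56),
]

# Collect the distinct (label, confidence) pairs seen for each n-gram length.
by_length = defaultdict(set)
for feature, label, confidence in results:
    by_length[len(feature)].add((label, confidence))

for length in sorted(by_length):
    print(length, sorted(by_length[length]))
```

Each length maps to exactly one pair: every unigram is positive at 56.90% and every bigram is neutral at 50.56%. That suggests the value may depend on the n-gram length and not on the feature itself, though this is only a reading of the output, not a confirmed diagnosis.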