Sotera / watchman

Watchman: An open-source social-media event-detection system
GNU General Public License v2.0

Pointwise mutual information #69

Open drJAGartner opened 7 years ago

drJAGartner commented 7 years ago

As an alternative to the current method of non-hashtag sentiment clustering, we can try computing pointwise mutual information (PMI) scores on word bigrams, restricted to non-stopwords. Similarly to how we build hashtag clusters, we can assess how likely such a high PMI score is, and from there decide whether to include the pair in our graph.
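To make the proposal concrete, here is a minimal sketch of bigram PMI scoring in Python. It is not Watchman code: the function name `bigram_pmi`, the tiny `STOPWORDS` set, the whitespace tokenizer, and the `min_count` cutoff are all assumptions made for illustration. Stopwords are dropped before pairing, so "pray for paris" yields the pair (pray, paris).

```python
import math
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "for", "in", "of", "and", "to", "is"}


def bigram_pmi(documents, min_count=2):
    """Return {(w1, w2): PMI in bits} for adjacent non-stopword pairs.

    Assumption: stopwords are removed first, so words separated only by
    stopwords count as adjacent. Pairs seen fewer than min_count times
    are skipped, because PMI is unstable for rare pairs.
    """
    unigrams = Counter()
    bigrams = Counter()
    for text in documents:
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi          # probability of the pair
        p_x = unigrams[w1] / n_uni   # probability of the first word
        p_y = unigrams[w2] / n_uni   # probability of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    posts = [
        "pray for paris",
        "pray for paris tonight",
        "the weather in paris",
        "irish water protest",
        "irish water charges",
    ]
    for pair, score in sorted(bigram_pmi(posts).items()):
        print(pair, round(score, 2))
```

Deciding which scores are high enough to add to the graph would still need a separate rule, such as a threshold or a significance check; the sketch leaves that open.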

lukewendling commented 7 years ago

@drJAGartner can you explain? Maybe with some examples.

drJAGartner commented 7 years ago

Pointwise mutual information is a measure from information theory that describes how strongly two words are associated, i.e. how much more often they co-occur than they would if they were independent: https://en.wikipedia.org/wiki/Pointwise_mutual_information
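For reference, the standard definition for a word pair $x$, $y$ is given below, where $p(x, y)$ is the probability of the pair occurring together and $p(x)$, $p(y)$ are the probabilities of each word on its own. How those probabilities are estimated (per tweet, per bigram, per time window) is a choice left open here.

```latex
\operatorname{pmi}(x; y) = \log \frac{p(x, y)}{p(x)\, p(y)}
```

The score is zero when the two words are independent, positive when they co-occur more often than chance, and negative when they co-occur less often.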

If you look at the Applications section, you can see what this looks like: words that almost always appear together (e.g. Puerto & Rico) have high scores. If we find pairs of words whose co-occurrence changes (e.g. pray-paris, irish-water, crane-hadge), that would be a good way of identifying unique speech patterns.
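One way to look for pairs whose co-occurrence changes is to compute PMI over a baseline window and a current window and compare them. The sketch below is only an illustration under stated assumptions: `pair_pmi`, `rising_pairs`, and the `min_gain` threshold are hypothetical names, inputs are already tokenized with stopwords removed, and a pair unseen in the baseline is treated as having PMI 0 there.

```python
import math
from collections import Counter


def pair_pmi(token_lists):
    """PMI in bits for every adjacent token pair across the token lists."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        expected = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scores[(w1, w2)] = math.log2((count / n_bi) / expected)
    return scores


def rising_pairs(baseline, current, min_gain=1.0):
    """Pairs whose PMI in `current` beats their baseline PMI by min_gain bits.

    Assumption: pairs that never appear in the baseline get a baseline
    PMI of 0, so new strongly associated pairs are flagged.
    """
    old, new = pair_pmi(baseline), pair_pmi(current)
    return sorted(p for p, s in new.items() if s - old.get(p, 0.0) >= min_gain)


if __name__ == "__main__":
    baseline = [["puerto", "rico"], ["paris", "fashion"], ["puerto", "rico"]]
    current = [["puerto", "rico"], ["puerto", "rico"],
               ["pray", "paris"], ["pray", "paris"]]
    print(rising_pairs(baseline, current))
```

A stable pair like Puerto & Rico scores high in both windows and is not flagged, while a pair that only becomes strongly associated in the current window is. With real data the counts per window would need to be large enough for the PMI estimates to be meaningful.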

lukewendling commented 7 years ago

sounds so fancy, yet so simple.