Ideas for experimental variables for sentiment analysis

Below are some ideas for experimental variables. An item should be marked as completed if the current implementation already makes it possible to experiment with that variable.

Filter tweet content

[ ] Concatenating negation tokens (such as "not") with the following token
[ ] Stop words
Synsets
- [ ] Extract all for each token
- [ ] Extract just one (WSD)
Filter some elements from tweet text (remove altogether or replace with a single token)
- [ ] username mentions (@username)
- [ ] URLs (http://...)
- [ ] Retweets (RT)
- [ ] Numbers

However, should this really be done? It might be better to use feature hashing and leave those elements in. It's worth experimenting with.
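
To make the filtering options above concrete, here is a minimal sketch of a tokenizer that can either drop or mask mentions, URLs, retweet markers, and numbers, and that joins a negation word with the token after it. The placeholder names (`__URL__`, `__USER__`, and so on), the `NEGATIONS` set, and the `preprocess` function are all hypothetical and chosen for illustration; they are not taken from the project.

```python
import re

# Hypothetical placeholder tokens; URLs go first because they may contain
# digits or "@" characters that the later patterns would otherwise match.
PLACEHOLDERS = [
    (re.compile(r"https?://\S+"), "__URL__"),
    (re.compile(r"@\w+"), "__USER__"),
    (re.compile(r"\bRT\b"), "__RT__"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "__NUM__"),
]
# Assumed negation list; a real experiment would likely tune this.
NEGATIONS = {"not", "no", "never"}


def preprocess(text, replace=True, join_negations=True):
    """Tokenize a tweet, optionally masking elements and joining negations."""
    for pattern, placeholder in PLACEHOLDERS:
        # replace=True swaps the element for a token; False drops it altogether.
        text = pattern.sub(placeholder if replace else " ", text)
    tokens = text.split()
    if not join_negations:
        return tokens
    result = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in NEGATIONS and i + 1 < len(tokens):
            # Concatenate the negation with the following token.
            result.append(tokens[i].lower() + "_" + tokens[i + 1])
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return result


print(preprocess("RT @bob I do not like http://t.co/abc 42 times"))
```

Exposing `replace` and `join_negations` as parameters means each checkbox above becomes a switch that can be toggled per experiment rather than a fixed preprocessing step.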

Other variables

[ ] Number of bits used for the table in the hashing trick (so the table size is always 2^bits)
[x] Laplace smoothing value
[ ] Using the text of Wikipedia as a source of objective training text
[x] Hierarchical classification or a single classifier with all classes
[ ] Multinomial vs Bernoulli
[ ] Ignore features that haven't been seen by [any | current] class (which is equivalent to marginalizing over (source))
[x] Ignore prior probabilities (make uniform)
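
Several of the variables above (table bits, Laplace smoothing value, uniform priors) can be shown together in one small sketch. The class below is an illustrative multinomial Naive Bayes over a hashed feature table, not the project's implementation. The class name, the parameter names, the use of `zlib.crc32` as the hash, and the choice to treat the number of hash slots as the vocabulary size in the smoothing term are all assumptions.

```python
import math
import zlib
from collections import defaultdict


class HashedNaiveBayes:
    """Multinomial Naive Bayes over a hashed feature table (illustrative sketch)."""

    def __init__(self, bits=18, alpha=1.0, uniform_prior=False):
        self.size = 1 << bits          # table size is always 2 ** bits
        self.alpha = alpha             # Laplace (additive) smoothing; must be > 0
        self.uniform_prior = uniform_prior
        self.counts = defaultdict(lambda: defaultdict(float))  # label -> slot -> count
        self.totals = defaultdict(float)                       # label -> token count
        self.docs = defaultdict(int)                           # label -> document count

    def _slot(self, token):
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        return zlib.crc32(token.encode("utf-8")) & (self.size - 1)

    def train(self, tokens, label):
        self.docs[label] += 1
        for token in tokens:
            self.counts[label][self._slot(token)] += 1
            self.totals[label] += 1

    def log_scores(self, tokens):
        n_docs = sum(self.docs.values())
        scores = {}
        for label in self.docs:
            # A uniform prior adds the same constant to every class, so skip it.
            score = 0.0 if self.uniform_prior else math.log(self.docs[label] / n_docs)
            # Assumption: every hash slot counts as one vocabulary entry.
            denom = self.totals[label] + self.alpha * self.size
            for token in tokens:
                count = self.counts[label].get(self._slot(token), 0.0)
                score += math.log((count + self.alpha) / denom)
            scores[label] = score
        return scores

    def classify(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


nb = HashedNaiveBayes(bits=20, alpha=1.0)
nb.train(["good", "great"], "pos")
nb.train(["bad", "awful"], "neg")
print(nb.classify(["good"]))
```

Fewer bits shrink the table and increase hash collisions, which is exactly the trade-off the `bits` variable is meant to explore.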

Semi-supervised approaches

[ ] Co-training or multi-view
- Split the vocabulary (feature space) into 2 or more subsets
- Though Naive Bayes outputs a "probability", as document length increases that probability tends toward 0 or 1, so it is a weak measure of confidence.
- Can either take the standard approach of using the probability anyway and adding the most confident examples to the training set
- Or, in a streaming fashion, use the instance for training if the ensemble agrees, and discard it otherwise
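
The streaming variant described above could be sketched roughly as follows. This is a simplified illustration, not the project's code: `TinyNB`, `view_of`, `split_views`, and `cotrain_stream` are hypothetical names, the vocabulary is split into views by a stable hash of each token, and agreement is a plain unanimous vote rather than a probability threshold.

```python
import math
import zlib
from collections import Counter, defaultdict


class TinyNB:
    """Minimal multinomial Naive Bayes with add-one smoothing (illustrative)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # label -> token -> count
        self.docs = Counter()               # label -> number of documents
        self.vocab = set()

    def train(self, tokens, label):
        self.docs[label] += 1
        self.counts[label].update(tokens)
        self.vocab.update(tokens)

    def classify(self, tokens):
        best_label, best_score = None, -math.inf
        n_docs = sum(self.docs.values())
        for label in self.docs:
            # The extra +1 reserves probability mass for unseen tokens.
            denom = sum(self.counts[label].values()) + len(self.vocab) + 1
            score = math.log(self.docs[label] / n_docs)
            for token in tokens:
                score += math.log((self.counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def view_of(token, n_views=2):
    """Assign each vocabulary item to one view via a stable hash."""
    return zlib.crc32(token.encode("utf-8")) % n_views


def split_views(tokens, n_views=2):
    """Split a token list into one sub-list per view."""
    views = [[] for _ in range(n_views)]
    for token in tokens:
        views[view_of(token, n_views)].append(token)
    return views


def cotrain_stream(models, labeled, unlabeled):
    """Seed each view's model, then self-label only where all views agree."""
    n = len(models)
    for tokens, label in labeled:
        for model, view in zip(models, split_views(tokens, n)):
            model.train(view, label)
    used = 0
    for tokens in unlabeled:
        views = split_views(tokens, n)
        votes = {m.classify(v) for m, v in zip(models, views)}
        if len(votes) == 1:  # ensemble agrees: use the instance for training
            label = votes.pop()
            for model, view in zip(models, views):
                model.train(view, label)
            used += 1
        # otherwise the instance is discarded
    return used


models = [TinyNB(), TinyNB()]
seed = [(["good", "great", "nice", "love"], "pos"),
        (["bad", "awful", "hate", "poor"], "neg")]
used = cotrain_stream(models, seed, [["good", "great", "nice", "love"]])
print(used)
```

Note that a model whose view of an instance is empty falls back to its prior, so with short tweets the agreement check is weaker than it looks; that may be worth measuring as part of the experiment.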
