Infer information from Tweets. Useful for human-centered computing tasks, such as sentiment analysis, location prediction, authorship profiling and more!
Below are some ideas for experimental variables. An item is marked as completed once the current implementation supports experimenting with it.
Filter tweet content
[ ] Concatenating negation tokens (such as "not") with the following token
[ ] Stop words
Synsets
[ ] Extract all for each token
[ ] Extract just one (WSD)
Filter some elements from the tweet text (remove them altogether, or collapse each to a placeholder token)
[ ] username mentions (@username)
[ ] URLs (http://...)
[ ] ReTweets (RT)
[ ] Numbers
However, should this really be done? It might be better to use feature hashing and leave those elements in. It's worth experimenting with.
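A minimal sketch of the filtering ideas above, so the two options (dropping an element or collapsing it to a placeholder token) can be compared in an experiment. The regular expressions, the placeholder names (`<URL>`, `<MENTION>`, `<RT>`, `<NUM>`), the negation list and the `_` joiner are all assumptions made for illustration and are not taken from the current implementation.

```python
import re

# Hypothetical patterns and placeholder names, chosen for illustration only.
PATTERNS = [
    ("<URL>", re.compile(r"https?://\S+")),
    ("<MENTION>", re.compile(r"@\w+:?")),
    ("<RT>", re.compile(r"RT")),
    ("<NUM>", re.compile(r"\d+(?:[.,]\d+)*")),
]
# Assumed negation list; extend as needed.
NEGATIONS = {"not", "no", "never"}


def filter_tokens(text, mode="collapse"):
    """Split on whitespace, then handle mentions, URLs, RT markers and numbers.

    mode="collapse" replaces each matching token with its placeholder;
    any other mode drops the matching token entirely.
    """
    out = []
    for tok in text.split():
        for placeholder, pattern in PATTERNS:
            if pattern.fullmatch(tok):
                if mode == "collapse":
                    out.append(placeholder)
                break
        else:
            # No pattern matched: keep the token, lowercased.
            out.append(tok.lower())
    return out


def join_negations(tokens):
    """Concatenate each negation token with the token that follows it."""
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

For example, `join_negations(filter_tokens(tweet, mode="remove"))` gives the "remove, then join negations" variant, while `mode="collapse"` keeps one placeholder per filtered element so that a classifier can still learn from its presence.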
Other variables
[ ] Bits used for the table in the hashing trick (so the table size is always 2^bits)
[x] Laplace smoothing value
[ ] Using the text of Wikipedia as a source of objective training text
[x] Hierarchical classification or a single classifier with all classes
[ ] Multinomial vs Bernoulli
[ ] Ignore features that haven't been seen by [any | current] class (which is equivalent to marginalizing over (source))
[x] Ignore prior probabilities (make uniform)
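To make a few of these variables concrete, here is a small sketch of a multinomial Naive Bayes classifier over hashed features. It exposes the number of hash bits, the Laplace smoothing value and the uniform-prior switch as parameters. The class name `HashedMultinomialNB`, the helper `hashed_index` and the use of CRC32 as the hash function are assumptions for illustration; the current implementation may be structured quite differently.

```python
import math
import zlib
from collections import defaultdict


def hashed_index(token, bits):
    """Map a token to a slot in a table of 2**bits entries."""
    return zlib.crc32(token.encode("utf-8")) & ((1 << bits) - 1)


class HashedMultinomialNB:
    """Illustrative multinomial Naive Bayes over hashed token counts."""

    def __init__(self, bits=18, alpha=1.0, uniform_prior=False):
        self.bits = bits
        self.size = 1 << bits          # table size is always 2**bits
        self.alpha = alpha             # Laplace smoothing value, must be > 0
        self.uniform_prior = uniform_prior
        self.counts = defaultdict(lambda: defaultdict(int))  # class -> slot -> count
        self.totals = defaultdict(int)                       # class -> token count
        self.docs = defaultdict(int)                         # class -> document count

    def train(self, tokens, label):
        self.docs[label] += 1
        for tok in tokens:
            self.counts[label][hashed_index(tok, self.bits)] += 1
            self.totals[label] += 1

    def log_scores(self, tokens):
        """Return the unnormalized log-probability of each class."""
        n_docs = sum(self.docs.values())
        scores = {}
        for label in self.docs:
            # Ignoring the prior is the same as adding a constant to every class.
            score = 0.0 if self.uniform_prior else math.log(self.docs[label] / n_docs)
            denom = self.totals[label] + self.alpha * self.size
            for tok in tokens:
                slot = hashed_index(tok, self.bits)
                score += math.log((self.counts[label].get(slot, 0) + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)
```

Working in log space avoids numeric underflow on longer documents, and because smoothing is applied over all 2^bits slots, unseen tokens never produce a zero probability.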
Semi-supervised approaches
[ ] Co-training or multi-view
Split the vocabulary (feature space) into 2 or more subsets
Though Naive Bayes outputs a "probability", it is a product of per-token likelihoods, so as document length increases the probability tends towards 0 or 1.
Either take the standard approach: use the probability anyway and add the most confident examples to the training set
Or, in a streaming fashion: if the ensemble agrees on an instance, use it for training; otherwise discard it
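The streaming variant can be sketched roughly as follows. Each token is assigned to one of the views by hashing it, which is one possible way to split the vocabulary, and an unlabeled instance is used for training only when all views predict the same label. The function names and the `predict(tokens)` / `train(tokens, label)` classifier interface are hypothetical and are not part of the current implementation.

```python
import zlib


def split_views(tokens, n_views=2):
    """Partition tokens into n_views disjoint views.

    Hashing the token means a given token always lands in the same view,
    which amounts to splitting the vocabulary into n_views subsets.
    """
    views = [[] for _ in range(n_views)]
    for tok in tokens:
        views[zlib.crc32(tok.encode("utf-8")) % n_views].append(tok)
    return views


def train_on_agreement(classifiers, unlabeled_stream):
    """Train on an unlabeled instance only if every view's classifier agrees.

    `classifiers` holds one object per view, each with hypothetical
    predict(tokens) and train(tokens, label) methods.
    Returns the number of accepted and discarded instances.
    """
    accepted = discarded = 0
    for tokens in unlabeled_stream:
        views = split_views(tokens, len(classifiers))
        labels = {clf.predict(view) for clf, view in zip(classifiers, views)}
        if len(labels) == 1:
            label = labels.pop()
            for clf, view in zip(classifiers, views):
                clf.train(view, label)
            accepted += 1
        else:
            discarded += 1
    return accepted, discarded
```

Using agreement rather than a confidence threshold sidesteps the saturated probabilities mentioned above, at the cost of discarding every instance on which the views disagree.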