Infer information from Tweets. Useful for human-centered computing tasks, such as sentiment analysis, location prediction, authorship profiling and more!
- [ ] all-caps: the number of words with all characters in upper case (see the surface-feature sketch after this list);
- [ ] clusters: presence/absence of tokens from each of the 1000 word clusters provided by Carnegie Mellon University's Twitter NLP tool;
- [x] elongated words: the number of words with one character repeated more than two times, e.g. 'soooo';
- [x] emoticons:
  - presence/absence of positive and negative emoticons at any position in the tweet;
  - whether the last token is a positive or negative emoticon;
- [x] hashtags: the number of hashtags;
- [ ] negation: the number of negated contexts. A negated context also affects the ngram and lexicon features: each word in a negated context, and the polarity associated with it, becomes negated (e.g., in 'not perfect', 'perfect' becomes 'perfect_NEG' and 'POLARITY_positive' becomes 'POLARITY_positive_NEG'); see the negation sketch after this list;
- [ ] POS: the number of occurrences of each part-of-speech tag;
- [ ] punctuation:
  - the number of contiguous sequences of exclamation marks, of question marks, and of both exclamation and question marks;
  - whether the last token contains an exclamation or question mark;
- [ ] sentiment lexicons: automatically created lexicons (NRC Hashtag Sentiment Lexicon, Sentiment140 Lexicon) and manually created sentiment lexicons (NRC Emotion Lexicon, MPQA, Bing Liu Lexicon). For each lexicon and each polarity we calculated:
  - the total count of tokens in the tweet with a score greater than 0;
  - the sum of the scores of all tokens in the tweet;
  - the maximal score;
  - the non-zero score of the last token in the tweet.

  The lexicon features were created for all tokens in the tweet, for each part-of-speech tag, for hashtags, and for all-caps tokens; see the lexicon sketch after this list.
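For illustration, here is a minimal sketch of a few of the surface features above (all-caps, elongated words, hashtags, punctuation runs, and emoticons). It assumes the tweet is already whitespace-tokenized; the function name and the tiny emoticon sets are illustrative stand-ins, not part of any released resource.

```python
import re

# Tiny illustrative emoticon sets; real systems use much larger lists.
POS_EMOTICONS = {":)", ":-)", ":D", ";)", "=)"}
NEG_EMOTICONS = {":(", ":-(", ":'(", "=("}

def surface_features(tokens):
    """Extract a few surface features from a whitespace-tokenized tweet."""
    text = " ".join(tokens)
    runs = re.findall(r"[!?]+", text)  # maximal contiguous runs of '!' and '?'
    return {
        # all-caps: words with every character in upper case
        "num_all_caps": sum(t.isupper() for t in tokens),
        # elongated: one character repeated more than two times, e.g. 'soooo'
        "num_elongated": sum(bool(re.search(r"(\w)\1{2,}", t)) for t in tokens),
        "num_hashtags": sum(t.startswith("#") for t in tokens),
        # punctuation: runs of only '!', only '?', and mixed '!?' sequences
        "num_exclam_runs": sum(set(r) == {"!"} for r in runs),
        "num_question_runs": sum(set(r) == {"?"} for r in runs),
        "num_mixed_runs": sum(set(r) == {"!", "?"} for r in runs),
        "last_has_exclam_or_question": bool(tokens) and any(c in "!?" for c in tokens[-1]),
        # emoticons: presence anywhere, and whether the final token is one
        "has_pos_emoticon": any(t in POS_EMOTICONS for t in tokens),
        "has_neg_emoticon": any(t in NEG_EMOTICONS for t in tokens),
        "last_is_pos_emoticon": bool(tokens) and tokens[-1] in POS_EMOTICONS,
        "last_is_neg_emoticon": bool(tokens) and tokens[-1] in NEG_EMOTICONS,
    }

print(surface_features("YAY I love this #tweet soooo much !!! :)".split()))
```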
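The negated-context marking and the lexicon aggregates can be sketched the same way. The negation cues, the clause-ending punctuation, and the toy lexicon below are assumptions for illustration only; the exact resources and rules are described in the NRC-Canada paper referenced at the end.

```python
import re

# Assumed negation cues and context-ending punctuation (illustrative only).
NEGATION_WORDS = {"not", "no", "never", "cannot"}
CLAUSE_END = re.compile(r"[.,:;!?]")

def mark_negated_contexts(tokens):
    """Append '_NEG' to tokens between a negation cue and the next punctuation."""
    out, negated = [], False
    for tok in tokens:
        if CLAUSE_END.search(tok):
            negated = False  # punctuation closes the negated context
            out.append(tok)
        elif negated:
            out.append(tok + "_NEG")
        else:
            out.append(tok)
            if tok.lower() in NEGATION_WORDS or tok.lower().endswith("n't"):
                negated = True
    return out

def lexicon_features(tokens, lexicon):
    """Aggregate per-token lexicon scores: count > 0, sum, max, last non-zero."""
    scores = [lexicon.get(t.lower(), 0.0) for t in tokens]
    nonzero = [s for s in scores if s != 0]
    return {
        "count_pos": sum(s > 0 for s in scores),
        "score_sum": sum(scores),
        "score_max": max(scores, default=0.0),
        "last_nonzero": nonzero[-1] if nonzero else 0.0,
    }

toy_lexicon = {"perfect": 2.0, "love": 1.5, "bad": -1.5}  # stand-in scores
print(mark_negated_contexts("this is not perfect at all .".split()))
# -> ['this', 'is', 'not', 'perfect_NEG', 'at_NEG', 'all_NEG', '.']
print(lexicon_features("i love this perfect day".split(), toy_lexicon))
# -> {'count_pos': 2, 'score_sum': 3.5, 'score_max': 2.0, 'last_nonzero': 2.0}
```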
See: *NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets* (Mohammad, Kiritchenko, and Zhu, SemEval-2013).