Infer information from Tweets. Useful for human-centered computing tasks, such as sentiment analysis, location prediction, authorship profiling and more!
- [ ] all-caps: the number of words with all characters in upper case (see the surface-feature sketch after this list);
- [ ] clusters: presence/absence of tokens from each of the 1000 word clusters provided by Carnegie Mellon University's Twitter NLP tool;
- [x] elongated words: the number of words with one character repeated more than two times, e.g. 'soooo';
- [x] emoticons:
  - presence/absence of positive and negative emoticons at any position in the tweet;
  - whether the last token is a positive or negative emoticon;
- [x] hashtags: the number of hashtags;
- [ ] negation: the number of negated contexts. A negated context also affects the ngram and lexicon features: each word in a negated context, and the polarity associated with it, becomes negated (e.g., in 'not perfect', 'perfect' becomes 'perfect_NEG' and 'POLARITY_positive' becomes 'POLARITY_positive_NEG'); see the negation sketch after this list;
- [ ] POS: the number of occurrences of each part-of-speech tag;
- [ ] punctuation:
  - the number of contiguous sequences of exclamation marks, of question marks, and of both exclamation and question marks;
  - whether the last token contains an exclamation or question mark;
- [ ] sentiment lexicons: automatically created lexicons (NRC Hashtag Sentiment Lexicon, Sentiment140 Lexicon) and manually created sentiment lexicons (NRC Emotion Lexicon, MPQA, Bing Liu Lexicon). For each lexicon and each polarity we calculated:
  - the total count of tokens in the tweet with a score greater than 0;
  - the sum of the scores of all tokens in the tweet;
  - the maximal score;
  - the non-zero score of the last token in the tweet.

  The lexicon features were created for all tokens in the tweet, for each part-of-speech tag, for hashtags, and for all-caps tokens; see the lexicon sketch after this list.
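For illustration, here is a minimal sketch of a few of the surface features above (all-caps, elongated words, hashtags, punctuation runs, and emoticons). It assumes the tweet is already whitespace-tokenized; the function name and the tiny emoticon sets are illustrative stand-ins, not part of any released resource.

```python
import re

# Tiny illustrative emoticon sets; real systems use much larger lists.
POS_EMOTICONS = {":)", ":-)", ":D", ";)", "=)"}
NEG_EMOTICONS = {":(", ":-(", ":'(", "=("}

def surface_features(tokens):
    """Extract a few surface features from a whitespace-tokenized tweet."""
    text = " ".join(tokens)
    runs = re.findall(r"[!?]+", text)  # maximal contiguous runs of '!' and '?'
    return {
        # all-caps: words with every character in upper case
        "num_all_caps": sum(t.isupper() for t in tokens),
        # elongated: one character repeated more than two times, e.g. 'soooo'
        "num_elongated": sum(bool(re.search(r"(\w)\1{2,}", t)) for t in tokens),
        "num_hashtags": sum(t.startswith("#") for t in tokens),
        # punctuation: runs of only '!', only '?', and mixed '!?' sequences
        "num_exclam_runs": sum(set(r) == {"!"} for r in runs),
        "num_question_runs": sum(set(r) == {"?"} for r in runs),
        "num_mixed_runs": sum(set(r) == {"!", "?"} for r in runs),
        "last_has_exclam_or_question": bool(tokens) and any(c in "!?" for c in tokens[-1]),
        # emoticons: presence anywhere, and whether the final token is one
        "has_pos_emoticon": any(t in POS_EMOTICONS for t in tokens),
        "has_neg_emoticon": any(t in NEG_EMOTICONS for t in tokens),
        "last_is_pos_emoticon": bool(tokens) and tokens[-1] in POS_EMOTICONS,
        "last_is_neg_emoticon": bool(tokens) and tokens[-1] in NEG_EMOTICONS,
    }

print(surface_features("YAY I love this #tweet soooo much !!! :)".split()))
```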
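The negated-context marking and the lexicon aggregates can be sketched the same way. The negation cues, the clause-ending punctuation, and the toy lexicon below are assumptions for illustration only; the exact resources and rules are described in the NRC-Canada paper referenced at the end.

```python
import re

# Assumed negation cues and context-ending punctuation (illustrative only).
NEGATION_WORDS = {"not", "no", "never", "cannot"}
CLAUSE_END = re.compile(r"[.,:;!?]")

def mark_negated_contexts(tokens):
    """Append '_NEG' to tokens between a negation cue and the next punctuation."""
    out, negated = [], False
    for tok in tokens:
        if CLAUSE_END.search(tok):
            negated = False  # punctuation closes the negated context
            out.append(tok)
        elif negated:
            out.append(tok + "_NEG")
        else:
            out.append(tok)
            if tok.lower() in NEGATION_WORDS or tok.lower().endswith("n't"):
                negated = True
    return out

def lexicon_features(tokens, lexicon):
    """Aggregate per-token lexicon scores: count > 0, sum, max, last non-zero."""
    scores = [lexicon.get(t.lower(), 0.0) for t in tokens]
    nonzero = [s for s in scores if s != 0]
    return {
        "count_pos": sum(s > 0 for s in scores),
        "score_sum": sum(scores),
        "score_max": max(scores, default=0.0),
        "last_nonzero": nonzero[-1] if nonzero else 0.0,
    }

toy_lexicon = {"perfect": 2.0, "love": 1.5, "bad": -1.5}  # stand-in scores
print(mark_negated_contexts("this is not perfect at all .".split()))
# -> ['this', 'is', 'not', 'perfect_NEG', 'at_NEG', 'all_NEG', '.']
print(lexicon_features("i love this perfect day".split(), toy_lexicon))
# -> {'count_pos': 2, 'score_sum': 3.5, 'score_max': 2.0, 'last_nonzero': 2.0}
```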
See: *NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets* (Mohammad, Kiritchenko, and Zhu, SemEval-2013).