MyMood-Alexa / MyMood


[SA-2] Gather Labeled Data #17

Closed lydarren closed 6 years ago

lydarren commented 6 years ago

Gather text that is labeled based on its tone/mood.

- Have a sufficiently large dataset downloaded and in a parseable format for sentiment analysis.

danqo commented 6 years ago

Kaggle Twitter Sentiment Analysis
- Source: https://www.kaggle.com/c/twitter-sentiment-analysis2/data
- Training size: 100k
- Testing size: 300k
- Classification: 0 or 1
- Format: csv
- Dropbox link for Google Colab: https://www.dropbox.com/s/h5nq52vf1rorl8z/kaggle_twitter_sentiment.zip?dl=0

Stanford Twitter Dataset
- Source: http://help.sentiment140.com/for-students/
- Also useful: https://towardsdatascience.com/another-twitter-sentiment-analysis-bb5b01ebad90
- Training size: 1.6m
- Testing size: 498
- Classification: 0 = negative, 2 = neutral, 4 = positive
- Format: csv
- Dropbox link for Google Colab: https://www.dropbox.com/s/lk7smva4l9bkimf/trainingandtestdata.zip?dl=0
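
For reference, here is a minimal sketch of loading the Stanford training CSV in Colab, assuming the Dropbox zip has been extracted and that the file follows the six-column layout described on the Sentiment140 page (polarity, id, date, query, user, text). The filename and column names are assumptions until verified against the actual archive.

```python
import pandas as pd

# Column names per the Sentiment140 description; treat these as assumptions
# until checked against the real file.
COLUMNS = ["polarity", "id", "date", "query", "user", "text"]

# Hypothetical path after unzipping the Dropbox archive.
df = pd.read_csv(
    "training.1600000.processed.noemoticon.csv",
    encoding="latin-1",  # the raw file is not valid UTF-8
    names=COLUMNS,
    header=None,
)

print(df["polarity"].value_counts())   # label distribution
print(df[["polarity", "text"]].head()) # quick sanity check of the columns
```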

I also noticed that NLTK and Naive Bayes were heavily referenced in my searches. NLTK is a tool we can consider using, and Naive Bayes could be an option to compete with an LSTM.

NLTK: http://www.nltk.org/
Stanford's sentiment analysis paper: https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf

lydarren commented 6 years ago

I like the 3 classes for the Stanford Twitter Dataset because the neutral class lets us see whether the text leans more to the positive or the negative side. If we use NLTK, we can do a quick prototype with its built-in VADER sentiment analyzer before we train the Naive Bayes classifier. NLTK also comes with a Naive Bayes classifier that we can train. Here is an example of training a NB classifier along with the usage of the VADER sentiment analyzer: http://www.nltk.org/howto/sentiment.html
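
A minimal sketch of both ideas, assuming NLTK is installed and the VADER lexicon has been downloaded. The training data here is a toy placeholder, not one of the datasets above.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.classify import NaiveBayesClassifier

nltk.download("vader_lexicon")  # one-time setup

# --- Quick prototype: VADER, no training required ---
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I aced my exam today!"))
# returns neg/neu/pos/compound scores

# --- Naive Bayes: needs labeled (features, label) pairs ---
def word_features(text):
    # Simple bag-of-words features: word -> True
    return {word.lower(): True for word in text.split()}

# Toy placeholder data; the real training set would come from the CSVs above.
train = [
    (word_features("I aced my exam"), "positive"),
    (word_features("What a terrible day"), "negative"),
    (word_features("It was an ordinary afternoon"), "neutral"),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(word_features("Today was terrible")))
```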

danqo commented 6 years ago

I agree with Darren that the 3-class classification fits our needs better. The customer's report does not have to be exclusively negative or positive.

A lot of the sentiment datasets floating around are more business/product-review oriented. Twitter sentiment data fits our needs better because it is more closely related in context, which is why I looked for Twitter sentiment datasets. Tweets capture a "moment", which is arguably similar to our use case.

E.g., someone tweeting about something that just happened is somewhat similar to someone reporting on their day.

To achieve better results, we should also do a fair amount of pre-processing. Some of the steps I've thought about are below (a rough sketch follows the list):

  1. Remove Twitter tags before analysis, because Twitter tags won't exist in our data.
  2. Correct spelling if we can. AVS "NLP" is unlikely to produce abbreviations or misspelled words because it maps voice to a dictionary of words.
  3. Remove emoticons, because the user can't express emoticons through voice.
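
A rough sketch of those steps using plain regexes; the patterns here are assumptions about what counts as a "tag" or "emoticon", not a settled design, and the spelling-correction step is left as a placeholder.

```python
import re

# Rough tweet cleanup; patterns are illustrative assumptions.
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")
URL = re.compile(r"https?://\S+")
EMOTICON = re.compile(r"[:;=8][\-o\*']?[\)\]\(\[dDpP/\\]")

def clean_tweet(text):
    text = URL.sub("", text)                  # drop links
    text = MENTION_OR_HASHTAG.sub("", text)   # 1. remove Twitter tags
    # 2. spelling correction would go here (see discussion below)
    text = EMOTICON.sub("", text)             # 3. strip emoticons
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

print(clean_tweet("Aced my exam today :) #winning @friend"))
# -> "Aced my exam today"
```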
lydarren commented 6 years ago

For spelling correction, we can probably use k-gram indexes or edit distance. One problem is that the word that gets registered may not make sense in context: in one test the phrase was "I aced my exam", but AVS took that as "I ate my exam". Another problem is that the edit distance algorithm is O(n^2) in time and space, and k-gram indexing is similarly expensive. Given that, context-sensitive spelling correction seems preferable.

This part of the IR book goes into it a bit: https://nlp.stanford.edu/IR-book/html/htmledition/spelling-correction-1.html
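
To make the cost concrete, here is the standard dynamic-programming edit distance (Levenshtein), which is quadratic in time and space for the two strings being compared. This is a generic illustration, not code from the repo.

```python
def edit_distance(a, b):
    # Classic Levenshtein DP: dp[i][j] = distance between a[:i] and b[:j].
    # Time and space are O(len(a) * len(b)), the quadratic cost mentioned above.
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[n][m]

print(edit_distance("aced", "ate"))  # small distance, which is exactly the problem
```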

Some other pre-processing we can probably do is removing stopwords (words that carry little meaning, e.g. "a", "the") and stemming (reducing words to their stem, e.g. waiting -> wait, waited -> wait). This would make it easier to figure out the classification, but I am not sure how it will affect what Alexa responds back with.
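
A minimal sketch of both using NLTK's stopword list and Porter stemmer, assuming the "stopwords" and "punkt" corpora have been downloaded; this is just an illustration of the idea, not a committed pipeline.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads (first run only).
nltk.download("stopwords")
nltk.download("punkt")

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def normalize(text):
    # Tokenize, drop stopwords and non-alphabetic tokens, then stem what's left.
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in STOPWORDS]

print(normalize("I was waiting for the results of the exam"))
# e.g. ['wait', 'result', 'exam']
```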