Closed: Naxter closed this issue 7 years ago.
Both are implemented. The current version only uses the data in the database, which is "just" the newly crawled data. We now need to import the dataset into the database (implementation in progress).
This will also make statistics creation easier and allow simple filtering without much code.
We can then also run experiments more easily.
A parser that imports the CSV files into the database is also implemented.
So now we can calculate sentiment and emotion for all posts and comments.
The approach is as follows (done for posts as well as comments, both saved in the post table):

- Emotion mining: sum up the emotion vectors of all words that can be found in the dictionary. This yields a vector with the distribution of emotions.
- Sentiment analysis: use CoreNLP. CoreNLP calculates the sentiment for each sentence (Very negative, Negative, Neutral, Positive, Very positive); the per-sentence results are summed up to get an end result. CoreNLP runs the annotators "tokenize, ssplit, pos, parse, sentiment" to calculate the sentiment.
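The emotion-mining step above can be sketched in a few lines. This is a minimal illustration, not the project's actual code; the tiny `LEXICON` dict is a hypothetical stand-in for EmoLex.

```python
# Minimal sketch of the emotion-mining step: for every word of a post that
# appears in the lexicon, add up its emotion vector.

EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]

LEXICON = {  # hypothetical stand-in for EmoLex entries
    "love": [0, 0, 0, 0, 1, 0, 0, 1],
    "hate": [1, 0, 1, 0, 0, 1, 0, 0],
}

def emotion_distribution(text):
    """Sum the emotion vectors of all lexicon words in the text."""
    total = [0] * len(EMOTIONS)
    for word in text.lower().split():
        vec = LEXICON.get(word.strip(".,!?"))
        if vec:
            total = [t + v for t, v in zip(total, vec)]
    return total

print(emotion_distribution("I love your car"))  # -> [0, 0, 0, 0, 1, 0, 0, 1]
```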
Perhaps a different lexicon that could be used (instead of EmoLex) is this one from Stanford, which was derived from Reddit:
https://nlp.stanford.edu/projects/socialsent/
It might be more relevant to Facebook posts.
Yeah, will try it with this one. Sounds good
Using this approach: *Exploiting a Bootstrapping Approach for Automatic Annotation of Emotions in Texts*.
To add negation handling to emotion detection, one needs to find out how a negation word can influence the emotion outcome of a sentence.
Does it invert the complete set of emotions? Does it only affect a single negative or positive emotion? Or all negative/positive emotions?
You can look up literature on this.
e.g. from this paper (although I wouldn't trust it much):
(non-exhaustive list of negations) no, not, rather, couldn’t, wasn’t, didn’t, wouldn’t, shouldn’t, weren’t, don’t, doesn’t, haven’t, hasn’t, won’t, wont, hadn’t, never, none, nobody, nothing, neither, nor, nowhere, isn’t, can’t, cannot, mustn’t, mightn’t, shan’t, without, needn’t
(diminishers) hardly, less, little, rarely, scarcely, seldom
I think that most sophisticated negation handling algorithms require dependency parsing (to actually check the range of the negation and which words it affects), but since in our case we deal with short text (posts/comments), it should also be OK to include some rules, e.g.:
Negation + positive word -> (somehow increase negative sentiment)
Negation + negative word -> (somehow increase positive sentiment)
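The two rules above could look something like this. A rough sketch only; the word lists are illustrative, not the project's actual lexica, and "increase" is modeled as a simple sign flip:

```python
# Sketch of the simple rule: a negation word directly before a positive
# word contributes negative sentiment, and vice versa.

NEGATIONS = {"no", "not", "never", "cannot"}
POSITIVE = {"good", "happy", "love"}     # illustrative word lists
NEGATIVE = {"bad", "sad", "hate"}

def simple_sentiment(tokens):
    score = 0
    negate = False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True
            continue
        if tok in POSITIVE:
            score += -1 if negate else 1
        elif tok in NEGATIVE:
            score += 1 if negate else -1
        negate = False  # here, negation only affects the next word
    return score

print(simple_sentiment("this is not good".split()))  # -> -1
```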
Thanks!
I am considering taking POS tags into account for the negation-handling "rules", to also catch phrases like "not very happy". Otherwise words like "really", "very", ... are ignored and will break the negation handling.
For the simple sentence similarity, I was wondering how to get the best results, and I actually found an interesting (recent) approach: https://openreview.net/pdf?id=SyK00v5xx with a simple implementation here: https://github.com/peter3125/sentence2vec This approach sounds really promising, so I think I will try it! Also found here (last answer): https://stackoverflow.com/questions/22129943/how-to-calculate-the-sentence-similarity-using-word2vec-model-of-gensim-with-pyt All of this is used to extend the set of annotated sentences (non-annotated sentences are annotated when their similarity to an annotated one is > 0.8).
If there are still non-annotated sentences:
I am going to use a OneVsRestClassifier with a linear SVM and multilabel classification to annotate the sentences that do not contain a single word from the emotion lexicon. I hope the recall/precision of the system will be at least 70%; otherwise this will not be a good model. Also, I am going to use TF-IDF values instead of WEKA as suggested in the paper. (TF-IDF is still the best approach.)
Indeed, this (ICLR) paper describes a very interesting approach. It's also quite new (2017), so it hasn't been widely applied yet. My point is: if it becomes too complex, just go with averaging word vectors.
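The fallback mentioned here is easy to sketch: a sentence vector is the mean of its word vectors, and similarity is cosine. The toy 3-d vectors below are illustrative stand-ins for real word2vec embeddings:

```python
import numpy as np

# Average-word-vector sentence similarity (the simple fallback).
WORD_VECS = {  # toy embeddings, not real word2vec output
    "i":    np.array([0.1, 0.0, 0.2]),
    "love": np.array([0.9, 0.1, 0.0]),
    "hate": np.array([0.8, 0.2, 0.1]),
    "cars": np.array([0.0, 0.7, 0.3]),
}

def sentence_vec(tokens):
    """Mean of the word vectors of all known tokens."""
    vecs = [WORD_VECS[t] for t in tokens if t in WORD_VECS]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(sentence_vec("i love cars".split()),
             sentence_vec("i hate cars".split()))
print(sim)  # high similarity despite opposite sentiment
```

Note this toy example already shows the weakness discussed further down: "love" and "hate" sit close in embedding space, so opposite-sentiment sentences come out as highly similar.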
Tobias and I extended EmoLex using WordNet synonyms. The synonyms were integrated into the database with the same emotion vector as the original looked-up word. The lexicon grew from 14,181 to 31,485 entries.
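The expansion step can be sketched like this. The `SYNONYMS` dict below is a hypothetical stand-in for a WordNet lookup (e.g. via NLTK's `wordnet.synsets`); each synonym inherits the emotion vector of the original word:

```python
# Sketch of the EmoLex extension: every synonym of a lexicon word gets the
# original word's emotion vector, without overwriting existing entries.

SYNONYMS = {  # illustrative synonym lists, standing in for WordNet
    "happy": ["glad", "joyful"],
    "angry": ["furious", "mad"],
}

def expand_lexicon(lexicon, synonyms):
    expanded = dict(lexicon)
    for word, emotions in lexicon.items():
        for syn in synonyms.get(word, []):
            expanded.setdefault(syn, list(emotions))  # keep existing entries
    return expanded

lexicon = {
    "happy": [0, 0, 0, 0, 1, 0, 0, 0],
    "angry": [1, 0, 0, 0, 0, 0, 0, 0],
}
expanded = expand_lexicon(lexicon, SYNONYMS)
print(len(expanded))  # -> 6
```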
We also extended the current emotion miner so that it uses simple negation handling. We are using a list of negation prefixes and suffixes. Prefixes: ["a", "de", "dis", "il", "im", "in", "ir", "mis", "non", "un"] Suffixes: ["less"]
The first rule applies when a negation word is immediately followed by an emotion word (a word that is present in our emotion database).
The second rule tries to handle adverbs and past participle verbs (POS tags: RB, VBN). If a negation word is followed by one or more words with these POS tags and then an emotion word, the emotion word is still negated.
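The two rules can be sketched over (token, POS tag) pairs as produced by a tagger such as `nltk.pos_tag`. This is an illustrative reconstruction, not the project's actual implementation; the negation word list is abbreviated:

```python
# Rule 1: a negation word immediately before an emotion word negates it.
# Rule 2: the negation may be separated from the emotion word by adverbs
# (RB) or past participles (VBN).

NEGATION_WORDS = {"not", "no", "never", "isn't", "don't"}  # abbreviated list
SKIPPABLE_TAGS = {"RB", "VBN"}

def negated_emotion_words(tagged, emotion_lexicon):
    """Return the set of emotion words that fall under a negation."""
    negated = set()
    i = 0
    while i < len(tagged):
        word, _ = tagged[i]
        if word in NEGATION_WORDS:
            j = i + 1
            # rule 2: skip RB/VBN words between negation and target
            while (j < len(tagged) and tagged[j][1] in SKIPPABLE_TAGS
                   and tagged[j][0] not in emotion_lexicon):
                j += 1
            if j < len(tagged) and tagged[j][0] in emotion_lexicon:
                negated.add(tagged[j][0])
            i = j
        i += 1
    return negated

tagged = [("not", "RB"), ("very", "RB"), ("happy", "JJ")]
print(negated_emotion_words(tagged, {"happy"}))  # -> {'happy'}
```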
There are two ways in which we can obtain the emotions of a negated word:
| | Anger | Anticipation | Disgust | Fear | Joy | Sadness | Surprise | Trust |
|---|---|---|---|---|---|---|---|---|
| Anger | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Anticipation | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| Disgust | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| Fear | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| Joy | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| Sadness | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Surprise | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| Trust | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
Example:
- Emotion vector of a word: [0,0,1,1,0,1,0,0]
- New emotion after negation of "Disgust": [0,0,0,0,0.5,0,0,0.5]
- New emotion after negation of "Fear": [0,0,0,0,1,0,0,1]
- New emotion after negation of "Sadness": [0,0,0,0,2,0,0,1]
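One reading of the matrix that reproduces the example values (including the 0.5s): negating an emotion moves its weight onto the emotions in its matrix row, with the row normalized to sum to 1, and the contributions accumulate. A sketch under that assumption:

```python
# Assumed interpretation of the negation matrix: the weight of the negated
# emotion is distributed over its (normalized) matrix row.

EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]

NEGATION_MATRIX = {
    "anger":        [0, 0, 0, 0, 1, 0, 0, 0],
    "anticipation": [0, 0, 0, 1, 0, 0, 1, 0],
    "disgust":      [0, 0, 0, 0, 1, 0, 0, 1],
    "fear":         [0, 0, 0, 0, 1, 0, 0, 1],
    "joy":          [1, 0, 1, 1, 0, 1, 0, 0],
    "sadness":      [0, 0, 0, 0, 1, 0, 0, 0],
    "surprise":     [0, 1, 0, 0, 0, 0, 0, 1],
    "trust":        [0, 0, 1, 0, 0, 0, 1, 0],
}

def negate(vector, emotion):
    """Move the weight of one emotion to its mapped emotions."""
    idx = EMOTIONS.index(emotion)
    row = NEGATION_MATRIX[emotion]
    share = vector[idx] / sum(row)  # normalize the matrix row
    result = list(vector)
    result[idx] = 0
    for i, flag in enumerate(row):
        if flag:
            result[i] += share
    return result

v = [0, 0, 1, 1, 0, 1, 0, 0]
for e in ["disgust", "fear", "sadness"]:
    v = negate(v, e)
print(v)  # -> [0, 0, 0, 0, 2.0, 0, 0, 1.0], matching the example's end result
```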
Moreover, we've added the Sentence2Vec code mentioned in the bootstrapping paper, together with an averaging word vector approach for comparison. Both approaches return similar similarity scores. The problem we've encountered is that two sentences with different emotions but the same structure are measured as very close.
Example:
Sentence 1: "I really love your car."
Sentence 2: "I really hate your car."
Sentence2Vec similarity: 0.9278
Avg vector similarity: 0.9269
This high similarity is problematic, since the emotions of the two sentences are completely different. One can see that the two models behave almost identically, and for now we cannot see any advantage of the Sentence2Vec approach over the simple average vector approach.
Furthermore, we've added an SVM implementation. It is used to annotate all sentences that couldn't be tagged by the emotion miner. It uses sklearn's multilabel OneVsRestClassifier with a LinearSVC, taking TF-IDF values as input. The input consists of a single sentence as data and an array of 8 values representing the emotions as label. With a training split of 95/5, we currently get an average precision/recall of about 0.93, without using the similarity scores.
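The setup described here can be sketched with sklearn. The toy data is illustrative (and uses only 3 emotion columns so each binary sub-problem has both classes; the real system uses all 8):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data: sentences plus a multilabel indicator matrix.
sentences = [
    "I am so happy today",
    "this makes me really angry",
    "what a wonderful surprise",
    "I am furious about this",
]
# columns: anger, joy, surprise (illustrative subset of the 8 emotions)
labels = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
])

# TF-IDF features into one LinearSVC per emotion (one-vs-rest).
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(sentences, labels)

pred = model.predict(["really angry right now"])
print(pred.shape)  # one row, one 0/1 column per emotion
```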
When the neural networks are ready, they can be combined with the results of the emotion mining, for example with a linear regression or something similar.
I am going to tag the rest of the non-annotated sentences with the SVM now, save them in the database, and also use that to finally store the emotion distribution for a post (aggregated from its comments).
As we have now mined all emotions and sentiments, this task is done.
As we said in the first phase presentation, we wanted to do sentiment analysis. In addition, we decided to also do emotion mining.
Emotion mining is done with the EmoLex (NRC) dictionary, and the sentiment analysis (first approach) is done with the Stanford CoreNLP library. We need to figure out whether this approach is sufficient for this dataset; otherwise we need to look for another sentiment library that is trained on "slang", etc.