floorkouwenberg / TMCI_project

This repository contains our project for the course Text Mining and Collective Intelligence

Project Update 1 #1

Open laniepreston opened 4 years ago

laniepreston commented 4 years ago

Week Summary

This week, Jasmijn and Floor fixed some bugs with the lemmatizing and pre-processing pipeline.

Floor started working on the Naive Bayes classifier we will use for this project; the code can be found in the "Machine Learning" notebook. So far she has made a test/train split and identified and graphed candidate features.
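The "Machine Learning" notebook itself isn't shown here, but the steps described (train/test split, Naive Bayes, candidate bag-of-words features) could be sketched roughly as follows. All names and the toy data are illustrative assumptions, not the project's actual code:

```python
# Hypothetical sketch of a Naive Bayes review classifier with a
# train/test split, along the lines described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

reviews = ["great movie, loved it", "terrible plot, boring",
           "wonderful acting", "awful and dull",
           "really enjoyable", "not worth watching"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# Bag-of-words features; candidate features can be inspected
# (and graphed) via vectorizer.get_feature_names_out()
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42)

clf = MultinomialNB()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

With a corpus this small the accuracy is meaningless; the point is only the shape of the pipeline.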

Lanie started working with SentiWordNet to do sentiment analysis on the co-occurring words in the reviews. This code can be found in the "Cooccurrences" notebook. So far, the part-of-speech conversions and the co-occurrence extraction have been done.

Questions

  1. I (Lanie) am a bit confused on how to handle negation words during sentiment analysis, as they are not a part of speech and would make the positive and negative scores the opposite of what they're supposed to be. Is there a special way to handle this with SentiWordNet, or do I need to build a helper function for this?
Giovanni1085 commented 4 years ago
> 1. I (Lanie) am a bit confused on how to handle negation words during sentiment analysis, as they are not a part of speech and would make the positive and negative scores the opposite of what they're supposed to be. Is there a special way to handle this with SentiWordNet, or do I need to build a helper function for this?

Good question.

A simple way to better account for negations is to include negation n-grams in your data representation, for example trigrams or even four-grams starting with a negation word. Take "I do not like it" and "I do not find it bad": using "not like it" and "not find it bad" as features will probably help. There are also other approaches, e.g., using dependency parsing to detect negation scope, which could let you create appropriate features for negations.
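The negation-n-gram idea above can be sketched in a few lines: whenever a negation word starts a window, emit the whole n-gram as a single feature, so "not like it" becomes a feature distinct from "like". The negation list and function name here are illustrative assumptions:

```python
# Emit unigrams plus n-grams that start with a negation word,
# so negated phrases become their own features.
NEGATIONS = {"not", "no", "never", "n't"}

def negation_ngrams(tokens, n=3):
    """Return all unigrams, plus every n-gram starting with a negation."""
    features = list(tokens)
    for i, tok in enumerate(tokens):
        if tok in NEGATIONS:
            features.append(" ".join(tokens[i:i + n]))
    return features

print(negation_ngrams("i do not like it".split()))
# the unigrams, plus the extra trigram feature "not like it"
```

These features can be fed straight into a bag-of-words vectorizer alongside the ordinary unigrams.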

Another way to deal with negations is with neural network architectures, such as recurrent neural networks. These can capture dependencies over long sequences of text, even though each word is still represented with an embedding, as we have seen. I recommend you at least take a look at this option during your project, even though I don't necessarily expect you to work on it. Here are some resources:

  1. An intro to LSTMs, probably the most popular recurrent architecture: https://colah.github.io/posts/2015-08-Understanding-LSTMs.
  2. A great set of tutorials on using PyTorch for sentiment analysis: https://github.com/bentrevett/pytorch-sentiment-analysis. Please check the first one, even just for your info: https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb.
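The models in those tutorials boil down to an embedding layer feeding an LSTM, whose final hidden state is mapped to a single sentiment logit. A toy sketch of that shape (assumed architecture and sizes, not code from the tutorials):

```python
# Minimal embedding -> LSTM -> linear sentiment classifier in PyTorch.
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)  # one logit: positive vs. negative

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq, embed_dim)
        _, (hidden, _) = self.lstm(embedded)   # hidden: (1, batch, hidden_dim)
        return self.fc(hidden.squeeze(0))      # (batch, 1)

model = SentimentLSTM(vocab_size=100)
logits = model(torch.randint(0, 100, (2, 5)))  # batch of 2 sequences, length 5
```

Because the LSTM reads the sequence in order, "not" can shift the hidden state before "like" arrives, which is exactly why such models handle negation better than bag-of-words features.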