We gathered a new data set from students at the University of Michigan and developed a new method for targeted sentiment analysis. We perform both entity extraction and sentiment analysis over the extracted entities, showing improvements over previous work on a similar task.
Using natural language processing and machine learning techniques, we build a two-stage pipeline: an entity extraction system, formulated as a sequence labeling task, feeds into a sentiment analysis classifier that takes a target entity and its surrounding text as input and labels the expressed sentiment toward the entity as positive, negative, or neutral. The development of new domain-specific features for both parts of the pipeline leads to improvements over several baselines.
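Conceptually, the pipeline has the following shape. This is a minimal sketch with hypothetical function names, not the actual module layout of this repository:

def extract_entities(tokens):
    # Stage 1: a CRF sequence labeler tags tokens with entity identifiers,
    # which are then merged into typed entities (stubbed out here).
    return []

def classify_sentiment(entity, tokens):
    # Stage 2: a classifier looks at the target entity and its surrounding
    # text and outputs 'positive', 'negative', or 'neutral' (stubbed out here).
    return 'neutral'

def run_pipeline(tokens):
    return [(entity, classify_sentiment(entity, tokens))
            for entity in extract_entities(tokens)]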
To reuse the code for new types of entities, define the entity types you wish to support in the settings.py file, using the ENT_TYPES dictionary, which has a key for the name of each type. The annotated data must contain '<type' lines identifying each entity. See the following example:
# Kathe Halverson was the only aspect of EECS 555 Parallel Computing that I liked
# <instructor
# name=Kathe Halverson
# sentiment=positive>
# <class
# id=555
# name=Parallel Computing
# sentiment=negative>
ENT_TYPES = {'instructor': ['name'],
             'class': ['name', 'department', 'id']}
The rest of the code will then use these types. The lists of identifiers in this dictionary are used to mark token types for the CRF tagger. When merging the tagged tokens into entities, the code combines identifiers belonging to the same entity type until it sees another identifier of a kind it has already collected for the current entity. For example, in an utterance that lists '492, the AI class and the data structures class', it merges the id '492' and the name 'AI' into one class entity, but when it sees the second name, 'data structures', it creates a second entity.
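The merging rule can be illustrated as follows. This is a simplified, standalone sketch of the behavior described above, not the repository's actual implementation; the input format is assumed for the example:

def merge_identifiers(tagged):
    # tagged: (entity_type, identifier, value) triples in utterance order,
    # e.g. [('class', 'id', '492'), ('class', 'name', 'AI'),
    #       ('class', 'name', 'data structures')]
    entities = []
    current, current_type = {}, None
    for ent_type, ident, value in tagged:
        # Start a new entity when the entity type changes or when we see a
        # second identifier of a kind the current entity already has.
        if ent_type != current_type or ident in current:
            if current:
                entities.append((current_type, current))
            current, current_type = {}, ent_type
        current[ident] = value
    if current:
        entities.append((current_type, current))
    return entities

# The example above yields two class entities:
# [('class', {'id': '492', 'name': 'AI'}),
#  ('class', {'name': 'data structures'})]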
The following packages need to be installed:
pip install python-crfsuite nltk sklearn scipy numpy pyparsing
Furthermore, the project requires Java 8. One way to install it on Ubuntu is:
sudo add-apt-repository ppa:webupd8team/java
sudo apt update; sudo apt install oracle-java8-installer oracle-java8-set-default
The Python wrapper we are using for Stanford CoreNLP only works with Python 2.
The following steps are required to set up the remaining dependencies. To install the pywrapper, we ran:
git clone https://github.com/brendano/stanford_corenlp_pywrapper dependencies/
ln -s dependencies/stanford_corenlp_pywrapper .
Extract Stanford CoreNLP into the dependencies folder. The code has been tested with the 2015-04-20 and 2018-02-27 versions.
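To check that the wrapper and CoreNLP are wired together correctly, a quick smoke test along these lines should work (the jar path below assumes the 2015-04-20 release was extracted into dependencies/; adjust it to the version you downloaded):

from stanford_corenlp_pywrapper import CoreNLP
# Point the wrapper at the extracted CoreNLP jars (path is an assumption).
proc = CoreNLP("pos", corenlp_jars=["dependencies/stanford-corenlp-full-2015-04-20/*"])
print(proc.parse_doc("I loved EECS 555.")["sentences"][0]["tokens"])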
The dataset can be downloaded from http://web.eecs.umich.edu/~mihalcea/downloads/targetedSentiment.2017.tar.gz and should be placed in this folder.
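For example, on the command line (assuming a standard tar layout; check where the archive unpacks):

wget http://web.eecs.umich.edu/~mihalcea/downloads/targetedSentiment.2017.tar.gz
tar xzf targetedSentiment.2017.tar.gz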
The following lexicons are included in the data folder:
- Bing Liu's opinion lexicon, with the files renamed to neg_words and pos_words.
- The MPQA subjectivity lexicon, with stray 'm' characters removed from lines 5549 and 5550.
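If you want to inspect or reuse the lexicons directly, they can be read as plain word lists. This is a minimal sketch assuming one word per line and ';'-prefixed header comments, as in Bing Liu's original distribution:

def load_lexicon(path):
    # One word per line; skip blanks and the ';' header comments that
    # Bing Liu's original files carry (an assumption about the renamed files).
    with open(path) as f:
        return set(line.strip() for line in f
                   if line.strip() and not line.startswith(';'))

pos_words = load_lexicon('data/pos_words')
neg_words = load_lexicon('data/neg_words')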
If you use this code, please cite:
@inproceedings{Welch16Targeted,
  author    = {Welch, C. and Mihalcea, R.},
  title     = {Targeted Sentiment to Understand Student Comments},
  booktitle = {Proceedings of the International Conference on Computational Linguistics (COLING 2016)},
  address   = {Osaka, Japan},
  year      = {2016}
}