fani-lab / LADy

LADy 💃: A Benchmark Toolkit for Latent Aspect Detection Enriched with Backtranslation Augmentation

2018-ACL (Workshop on Economics and NLP)-Implicit and Explicit Aspect Extraction in Financial Microblogs #18

Closed farinamhz closed 1 year ago

farinamhz commented 1 year ago

Implicit and Explicit Aspect Extraction in Financial Microblogs

I chose this paper because its title states a focus on implicit aspects.

farinamhz commented 1 year ago

Main problem

This paper addresses the detection of aspects in microblog messages, a sub-task of Aspect-Based Sentiment Analysis. Its goal is to present a method for extracting financial aspects from microblog messages, thereby applying the task to a new domain: finance and economics.

Existing work

Existing methods for detecting aspects can be divided into two categories:

  1. Unsupervised: These methods use lexicons to search for explicit words linked to aspects. Such lexicon-based approaches rely on three things:

    • Frequency measures used with association measures such as Point-wise Mutual Information (PMI) to link words with lexicon entries.
    • Syntactic relations to relate core sentiment words, expressed by adjectives, to target aspect words expressed by nouns.
    • Word association measures for topic extractions and clustering methods.
  2. Supervised: These approaches rely on machine learning algorithms trained on instances labeled with aspects and then tested on new instances. Many studies have proposed variants of Conditional Random Field (CRF) models that differentiate between aspects and non-aspects in text sequences without modeling the correlation of words; other methods identify aspects on the basis of predefined aspects linked to entity-attribute pairs.

Moreover, traditional approaches have focused on explicit aspects, relying on word occurrences to determine aspects while ignoring implicit ones. Newer approaches instead target aspects that are only implicitly referred to. For instance, Dosoula et al. (2016) developed an implicit-feature algorithm that uses co-occurrences to assign implicit aspects at the sentence level in online restaurant reviews.
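The PMI-based word association mentioned above (linking candidate words to aspect words by how often they co-occur) can be sketched as follows. The corpus and word choices here are hypothetical toy data, not the paper's:

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus of tokenized messages (hypothetical data for illustration).
messages = [
    ["revenue", "declining", "fast"],
    ["revenue", "declining"],
    ["price", "rising", "fast"],
    ["price", "rising"],
]

# Count in how many messages each word and each word pair appears.
word_counts = Counter()
pair_counts = Counter()
for msg in messages:
    uniq = set(msg)
    word_counts.update(uniq)
    pair_counts.update(frozenset(p) for p in combinations(sorted(uniq), 2))

n = len(messages)

def pmi(w1, w2):
    """Point-wise Mutual Information between two words over the corpus."""
    p_joint = pair_counts[frozenset((w1, w2))] / n
    p1, p2 = word_counts[w1] / n, word_counts[w2] / n
    return math.log2(p_joint / (p1 * p2)) if p_joint > 0 else float("-inf")

# High PMI suggests "declining" is associated with the aspect word "revenue".
print(pmi("revenue", "declining"))  # positive: the words co-occur often
print(pmi("revenue", "rising"))     # -inf: the words never co-occur
```

A lexicon-based system would keep candidate words whose PMI with a lexicon entry exceeds some threshold.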

Inputs

A microblog message (e.g., a trader's post about a stock).

Outputs

The aspect class and subclass the message refers to, drawn from the paper's financial taxonomy.

Example

For example, in “$MCD with declining revenue for a good while,” “declining revenue” is extracted, which is relevant for the classification as “Revenue Down.” Also, “Revenue Down” is a sub-category of “Financial Results.”

Proposed Method

The paper presents two approaches:

1- Distributional Semantics Model (DSM): this model uses word embeddings to compute semantic relatedness in two steps. First, morpho-syntactic patterns select relevant noun and verb phrases as candidates. Second, to compute the relatedness of candidates to the classes, multi-word candidates are combined into single vectors by summing their word vectors; cosine similarity is then computed for all possible pairs of tokens in each message, and the highest score is kept.

2- Supervised Machine Learning Models: this approach frames a multi-class supervised classification problem with three stages: feature engineering, machine learning algorithm optimization, and model selection and evaluation:

  1. Feature engineering: the message vector includes the following features: Bag of Words (with three types of statistics: binary count, frequency count, and TF-IDF), POS tags, numerical features, and the predicted sentiment of the entity.
  2. Machine learning algorithm optimization: four algorithms are considered: XGBoost, Random Forests, SVM, and Conditional Random Fields. Hyperparameters were tuned with the Particle Swarm Optimisation (PSO) method.
  3. Model selection and evaluation: the best model among DSM and the supervised models was chosen using 10-fold cross-validation, and the selected model was validated with the leave-one-out method (training is repeated on all instances except one, until every instance has served as the test instance).
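The leave-one-out validation in step 3 can be sketched as follows. The classifier here is a trivial majority-class baseline on toy data (both hypothetical), standing in for the paper's tuned model; only the evaluation loop is the point:

```python
from collections import Counter

# Toy (features, label) pairs; hypothetical data for illustration.
data = [
    ("msg1", "Revenue Down"), ("msg2", "Revenue Down"),
    ("msg3", "Price Up"), ("msg4", "Revenue Down"),
]

correct = 0
for i in range(len(data)):
    train = data[:i] + data[i + 1:]   # all instances except the held-out one
    _, test_label = data[i]           # the single held-out test instance
    # Stand-in "model": predict the majority label of the training split.
    majority = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    correct += (majority == test_label)

accuracy = correct / len(data)
print(accuracy)  # fraction of held-out instances predicted correctly
```

Each instance is used exactly once as the test set, which makes the most of a small annotated corpus like the one in this paper.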

Experimental Setup

Dataset

They used a corpus of messages posted by traders and specialized in stock trading. The dataset contains 218 messages with implicit aspects and 150 messages with explicit aspects. The messages were manually classified by one financial expert according to the taxonomy provided in the paper's appendix, by matching aspect classes and subclasses to messages. This annotated dataset was used for both training and testing.

Evaluation and Metrics

In the model selection step, they computed global accuracy over 32 classes. In the validation step, they used the F1-score for both 7 and 32 classes to measure the effect of the coarse- and fine-grained annotation levels.
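The two metrics can be computed as below; the gold/predicted labels are hypothetical toy data, and this shows per-class F1 rather than any particular averaging the authors may have used:

```python
# Global accuracy and per-class F1 over toy labels (hypothetical data).
def accuracy(gold, pred):
    """Fraction of instances whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_per_class(gold, pred, cls):
    """Harmonic mean of precision and recall for one class."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["A", "A", "B", "B"]
pred = ["A", "B", "B", "B"]
print(accuracy(gold, pred))           # 0.75
print(f1_per_class(gold, pred, "A"))  # precision 1.0, recall 0.5
```

F1 is more informative than accuracy here because the 32-class label distribution is skewed, so a model can score decent accuracy while missing rare classes entirely.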

Results

In the model selection step, XGBoost scored the highest accuracy and was selected. The figure below shows the results of XGBoost on aspect classes and aspect subclasses classification, as well as implicit and explicit aspect classification. XGBoost scored 71% accuracy on the 7-aspect class classification, 82% on explicit aspect classification, and 35% on implicit aspect classification.

[Figure: summary of XGBoost results on aspect class/subclass and implicit/explicit classification]

Code

https://github.com/DeepthiSudharsan/Aspect-Extraction-of-Financial-Microblogs (Unofficial)

Presentation

No presentation was provided.

Criticism

Explicit aspect classification performed well, but implicit aspect classification could have been better; the authors state that this could be tackled with a larger dataset and better feature engineering. In general, their method is straightforward but requires a large amount of labeled data. Also, it is not a new approach: it mostly applies existing ones (somewhat like a survey) and tests them in the financial domain.