deepandas11 / HAN-and-Data-Augmentation-Text-Classifier

ECE 539 Academic Project on Toxic Comments Dataset

Exploring Hierarchical Attention based Deep Neural Networks for Toxic Comments Classification

Introduction

Text classification is a task in NLP in which predefined categories are assigned to text documents. The two widely used approaches can be represented as follows:

(figure: text classification workflows, classical machine learning vs. deep learning)

One major difference between classical machine learning and deep learning approaches is that, in the deep learning approach, feature extraction and classification are carried out jointly rather than as separate stages.

Proper feature representation is important in any machine learning task. Here, the data consists of text comments, which are made up of sentences, and those sentences are in turn made up of words. A central challenge in NLP is representing such text in a suitable numerical format. We use vector space models to represent the text corpus; the standard Boolean model is the main alternative, but it is far less expressive.

To represent the words in the text corpus we use GloVe embedding vectors. GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations exhibit interesting linear substructures of the word vector space.

We choose GloVe rather than randomly initialized embeddings because the pre-trained vectors come with useful properties: semantically related words lie close together in the vector space, and differences between vectors capture analogies such as king - man + woman ≈ queen.
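As a concrete illustration, the sketch below shows one common way to load pre-trained GloVe vectors and build an embedding matrix aligned with a Keras tokenizer's word index. The file name, embedding dimension, and vocabulary cap are illustrative assumptions rather than values fixed by this project.

```python
import numpy as np

EMBEDDING_DIM = 100   # assumed; the glove.6B release ships 50/100/200/300-d files
MAX_VOCAB = 20000     # assumed cap on vocabulary size

def load_glove(path="glove.6B.100d.txt"):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            values = line.rstrip().split(" ")
            embeddings[values[0]] = np.asarray(values[1:], dtype="float32")
    return embeddings

def build_embedding_matrix(word_index, embeddings):
    """Row i holds the vector for the tokenizer's word i; unseen words stay zero."""
    num_words = min(MAX_VOCAB, len(word_index) + 1)
    matrix = np.zeros((num_words, EMBEDDING_DIM))
    for word, i in word_index.items():
        if i < num_words and word in embeddings:
            matrix[i] = embeddings[word]
    return matrix
```

The resulting matrix can be used to initialize the `Embedding` layer of the models described in the later sections.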

1. Data Visualization and Feature Analysis

Before we explore the deep learning models, we make sure that the dataset is well understood and that it is suitable for deep learning based methods; this notebook covers the data visualization and feature analysis steps.
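The sketch below illustrates the kind of exploration involved, assuming the Kaggle Jigsaw toxic-comment CSV layout with a `comment_text` column and six binary label columns; the exact file paths and plots used in the notebook may differ.

```python
import pandas as pd
import matplotlib.pyplot as plt

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

train = pd.read_csv("train.csv")  # assumed path to the Jigsaw training split

# Class balance: number of comments carrying each toxicity label.
train[LABELS].sum().sort_values().plot(kind="barh", title="Comments per label")
plt.tight_layout()
plt.show()

# Comment lengths in words, to help choose a padding length later.
lengths = train["comment_text"].str.split().str.len()
print(lengths.describe())
```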

2. Baseline LSTM Model with Data Augmentation

Recurrent Neural Networks and their derivatives are a well-researched topic in NLP. RNNs use internal memory to process input sequences. Humans understand each word based on their understanding of the words that came before, because our thoughts have persistence. Traditional neural networks cannot capture this, but RNNs contain loops that allow information to persist. The chain-like structure below lets RNNs model sequences and lists naturally.

(figure: an unrolled recurrent neural network)

Long-term dependencies: Sometimes only very recent information is needed to perform the present task; for example, predicting the next word in a sentence may require only a very limited context. In other cases, much more context is needed. As the gap between where the relevant context appears and where it is needed grows wider, plain RNNs become unable to learn to connect the information.

(figure: the long-term dependency problem in RNNs)

In order to handle such long-term dependencies, we turn to LSTMs.

LSTM

The key structural change in LSTMs is that the repeating unit contains four interacting neural network layers rather than the single layer found in a plain RNN. The LSTM maintains a cell state that runs along the entire sequence, and gates regulate which information enters and how it affects that cell state.

(figure: the repeating unit of an LSTM)
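For reference, the four structures are the three sigmoid gates (forget, input, output) and the tanh candidate update; in the notation of Colah's blog:

$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) \\
\tilde{C}_t &= \tanh(W_C\,[h_{t-1}, x_t] + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

The cell state $C_t$ is the running memory: the forget and input gates control what is dropped from and added to it, and the output gate decides which part of it becomes the hidden state $h_t$.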

Modifications to this LSTM structure can include:

  1. Peephole connections, which feed the cell state into all of the gating operations.
  2. Gated Recurrent Units (GRUs), which combine the forget and input gates into a single update gate and merge the cell state with the hidden state (a drop-in swap in Keras is sketched below).
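In Keras, for example, trying the GRU variant is essentially a drop-in swap of the recurrent layer; the width of 64 units here is a placeholder, not the project's setting.

```python
from tensorflow.keras.layers import LSTM, GRU, Bidirectional

# Interchangeable recurrent encoders; the GRU has fewer parameters because it
# merges the forget and input gates and folds the cell state into the hidden state.
lstm_encoder = Bidirectional(LSTM(64, return_sequences=True))
gru_encoder = Bidirectional(GRU(64, return_sequences=True))
```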

To keep the model from overfitting too soon, we use data augmentation: TextBlob's translation API is applied to the English dataset and the translated comments are added as extra training data. In the notebook, we combine this augmented data with a baseline LSTM model and GloVe embeddings, which leads to the following results:

(figure: baseline LSTM results on the augmented dataset)
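For reference, here is a minimal sketch of a baseline of this kind in `tensorflow.keras`, using the GloVe embedding matrix built earlier. The layer widths, dropout rate, and frozen embeddings are assumptions, and the notebook's actual hyperparameters may differ; the augmentation step itself (translating comments with TextBlob) is not shown.

```python
from tensorflow.keras import initializers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

NUM_LABELS = 6  # one sigmoid output per toxicity label in the Jigsaw dataset

def build_baseline(embedding_matrix):
    vocab_size, embedding_dim = embedding_matrix.shape
    model = Sequential([
        Embedding(vocab_size, embedding_dim,
                  embeddings_initializer=initializers.Constant(embedding_matrix),
                  trainable=False),               # keep the GloVe vectors frozen
        Bidirectional(LSTM(64)),                  # one vector summarising the comment
        Dropout(0.3),
        Dense(NUM_LABELS, activation="sigmoid"),  # independent per-label probabilities
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

# model = build_baseline(embedding_matrix)
# model.fit(padded_comments, train[LABELS].values, validation_split=0.1, epochs=3)
```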

Reading: Colah's blog post, Understanding LSTM Networks (https://colah.github.io/posts/2015-08-Understanding-LSTMs/)

3. Hierarchical Attention Network Model

Flat neural network methods are effective, but a better representation can be achieved by building knowledge of the document structure into the model architecture. The hierarchical attention used here operates on two levels:

  1. Sentence representations are built from words.
  2. Comment representations are built from sentences.
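Concretely, this hierarchy means each comment is encoded as a grid of word indices (sentences by words), and the whole dataset as a 3-D tensor. Below is a minimal preprocessing sketch, assuming NLTK's sentence tokenizer and placeholder size caps; the notebook may split sentences differently.

```python
import numpy as np
from nltk.tokenize import sent_tokenize   # requires: nltk.download("punkt")
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_SENTS = 10      # assumed cap on sentences per comment
MAX_SENT_LEN = 40   # assumed cap on words per sentence

def to_hierarchical(comments, tokenizer):
    """Encode comments as a (num_comments, MAX_SENTS, MAX_SENT_LEN) index tensor."""
    data = np.zeros((len(comments), MAX_SENTS, MAX_SENT_LEN), dtype="int32")
    for i, comment in enumerate(comments):
        sentences = sent_tokenize(comment)[:MAX_SENTS]
        encoded = tokenizer.texts_to_sequences(sentences)
        data[i, :len(encoded)] = pad_sequences(encoded, maxlen=MAX_SENT_LEN)
    return data

# tokenizer = Tokenizer(num_words=20000); tokenizer.fit_on_texts(train["comment_text"])
# X = to_hierarchical(train["comment_text"], tokenizer)
```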

In the following example, we can see that the third sentence carries the strongest signal, and that the words amazing and superb provide the most defining sentiment of that sentence.

(figure: example of word- and sentence-level attention)

The Hierarchical Attention Network is made of several parts:

  1. A word sequence encoder: produces annotations for each word by summarizing information from both directions.
  2. A word-level attention layer: not all words contribute equally to a sentence's meaning, so this layer identifies the informative words and aggregates their representations into a sentence vector.
  3. A sentence encoder: a bidirectional LSTM that encodes the sequence of sentence vectors.
  4. A sentence-level attention layer: rewards sentences that provide the strongest clues for correctly classifying a document.

(figure: Hierarchical Attention Network architecture)
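A compact sketch of how such a network could be wired in `tensorflow.keras`, with the attention pooling written as a custom layer. The layer widths, the frozen GloVe embeddings, and the size caps (which must match the preprocessing sketch above) are illustrative assumptions, not the project's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, initializers

MAX_SENTS, MAX_SENT_LEN = 10, 40   # must match the preprocessing sketch above
NUM_LABELS = 6

class AttentionPool(layers.Layer):
    """Additive attention pooling: score each timestep against a learned
    context vector, then return the attention-weighted sum of the inputs."""
    def __init__(self, units=64, **kwargs):
        super().__init__(**kwargs)
        self.proj = layers.Dense(units, activation="tanh")
        self.context = layers.Dense(1, use_bias=False)

    def call(self, inputs):
        u = self.proj(inputs)                            # (batch, steps, units)
        weights = tf.nn.softmax(self.context(u), axis=1) # (batch, steps, 1)
        return tf.reduce_sum(weights * inputs, axis=1)   # (batch, features)

def build_han(embedding_matrix):
    vocab_size, embedding_dim = embedding_matrix.shape

    # Word level: encode a single sentence into one vector.
    sent_in = layers.Input(shape=(MAX_SENT_LEN,), dtype="int32")
    x = layers.Embedding(vocab_size, embedding_dim,
                         embeddings_initializer=initializers.Constant(embedding_matrix),
                         trainable=False)(sent_in)
    x = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(x)
    sent_encoder = models.Model(sent_in, AttentionPool()(x))

    # Sentence level: encode every sentence, then attend over the sentence vectors.
    doc_in = layers.Input(shape=(MAX_SENTS, MAX_SENT_LEN), dtype="int32")
    s = layers.TimeDistributed(sent_encoder)(doc_in)
    s = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(s)
    out = layers.Dense(NUM_LABELS, activation="sigmoid")(AttentionPool()(s))

    model = models.Model(doc_in, out)
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
```

The word-level encoder is wrapped in `TimeDistributed` so that the same sentence encoder is applied to every sentence of a comment before the sentence-level attention produces the final document vector.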

We use the augmented dataset and achieve the following results:

(figure: HAN results on the augmented dataset)

As one can see, the accuracy of the HAN-based method is slightly worse than that of the traditional baseline LSTM model. The figure below helps explain why: only a small fraction of the comments contain many sentences, and even those that do are not long enough for the sentence-level attention layer to learn meaningful weightings. Our understanding is that this limitation of the dataset accounts for the result obtained from the HAN model.

(figure: distribution of the number of sentences per comment)
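The observation above can be checked directly; a small sketch, again assuming the Jigsaw CSV layout and NLTK's sentence tokenizer:

```python
import pandas as pd
from nltk.tokenize import sent_tokenize  # requires: nltk.download("punkt")

train = pd.read_csv("train.csv")  # assumed path to the Jigsaw training split

# Distribution of sentences per comment: most comments contain only a handful.
sent_counts = train["comment_text"].apply(lambda c: len(sent_tokenize(c)))
print(sent_counts.describe())
print("share of comments with 10+ sentences:", (sent_counts >= 10).mean())
```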