gabrielpreda / Support-Tickets-Classification

This case study shows how to create a model for text analysis and classification and deploy it as a web service in the Azure cloud in order to automatically classify support tickets. This project is a proof of concept made by Microsoft (Commercial Software Engineering team) in collaboration with Endava (http://endava.com/en).
MIT License

Feature engineering #4

Open · gabrielpreda opened this issue 5 years ago

gabrielpreda commented 5 years ago

Continue the exploratory data analysis, perform feature engineering, add sentiment analysis-based features

vitalie-cracan commented 5 years ago

fyi

I tried a few approaches that use GloVe word representations (https://nlp.stanford.edu/projects/glove/, glove.6B.300d), but none achieved higher scores than the current ones; some were significantly lower (e.g. business_service).

Approach 1:

Use the mean GloVe representation of the subject and the mean representation of the body (i.e., treat each as a bag of words). Concatenate the two vectors and use them as features.
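
Roughly, Approach 1 looked like this (a reconstruction, since I did not keep the original code; glove_vectors is assumed to be a dict mapping words to 300-dimensional numpy arrays loaded from glove.6B.300d):

import numpy as np

def mean_vector(text, glove_vectors, dim=300):
    # Mean of the GloVe vectors of the words in the text (bag of words);
    # words missing from GloVe are simply skipped.
    words = text.strip().split() if isinstance(text, str) else []
    vectors = [glove_vectors[w] for w in words if w in glove_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim, dtype=float)

def subject_body_features(subject, body, glove_vectors):
    # 600-dimensional feature: mean subject vector followed by mean body vector.
    return np.concatenate([mean_vector(subject, glove_vectors),
                           mean_vector(body, glove_vectors)])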

Approach 2:

Use the tf-idf score of each word (from TfidfVectorizer) as its weight when summing up the vector representations of the words in the body. Use the resulting body vectors as features.

It was a surprise for me; I was expecting GloVe representations to carry more information. Searching the net, it looks like others have tried similar approaches (even training GloVe on the training corpus), only to discover the same: tf-idf scores for words produce the best results.

http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

In these approaches LogisticRegression had the highest score, but it is slow. SVM is the next best (quite close), but faster. LightGBM produced much lower scores.
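
For reference, the comparison was along these lines (a sketch, not the exact code; hyperparameters were not preserved, and the random data below is only a stand-in for the GloVe feature matrix):

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Stand-in for the 600-dimensional GloVe features and the ticket labels.
X, y = make_classification(n_samples=1000, n_features=600,
                           n_informative=50, random_state=0)

for name, clf in [("LogisticRegression", LogisticRegression(max_iter=1000)),
                  ("LinearSVC", LinearSVC()),
                  ("LightGBM", lgb.LGBMClassifier())]:
    scores = cross_val_score(clf, X, y, cv=3)
    print(name, scores.mean())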

vitalie-cracan commented 5 years ago

Code for the second approach (I did not keep the code for the first, but I can restore it if needed):


import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

class Glove():
    DEFAULT_FILE_PATH = "datasets/glove.6B.300d.txt"
    WORD_VECTOR_DIMENSION = 300

    # High-frequency and greeting words that carry little signal for
    # ticket classification; wordToVector maps them to the zero vector.
    frequent_words = ['the', 'a', 'be', 'and', 'of', 'in', 'to', 'have', 'i', 'that', 'for', 'you', 'he', 'with', 'on',
                     'dear', 'hi', 'hello', 'best', 'regards', 'thanks', 'thank', 'please']

    def __init__(self):
        print("Loading Glove vectors")
        self.glove_vectors = {}
        self.loadWordVectors()

    def loadWordVectors(self):
        # Each line of the GloVe file is "<word> <300 space-separated floats>".
        with open(self.DEFAULT_FILE_PATH, 'r', encoding='utf-8') as file:
            for line in file:
                row = line.split()
                self.glove_vectors[row[0].strip()] = np.array(row[1:]).astype(float)

    def wordToVector(self, word):
        # Stop words and out-of-vocabulary words map to the zero vector,
        # so they contribute nothing to a document's sum.
        zero = np.zeros(self.WORD_VECTOR_DIMENSION, dtype=float)
        if word in self.frequent_words:
            return zero

        word_vector = self.glove_vectors.get(word)
        if word_vector is not None:
            return word_vector

        return zero

    def textToVector(self, text):
        # Sum the vectors of all words in the text (bag of words).
        vector_sum = np.zeros(self.WORD_VECTOR_DIMENSION, dtype=float)

        if isinstance(text, float):  # missing value (NaN)
            return vector_sum

        if not isinstance(text, np.ndarray):  # raw string: tokenize on whitespace
            text = text.strip().split()

        for word in text:
            vector_sum += self.wordToVector(word)

        return vector_sum

    def subjBodyToVector(self, subject, body):
        # 600-dimensional feature: subject vector followed by body vector.
        subject_vector = self.textToVector(subject)
        body_vector = self.textToVector(body)
        return np.concatenate([subject_vector, body_vector])

glove = Glove()

class GloveVectorizer(TfidfVectorizer):
    # Replaces each document with the tf-idf-weighted sum of the GloVe
    # vectors of its words (the second approach above).

    def fit_transform(self, X, y=None):
        self.tfidf = super().fit_transform(X, y)
        return self.toGloveFrame(self.tfidf)

    def transform(self, X, y=None):
        # Reuse the vocabulary and idf weights learned during fit_transform;
        # refitting here would leak information from the evaluation data.
        tfidf = super().transform(X)
        return self.toGloveFrame(tfidf)

    def toGloveFrame(self, tfidf):
        def toGlove(pair):
            (i, words) = pair
            result = np.zeros(glove.WORD_VECTOR_DIMENSION, dtype=float)
            for word in words:
                j = self.vocabulary_[word]  # column index of this word
                result += tfidf[i, j] * glove.wordToVector(word)
            return result

        # inverse_transform yields, per document, the words with non-zero scores.
        newX = self.inverse_transform(tfidf)
        return pd.DataFrame(data=list(map(toGlove, enumerate(newX))))

To use it, simply replace CountVectorizer + TfidfTransformer (i.e., TfidfVectorizer) with GloveVectorizer.
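
For example, in a scikit-learn pipeline (a sketch; the actual pipeline in the repo may be wired differently, and train_texts, train_labels, test_texts are placeholders):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# GloveVectorizer (defined above) takes the place of the usual
# CountVectorizer + TfidfTransformer pair.
pipeline = Pipeline([
    ('features', GloveVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])
# pipeline.fit(train_texts, train_labels)
# predictions = pipeline.predict(test_texts)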

gabrielpreda commented 5 years ago

It is a surprise for me to hear about the low score with LightGBM. You might want to push this work, adding GloVe as one of the options in the pre-processing part of the ML pipeline. It is worth keeping these variants; we might be able to improve on them later on (or even start building an ensemble approach).

vitalie-cracan commented 5 years ago

Today I tried FastText: https://fasttext.cc/docs/en/supervised-tutorial.html

I used the category as the label and the body as the text.

./fasttext supervised -input examples/stc/tickets.train -output examples/stc/model -lr 0.1 -epoch 25 -wordNgrams 2

tickets.train contains 40,000 items; the remaining 8,538 are in tickets.valid.
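
Both files use the fastText supervised input format: one example per line, with the label carrying the __label__ prefix. I generated them roughly like this (a sketch; the CSV path and column names are assumptions):

import pandas as pd

tickets = pd.read_csv('datasets/all_tickets.csv')  # assumed path and columns

def write_fasttext(df, path):
    # fastText supervised format: __label__<category> <text>
    with open(path, 'w', encoding='utf-8') as f:
        for _, row in df.iterrows():
            text = str(row['body']).replace('\n', ' ')
            f.write('__label__{} {}\n'.format(row['category'], text))

write_fasttext(tickets[:40000], 'examples/stc/tickets.train')
write_fasttext(tickets[40000:], 'examples/stc/tickets.valid')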

./fasttext test examples/stc/model.bin examples/stc/tickets.valid
N       8538
P@1     0.821
R@1     0.821
Number of examples: 8538

Looks like a good score, but not much better than what we have already. The advantage is that fasttext is indeed very fast to train.

Note: I will clean things up and push the Glove trials, hopefully some time next week.

vitalie-cracan commented 5 years ago

@gabrielpreda I do not have permission to push a new branch; maybe you could restrict only the master branch but allow me to create new branches?