Demonstration/walkthrough proposal on how to use ClassBalance (and variations) to bin continuous values for classification

rebeccabilbro commented 6 years ago

I. Title: Better Binning for Classification Problems: Creating Categorical Values from Continuous Values

II. Premise

A lot of machine learning problems in the real world suffer from the curse of dimensionality; you have fewer training instances than you’d like, and predictive signal is distributed (often unpredictably!) across many different features.
One example is when your target is continuously-valued, but there aren’t enough instances to predict these values to the precision of regression.
What if we transform the regression problem into a classification problem? We can try to do this by binning the continuous values into buckets for classification. But how do we pick the bins?

III. Dataset Intro

About the Pitchfork album reviews corpus - funny! snarky! sentiment analysis??
Download the data from https://www.kaggle.com/nolanbconaway/pitchfork-data/data
Custom CorpusReader to access the text and scores
Custom TextNormalizer to lemmatize and remove stop words
use NumPy's digitize method to naively bin the continuous target values
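
As a rough illustration of the naive binning step, the sketch below applies `np.digitize` with equal-width bin edges. The scores and the edges are made up for the example (they are not taken from the Pitchfork data), and the choice of four bins is an assumption.

```python
import numpy as np

# Hypothetical review scores on a 0-10 scale (invented for illustration).
scores = np.array([0.5, 3.2, 6.1, 6.8, 7.0, 7.4, 7.8, 8.0, 8.3, 9.6])

# Naive choice: four equal-width bins over the score range.
# Only the interior edges are passed to np.digitize.
edges = np.array([2.5, 5.0, 7.5])

# For each score, np.digitize returns the index of the bin it falls in:
# 0 for x < 2.5, 1 for 2.5 <= x < 5.0, 2 for 5.0 <= x < 7.5, 3 for x >= 7.5.
labels = np.digitize(scores, edges)

# Count how many instances land in each bin.
counts = np.bincount(labels, minlength=len(edges) + 1)

print(labels)
print(counts)
```

With scores that cluster toward the high end, equal-width edges leave the low bins nearly empty, which is the kind of imbalance the later sections set out to diagnose.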

IV. Preliminary Text Analytics Pipeline

Build using Scikit-Learn Pipeline
use ConfusionMatrix to visually evaluate
use ClassBalance to visualize imbalance
talk through selection bias - why initial bins didn’t work
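
A minimal sketch of what such a pipeline might look like, assuming a TF-IDF vectorizer and a naive Bayes classifier (neither is named in the outline). The toy documents and labels are invented, the custom TextNormalizer step is left out, and scikit-learn's plain `confusion_matrix` function stands in for the Yellowbrick ConfusionMatrix visualizer to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny invented corpus with already-binned labels (0 = low, 1 = high).
docs = [
    "a dull and lifeless record",
    "tedious songs with lifeless production",
    "a dull tedious mess",
    "a thrilling and inventive record",
    "inventive songs with thrilling production",
    "a thrilling inventive triumph",
]
labels = [0, 0, 0, 1, 1, 1]

# Chain vectorization and classification so they are fit together.
model = Pipeline([
    ("vectorize", TfidfVectorizer()),
    ("classify", MultinomialNB()),
])
model.fit(docs, labels)

# Evaluated on the training documents only to keep the sketch short;
# a real evaluation would use a held-out test split.
predicted = model.predict(docs)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(labels, predicted, labels=[0, 1])
print(matrix)
```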

V. Tuning Bins with ClassBalance

redo with better distributed bins for target values
(hopefully) show better results
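
One possible way to get better distributed bins is to place the edges at quantiles of the scores instead of at equal widths. The outline does not say how the bins will be chosen, so this is only a sketch of one option, using the same invented scores as an assumption.

```python
import numpy as np

# Hypothetical review scores on a 0-10 scale (invented for illustration).
scores = np.array([0.5, 3.2, 6.1, 6.8, 7.0, 7.4, 7.8, 8.0, 8.3, 9.6])

n_bins = 4

# Interior quantile levels (25%, 50%, 75% for four bins).
levels = np.linspace(0, 1, n_bins + 1)[1:-1]

# Edges at those quantiles put roughly the same number of scores in each bin.
edges = np.quantile(scores, levels)

# Assign each score to a bin and count the instances per bin.
labels = np.digitize(scores, edges)
counts = np.bincount(labels, minlength=n_bins)

print(edges)
print(counts)
```

The trade-off is that quantile bins have unequal widths, so a class label no longer corresponds to a fixed-size score range.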

VI. Conclusion/Teaser for New ClassBalanceHeatmap Visualizer

how to combine insight from ConfusionMatrix with interpretability of ClassBalance?

Reviewers: @marskar @lwgray @yzyzy

lwgray commented 6 years ago

You beat me to it... But here is my attempt at transcribing your notes: Sane_Binning.pdf

lwgray commented 6 years ago

Has the coding for the corpus reader, TextNormalizer, and Pipeline been completed?

rebeccabilbro commented 6 years ago

Hey there @lwgray - yes; here's the TextNormalizer:

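import unicodedata

import nltk
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.base import BaseEstimator, TransformerMixin

class TextNormalizer(BaseEstimator, TransformerMixin):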

    def __init__(self, language='english'):
        self.stopwords  = set(nltk.corpus.stopwords.words(language))
        self.lemmatizer = WordNetLemmatizer()

    def is_punct(self, token):
        return all(
            unicodedata.category(char).startswith('P') for char in token
        )

    def is_stopword(self, token):
        return token.lower() in self.stopwords

    def normalize(self, document):
        return [
            self.lemmatize(token, tag).lower()
            for sentence in document
            for (token, tag) in sentence
            if not self.is_punct(token)
               and not self.is_stopword(token)
        ]

    def lemmatize(self, token, pos_tag):
        tag = {
            'N': wn.NOUN,
            'V': wn.VERB,
            'R': wn.ADV,
            'J': wn.ADJ
        }.get(pos_tag[0], wn.NOUN)

        return self.lemmatizer.lemmatize(token, tag)

    def fit(self, documents, y=None):
        return self

    def transform(self, documents):
        return [
            ' '.join(self.normalize(doc)) for doc in documents
        ]

I should have some time to work on the draft tomorrow, so I'll post the corpus reader and preprocessor then!

lwgray commented 6 years ago

If you need any help with the coding, just let me know 😄

lwgray commented 6 years ago

I know you have done all the analysis but I produced this so I could better understand the data and possible workflow

Confusion Matrix and Class Balance: Before Class Adjustment

[Figures: confusionmatrix-before, classbalance-before]

Confusion Matrix and Class Balance: After Class Adjustment

[Figures: confusionmatrix-after, classbalance-after]

rebeccabilbro commented 6 years ago

@lwgray nice! Would you be interested in taking a crack at pulling the prototype code that @bbengfort wrote into a new ClassBalanceHeatMap visualizer using the Yellowbrick API? It would be awesome to be able to reference the work-in-progress in my post, and then maybe you could do a follow-up post on creating a new Yellowbrick visualizer?

Here's the prototype code:

import numpy as np 
import matplotlib.pyplot as plt 

from matplotlib import cm 
from sklearn.utils.multiclass import unique_labels 
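# _check_targets is a private scikit-learn helper; its module path and
# return values may differ between scikit-learn versions
from sklearn.metrics._classification import _check_targets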

def plot_class_balance_preds(y_true, y_pred, labels=None, ax=None):
    # Use Sklearn tools to validate the target 
    # Note y_true and y_pred should already be label encoded 
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    indices = unique_labels(y_true, y_pred)

    # Create a 2D numpy array where each row is one predicted class and 
    # each column is one true class, so data[p][t] counts the instances 
    # of true class t that were predicted as class p 
    data = np.array([
        [(y_pred[y_true==label_t] == label_p).sum() for label_t in indices]
        for label_p in indices 
    ])

    # Ensure that the number of elements in data matches y_pred and y_true 
    # Not necessary but used as a sanity check
    assert data.sum() == len(y_pred) == len(y_true)

    # indices holds the class values present in the data, labels holds their string names 
    # Another sanity check; it only compares lengths, so it will not catch missing classes, which is bad
    labels = labels if labels is not None else indices
    assert len(labels) == len(indices)

    # Create a matplotlib axis 
    if ax is None:
        _, ax = plt.subplots()

    # Create a unique color for each predicted class 
    # (cm.spectral was removed from matplotlib; nipy_spectral is its replacement)
    colors = [cm.nipy_spectral(x) for x in np.linspace(0, 1, len(indices))]

    # Track the stack of the bar graph 
    prev = np.zeros(len(labels))

    # Plot each row 
    for idx, row in enumerate(data):
        ax.bar(indices, row, label=labels[idx], bottom=prev, color=colors[idx])
        prev += row 

    # Make the graph pretty 
    ax.set_xticks(indices)
    ax.set_xticklabels(labels)
    ax.set_xlabel("actual class")
    ax.set_ylabel("number of predicted class")

    # Put the legend outside of the graph 
    plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left")
    plt.tight_layout(rect=[0,0,0.85,1])

    return ax

## Usage 
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split as tts
from sklearn.ensemble import GradientBoostingClassifier

digits = load_digits()
X_train, X_test, y_train, y_true = tts(digits.data, digits.target, test_size=0.33)

model = GradientBoostingClassifier() 
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
g = plot_class_balance_preds(y_true, y_pred, labels=digits.target_names)

plt.show()

If you're looking for a model for the Yellowbrick PR process, you can check out the one @bbengfort did that @ndanielsen and I are reviewing now.

rebeccabilbro commented 6 years ago

FYI @marskar @lwgray @yzyzy - Still have some sections to flesh out, but you can see my draft in development here.

lwgray commented 6 years ago

I will give it a shot. However, this is my first implementation of a visualizer. I might need your assistance if I get stuck. 😟

rebeccabilbro commented 6 years ago

Awesome! Sure thing @lwgray -- definitely start by checking out the Visualizer API description in the docs. The best strategy is to open a pull request early that describes and scopes the task so that we maintainers can be ready and available to assist as needed. Look forward to seeing what you come up with!

lwgray commented 6 years ago

@rebeccabilbro I read through your drafts 3 times.... I think it is fun and fairly thorough. The only question I have is, will you discuss changing your binning so that the classes are better balanced?

Also, do you think we could talk for 10 minutes at the March 14th meeting? I would like to understand some of your coding decisions.

Juan0001 commented 6 years ago

@rebeccabilbro I read through the draft and think it's great. My one question: it seems to me that you are trying to see whether a ClassBalanceHeatmap visualizer can be built so you can see, in a bar chart, where your predictions go wrong for different classes. However, this doesn't seem to solve the original problem of unbalanced classes, which arises when you assign the scores to classes.

Juan0001 commented 6 years ago

@rebeccabilbro I created a visualizer to help with balanced binning. You can create balanced bins based on the reference values produced by the visualizer. You can check it out at the following link: https://github.com/Juan0001/yellowbrick-balanced-bin-reference/blob/master/balanced_binning.ipynb

I was trying to add the package to Yellowbrick, but I don't have permission. Could you help me? Thank you.

bbengfort commented 6 years ago

@Juan0001 very nice! We'd love to review your work and add it to Yellowbrick. The way we do this is to have you fork the Yellowbrick repository into your own GitHub account; once you've done that you can create a pull request so that we can go over your additions, and once approved we merge them into Yellowbrick.

Detailed instructions are here: Contributing to Yellowbrick but of course we're happy to go over it with you tonight.

rebeccabilbro commented 6 years ago

Ok, draft now published here.

lwgray commented 6 years ago

@rebeccabilbro @Juan0001 To me it isn't obvious what the reader can get out of visiting Juan's Jupyter notebook. Maybe say something like "If you are looking for an automated way to create balanced binning, then check out this visualizer in the works"

rebeccabilbro commented 6 years ago

@lwgray - I believe @Juan0001's notebook is in draft form and she's still working on fleshing out some of the possible use cases.

Juan0001 commented 6 years ago

@rebeccabilbro Thank you very much! @lwgray You are right, the notebook version is not very clear yet. It's the first version of the draft. In later versions I will explain why I created this function, what it can do, and how to use it in more detail. Please let me know if there's anything else I need to improve. Thank you very much.

DistrictDataLabs / yellowbrick

Demonstration/walkthrough proposal on how to use ClassBalance (and variations) to bin continuous values for classification #312
