Closed rebeccabilbro closed 6 years ago
You beat me to it... But here is my attempt at transcribing your notes Sane_Binning.pdf
Has the coding for the corpus reader, TextNormalizer, and Pipeline been completed?
Hey there @lwgray - yes; here's the TextNormalizer:
import unicodedata

import nltk
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.base import BaseEstimator, TransformerMixin


class TextNormalizer(BaseEstimator, TransformerMixin):

    def __init__(self, language='english'):
        self.stopwords = set(nltk.corpus.stopwords.words(language))
        self.lemmatizer = WordNetLemmatizer()

    def is_punct(self, token):
        # True when every character is in a Unicode punctuation category
        return all(
            unicodedata.category(char).startswith('P') for char in token
        )

    def is_stopword(self, token):
        return token.lower() in self.stopwords

    def normalize(self, document):
        # document is a list of sentences, each a list of (token, tag) pairs
        return [
            self.lemmatize(token, tag).lower()
            for sentence in document
            for (token, tag) in sentence
            if not self.is_punct(token)
            and not self.is_stopword(token)
        ]

    def lemmatize(self, token, pos_tag):
        # Map the first letter of the Penn Treebank tag to a WordNet POS,
        # defaulting to noun
        tag = {
            'N': wn.NOUN,
            'V': wn.VERB,
            'R': wn.ADV,
            'J': wn.ADJ
        }.get(pos_tag[0], wn.NOUN)
        return self.lemmatizer.lemmatize(token, tag)

    def fit(self, documents, y=None):
        return self

    def transform(self, documents):
        return [
            ' '.join(self.normalize(doc)) for doc in documents
        ]
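A note for anyone reading along: `normalize` expects each document to be a list of sentences, and each sentence a list of `(token, tag)` pairs. Below is a rough, dependency-free sketch of just the filtering steps. It assumes a tiny made-up stopword set (`STOPWORDS`) and a hypothetical helper `filter_tokens`, and it skips lemmatization so it runs without NLTK data. It illustrates the idea and is not the class above.

```python
import unicodedata

# Hypothetical stand-in for NLTK's English stopword list
STOPWORDS = {"the", "a", "is", "and"}


def is_punct(token):
    # True only when every character is in a Unicode punctuation category
    return all(unicodedata.category(char).startswith("P") for char in token)


def filter_tokens(document):
    # document: list of sentences; sentence: list of (token, tag) pairs
    return [
        token.lower()
        for sentence in document
        for (token, tag) in sentence
        if not is_punct(token) and token.lower() not in STOPWORDS
    ]


doc = [
    [("The", "DT"), ("record", "NN"), ("is", "VBZ"), ("great", "JJ"), ("!", ".")],
    [("Drums", "NNS"), ("and", "CC"), ("bass", "NN"), ("...", ":")],
]
print(filter_tokens(doc))
```

Punctuation-only tokens and stopwords drop out, and everything that survives is lowercased; the real class additionally lemmatizes each token using its tag.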
I should have some time to work on the draft tomorrow, so I'll post the corpus reader and preprocessor then!
If you need any help with the coding, just let me know 😄
I know you have done all the analysis but I produced this so I could better understand the data and possible workflow
@lwgray nice! Would you be interested in taking a crack at pulling the prototype code that @bbengfort wrote into a new ClassBalanceHeatMap visualizer using the Yellowbrick API? It would be awesome to be able to reference the work-in-progress in my post, and then maybe you could do a follow-up post on creating a new Yellowbrick visualizer?
Here's the prototype code:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from sklearn.utils.multiclass import unique_labels

# _check_targets is a private scikit-learn helper and its module has moved
# between releases, so try the newer location first
try:
    from sklearn.metrics._classification import _check_targets
except ImportError:
    from sklearn.metrics.classification import _check_targets


def plot_class_balance_preds(y_true, y_pred, labels=None, ax=None):
    # Use Sklearn tools to validate the target
    # Note y_true and y_pred should already be label encoded
    # (slice to three values in case the private helper returns extra items)
    _, y_true, y_pred = _check_targets(y_true, y_pred)[:3]
    indices = unique_labels(y_true, y_pred)

    # Create a 2D numpy array where each row is one predicted class and
    # each column is one true class, so each row stacks onto the bars below
    data = np.array([
        [(y_pred[y_true == label_t] == label_p).sum() for label_t in indices]
        for label_p in indices
    ])

    # Ensure that the number of elements in data matches y_pred and y_true
    # Not necessary but used as a sanity check
    assert data.sum() == len(y_pred) == len(y_true)

    # indices is the class values present, labels is the string names
    # Another sanity check, this will not prevent missing classes, which is bad
    labels = labels if labels is not None else indices
    assert len(labels) == len(indices)

    # Create a matplotlib axis
    if ax is None:
        _, ax = plt.subplots()

    # Create a unique color for each predicted class
    # (cm.spectral was renamed nipy_spectral in matplotlib)
    colors = [cm.nipy_spectral(x) for x in np.linspace(0, 1, len(indices))]

    # Track the stack of the bar graph
    prev = np.zeros(len(labels))

    # Plot each row
    for idx, row in enumerate(data):
        ax.bar(indices, row, label=labels[idx], bottom=prev, color=colors[idx])
        prev += row

    # Make the graph pretty
    ax.set_xticks(indices)
    ax.set_xticklabels(labels)
    ax.set_xlabel("actual class")
    ax.set_ylabel("number of predicted class")

    # Put the legend outside of the graph
    ax.legend(bbox_to_anchor=(1.04, 0.5), loc="center left")
    ax.figure.tight_layout(rect=[0, 0, 0.85, 1])
    return ax
## Usage
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split as tts
from sklearn.ensemble import GradientBoostingClassifier
digits = load_digits()
X_train, X_test, y_train, y_true = tts(digits.data, digits.target, test_size=0.33)
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
g = plot_class_balance_preds(y_true, y_pred, labels=digits.target_names)
plt.show()
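As an aside for readers of the thread, it may help to see the numbers behind the chart without any plotting. Going by the axis labels in the prototype, each bar stands for one actual class and is split by what the model predicted. The sketch below builds that count table with only the standard library; `prediction_counts` and the toy labels are made up for illustration and are not part of the prototype or of Yellowbrick.

```python
from collections import Counter


def prediction_counts(y_true, y_pred):
    # Count how often each (actual, predicted) pair occurs
    pairs = Counter(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    # table[actual][predicted] -> number of instances
    return {
        actual: {predicted: pairs[(actual, predicted)] for predicted in classes}
        for actual in classes
    }


# Toy, hand-written labels (not real model output)
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

table = prediction_counts(y_true, y_pred)
for actual, row in table.items():
    print(actual, row)
```

Each inner dictionary is one bar: its values are the segment heights, and they sum to the number of instances of that actual class.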
If you're looking for a model for the Yellowbrick PR process, you can check out the one @bbengfort did that @ndanielsen and I are reviewing now.
FYI @marskar @lwgray @yzyzy - Still have some sections to flesh out, but you can see my draft in development here.
I will give it a shot. However, this is my first implementation of a visualizer. I might need your assistance if I get stuck. 😟
Awesome! Sure thing @lwgray -- definitely start by checking out the Visualizer API description in the docs. The best strategy is to open a pull request early that describes and scopes the task so that we maintainers can be ready and available to assist as needed. Look forward to seeing what you come up with!
@rebeccabilbro I read through your draft three times... I think it is fun and fairly thorough. The only question I have is, will you discuss changing your binning so that the classes are better balanced?
Also, do you think we could talk for 10 minutes at the March 14th meeting? I would like to understand some of your coding decisions.
@rebeccabilbro I read through the draft and think it's great. My question is this: it seems to me that you are hoping someone can build a ClassBalanceHeatmap visualizer so you can see, in a bar chart, where your predictions go wrong for the different classes. However, this doesn't seem to solve the underlying problem, which is that the sample is unbalanced from the beginning, when you assign the different scores to classes.
@rebeccabilbro I created a visualizer to help with balanced binning. You can create balanced bins based on the reference values the visualizer produces. You can check it out via the following link: https://github.com/Juan0001/yellowbrick-balanced-bin-reference/blob/master/balanced_binning.ipynb
I was trying to add the package to Yellowbrick, but I don't have permission. Could you help me? Thank you.
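For readers unfamiliar with the term, here is one rough way balanced binning can work: pick the bin edges from quantiles of the scores so that each bin receives about the same number of instances. The sketch below uses only the standard library. It assumes quantile-based cut points, and the names `balanced_bin_edges` and `assign_bin` and the scores are invented for illustration; it is not the code in the linked notebook.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles


def balanced_bin_edges(values, n_bins):
    # n_bins - 1 interior cut points that split the data into
    # groups of roughly equal size
    return quantiles(values, n=n_bins, method="inclusive")


def assign_bin(value, edges):
    # Index of the bin the value falls into, from 0 to len(edges)
    return bisect_right(edges, value)


# Made-up, skewed scores: most values cluster near the top of the range
scores = [1.0, 2.5, 6.0, 6.5, 7.0, 7.2, 7.5, 7.8, 8.0, 8.1, 8.4, 9.0]

edges = balanced_bin_edges(scores, n_bins=4)
counts = Counter(assign_bin(s, edges) for s in scores)
print(edges)
print(sorted(counts.items()))
```

With equal-width bins, most of these scores would pile into the top bins; quantile edges spread them evenly instead. Heavily tied scores can still leave the bins uneven, so the resulting class counts are worth checking.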
@Juan0001 very nice! We'd love to review your work and add it to Yellowbrick. The way we do this is to have you fork the Yellowbrick repository into your own GitHub account; once you've done that you can create a pull request so that we can go over your additions, and once approved we merge them into Yellowbrick.
Detailed instructions are here: Contributing to Yellowbrick, but of course we're happy to go over it with you tonight.
Ok, draft now published here.
@rebeccabilbro @Juan0001 To me it isn't obvious what the reader can get out of visiting Juan's Jupyter notebook. Maybe say something like "If you are looking for an automated way to create balanced bins, then check out this visualizer in the works".
@lwgray - I believe @Juan0001's notebook is in draft form and she's still working on fleshing out some of the possible use cases.
@rebeccabilbro Thank you very much! @lwgray You are right, the notebook is not very clear yet. It's the first version of the draft. In later versions I will explain in more detail why I created this function, what it can do, and how to use it. Please let me know if there's anything else I need to improve. Thank you very much.
I. Title: Better Binning for Classification Problems: Creating Categorical Values from Continuous Values
II. Premise
III. Dataset Intro
IV. Preliminary Text Analytics Pipeline
V. Tuning Bins with ClassBalance
VI. Conclusion/Teaser for New ClassBalanceHeatmap Visualizer
Reviewers: @marskar @lwgray @yzyzy