DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0

ConfusionMatrix asking for label in y_true even though integer list is provided #614

Closed: edesz closed this issue 6 years ago

edesz commented 6 years ago

Describe the issue

I am new to YellowBrick but am enjoying it so far - it's been really great and easy to pick up. I have a question about generating a Confusion Matrix.

I am using the built-in game dataset from the learning curve doc example for Classification and I am trying to generate a confusion matrix. I am using the same code from the Confusion Matrix example docs here:

# Generating dataset
data = load_data('game')

# Specify the features of interest and the target
target = "outcome"
features = [col for col in data.columns if col != target]

# Encode the categorical data with one-hot encoding
X = pd.get_dummies(data[features])
y = data[target]

# Get unique classes
classes = y.unique().tolist()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.1, random_state=42)

model = LogisticRegression()

# Convert unique classes (strings) into integers
classes = list(LabelEncoder().fit_transform(pd.Series(classes)))

# The ConfusionMatrix visualizer takes a model
cm = ConfusionMatrix(model, classes=classes)

# Fit fits the passed model.
cm.fit(X_train, y_train)

# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()

The y variable is a column of strings, so classes is a list of strings. I am using LabelEncoder from scikit-learn to convert this list to a list of integers (the new list is also named classes). This is similar to the ConfusionMatrix documentation example, where classes=[0,1,2,3,4,5,6,7,8,9]. I then pass the list of integers to the ConfusionMatrix visualizer.

When I run the above code, I get this error message:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-37-071c8432b394> in <module>()
     10 # To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
     11 # and then creates the confusion_matrix from scikit-learn.
---> 12 cm.score(X_test, y_test)
     13 
     14 # How did we do?

~/anaconda3_501/lib/python3.6/site-packages/yellowbrick/classifier/confusion_matrix.py in score(self, X, y, **kwargs)
    175         # Compute the confusion matrix and class counts
    176         self.confusion_matrix_ = confusion_matrix_metric(
--> 177             y, y_pred, labels=self.classes_, sample_weight=self.sample_weight
    178         )
    179         self.class_counts_ = self.class_counts(y)

~/anaconda3_501/lib/python3.6/site-packages/sklearn/metrics/classification.py in confusion_matrix(y_true, y_pred, labels, sample_weight)
    257         labels = np.asarray(labels)
    258         if np.all([l not in y_true for l in labels]):
--> 259             raise ValueError("At least one label specified must be in y_true")
    260 
    261     if sample_weight is None:

ValueError: At least one label specified must be in y_true

I see that it is ignoring the list of integer classes I provided. The Confusion Matrix example dataset runs fine with no error (also using a list of integers).

Do I need to provide another input in order to overcome this error?

Here are the details about the packages:

      jupyter notebook --version = 5.0.0
      jupyter lab --version = 0.27.0
      python --version = Python 3.6.3
      yellowbrick --version = 0.8

bbengfort commented 6 years ago

@edesz we were actually just working on the documentation for this feature, which you can read in the development documentation: Confusion Matrix: Plotting with Class Names.

When you fit with integer classes but specify class names, the ConfusionMatrix visualizer requires a mapping of integer to class name. You can give the visualizer a label_encoder which can either be a sklearn.preprocessing.LabelEncoder or it can be a python dictionary.

My suggestion for your code is as follows:

# Encode the categorical data with one-hot encoding
X = pd.get_dummies(data[features])

# Convert unique classes (strings) into integers
encoder = LabelEncoder()
y = encoder.fit_transform(data[target])

# Create test and train splits 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.1, random_state=42)

# The ConfusionMatrix visualizer takes a model
model = LogisticRegression()
cm = ConfusionMatrix(model, classes=encoder.classes_, label_encoder=encoder)

Alternatively, if you do not encode y and instead pass in string values, the LogisticRegression will take care of the encoding under the hood.
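The mismatch behind the original error can be reproduced without Yellowbrick. The following is a minimal sketch in plain Python, assuming a small made-up target list. The helper any_label_in_target is hypothetical and only mimics the check shown in the traceback (at least one label must occur in y_true), and the sorted-class encoding imitates the way LabelEncoder orders its classes_.

```python
def any_label_in_target(y_true, labels):
    """Mimic the check from the traceback: at least one label must occur in y_true."""
    return any(label in y_true for label in labels)


# Hypothetical string target, standing in for data["outcome"]
y_strings = ["win", "loss", "draw", "win"]

# Integer labels, as passed via classes= in the original report
int_labels = [0, 1, 2]

# Strings never equal integers, so no label is found and the check fails
print(any_label_in_target(y_strings, int_labels))  # False

# Encode the target itself; sorted unique values imitate LabelEncoder.classes_
class_names = sorted(set(y_strings))
name_to_int = {name: index for index, name in enumerate(class_names)}
y_encoded = [name_to_int[name] for name in y_strings]

# The encoded target now contains the integer labels, so the check passes
print(any_label_in_target(y_encoded, int_labels))  # True
```

The point of the sketch is that the labels and the values in y must be of the same kind: either both are the original strings, or the target itself is encoded to integers before fitting.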

edesz commented 6 years ago

Hi @bbengfort, many thanks for your reply! I had read this in the current docs, but I was trying it with only label_encoder=encoder and omitting the classes=encoder.classes_ part (which is required), since I incorrectly thought the label_encoder argument would be sufficient on its own. So I didn't try anything further. Thanks for the new documentation example - it definitely helps for this case.

Your reply makes sense and this definitely answers my question.

haseebkhan1421 commented 5 years ago

Hi, I'm new to Yellowbrick and trying to explore things. I have tried everything but can't work out why I am facing this issue when generating my confusion matrix. Only part of the code is shown below to give you an understanding of the problem:

from sklearn.model_selection import train_test_split
FeatureData_Train, FeatureData_Test, TargetData_Train, TargetData_Test = train_test_split(FeatureData,TargetData, test_size = 0.30, random_state = 10)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
neighbor=KNeighborsClassifier(n_neighbors=3)                     # Creating an Object of KNN Classifier
neighbor.fit(FeatureData_Train,TargetData_Train)                 # Training the model to classify

PredictionData=neighbor.predict(FeatureData_Test)                # Predicting the Response
print ("KNeighbors accuracy score : ",accuracy_score(TargetData_Test, PredictionData))

from yellowbrick.classifier import ConfusionMatrix

cm = ConfusionMatrix(neighbor, classes=['0','1'])

cm.fit(FeatureData_Train,TargetData_Train)

cm.score(FeatureData_Test,TargetData_Test)

Error:

C:\Users\Strat Com\PycharmProjects\IGN Review\venv\lib\site-packages\sklearn\metrics\classification.py:261: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if np.all([l not in y_true for l in labels]):
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-927-3a3d9e9d43f9> in <module>
----> 1 cm.score(FeatureData_Test,TargetData_Test)

~\PycharmProjects\IGN Review\venv\lib\site-packages\yellowbrick\classifier\confusion_matrix.py in score(self, X, y)
    172         # Compute the confusion matrix and class counts
    173         self.confusion_matrix_ = confusion_matrix_metric(
--> 174             y, y_pred, labels=self.classes_, sample_weight=self.sample_weight
    175         )
    176         self.class_counts_ = self.class_counts(y)

~\PycharmProjects\IGN Review\venv\lib\site-packages\sklearn\metrics\classification.py in confusion_matrix(y_true, y_pred, labels, sample_weight)
    260         labels = np.asarray(labels)
    261         if np.all([l not in y_true for l in labels]):
--> 262             raise ValueError("At least one label specified must be in y_true")
    263 
    264     if sample_weight is None:

ValueError: At least one label specified must be in y_true

Note: my target variable is already of type float, in the form of 1s and 0s, so no label encoding was required. The dataset file is attached: DataSet.txt

bbengfort commented 5 years ago

@haseebkhan1421 - thank you for your question and for using Yellowbrick! One of two errors is potentially happening here. First, are you using the latest version of Yellowbrick (v0.9)? If not, please run pip install -U yellowbrick; you can then use the solution discussed above:

cm = ConfusionMatrix(neighbor, classes=['0','1'], label_encoder={'0': 0, '1': 1})
cm.fit(FeatureData_Train,TargetData_Train)
cm.score(FeatureData_Test,TargetData_Test)

Note that classes is intended to give the figure nice class labels; you could also just omit it, e.g. ConfusionMatrix(neighbor). Does that work? Otherwise, you have to specify the label_encoder (above as a dict) in order to map the string labels to the value labels (you mentioned they're type float; generally the target should be type int).
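The type mismatch mentioned above can be illustrated in plain Python. This is a sketch under assumptions: the target values are made up, plain lists stand in for the NumPy arrays scikit-learn works with, and to_int_labels is a hypothetical helper, not part of Yellowbrick or scikit-learn.

```python
def to_int_labels(values):
    """Cast a numeric target to int, refusing values that are not whole numbers."""
    result = []
    for value in values:
        as_int = int(value)
        if as_int != value:
            raise ValueError(f"non-integral class value: {value!r}")
        result.append(as_int)
    return result


# Hypothetical float target, standing in for TargetData_Test
target = [1.0, 0.0, 0.0, 1.0]

# String class names never compare equal to numeric target values
print(any(name in target for name in ["0", "1"]))  # False

# After casting, the target holds the integers 0 and 1
print(to_int_labels(target))  # [1, 0, 0, 1]
```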

The second error is that you're actually in the situation that scikit-learn is warning about. This error occurs in scikit-learn if one of the classes is not represented in TargetData_Test. Usually, this is because the data is ordered and the train_test_split is not shuffling the data, or because there is a class balance issue.

My first suggestion would be to determine the class balance in your training data:

from yellowbrick.target import ClassBalance

oz = ClassBalance()
oz.fit(TargetData_Train, TargetData_Test)
oz.poof()

If one of the classes is missing in either the train or test splits, then this is where the error is occurring. You should be able to fix the problem by shuffling your data or using StratifiedKFold.
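As a quick non-visual complement to ClassBalance, the same question (is any class absent from a split?) can be answered with the standard library. This is only a sketch: missing_classes is a hypothetical helper, and the ordered target below is made up to show what an unshuffled split can do.

```python
from collections import Counter


def missing_classes(y_train, y_test):
    """Report which classes seen in either split are absent from the other."""
    train_counts = Counter(y_train)
    test_counts = Counter(y_test)
    all_classes = set(train_counts) | set(test_counts)
    return {
        "train": sorted(all_classes - set(train_counts)),
        "test": sorted(all_classes - set(test_counts)),
    }


# Hypothetical ordered target: all 0.0 rows come before all 1.0 rows
y = [0.0] * 7 + [1.0] * 3

# Splitting without shuffling leaves each split with a single class
y_train, y_test = y[:7], y[7:]

print(missing_classes(y_train, y_test))  # {'train': [1.0], 'test': [0.0]}
```

If a class does turn out to be missing from one split, passing stratify=TargetData to scikit-learn's train_test_split keeps the class proportions the same in both splits.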

bbengfort commented 5 years ago

Solution is posted on Stack Overflow:

https://stackoverflow.com/questions/54646168/error-creating-confusion-matrix-although-label-encoding-done-on-target-variable