ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0

Ch 3: Multi-Class Model Evaluation Methods and the "Unknown" Class #456

Closed genemishchenko closed 5 years ago

genemishchenko commented 5 years ago

Hi Aurelien.

I have a couple of questions on the Classification material you presented in the book and in the Jupyter notebook:

1) You have outlined when one should go with the P/R model evaluation versus ROC... But why go with ROC in the first place?

E.g., what is so special about it that there is an ROC AUC function in sklearn but no PR AUC? After all, the concepts of precision and recall (i.e., sensitivity) seem more intuitive.

2) Is there a particular reason why you stopped at the accuracy scoring method when talking about the evaluation of multi-class models? (The next section already moves on to the more qualitative analysis of the confusion matrix.)

Precision/recall and ROC analysis is definitely possible in multi-class use cases and is covered in the Scikit-Learn documentation. Some additional work is needed, but I think it's well worth it, since low accuracy on one of many classes is especially easy to miss in a multi-class use case.

3) Would you agree that it's always better to have the "unknown" class whenever the use case allows it?

I think it's important to have because:

- it explicitly covers (and makes traceable) the cases when there is malicious or garbage input (e.g. a smiley face instead of a numeric digit);
- it improves the accuracy of a classifier for the "known" classes (there is some literature out there about it);
- it is a necessary part of many use cases (e.g. asking a user for check amount verification before making an ATM check deposit, instead of going with the best-effort amount recognition);
- we must have it if we want to balance precision versus recall in the multi-class use cases, because we need a "drain" for the negative decisions on each class.

On the last point specifically, I have implemented setting custom thresholds in a multi-class use case before and will gladly share the solution. I do realize you have had a lot of work on the second edition of the book (which I have already pre-ordered :)

Thank you. Gene.

ageron commented 5 years ago

Hi Gene,

Thanks for your interesting message (and my apologies for the late response, I was on vacation).

1) ROC vs PR I suppose you are referring to the note at the end of the ROC curve section:

Since the ROC curve is so similar to the precision/recall (or PR) curve, you may wonder how to decide which one to use. As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives. Otherwise, use the ROC curve. For example, looking at the previous ROC curve (and the ROC AUC score), you may think that the classifier is really good. But this is mostly because there are few positives (5s) compared to the negatives (non-5s). In contrast, the PR curve makes it clear that the classifier has room for improvement (the curve could be closer to the top-right corner).

I usually use PR, as I find it easier to interpret. Moreover, ROC is not well suited when the dataset is skewed. But if the dataset is balanced, and if false negatives and false positives are just as bad for your application, then the ROC AUC might be better, since it penalizes them both equally, while the PR AUC is more penalized by the false positives (you may want to double-check this).
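As a side note on the sklearn question: there is in fact a PR-AUC counterpart, average_precision_score, which summarizes the precision/recall curve. Here is a quick illustrative sketch with synthetic data (not code from the book), just to show the two metrics side by side on a skewed dataset:

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic skewed data, just for illustration: ~5% positives, overlapping scores.
rng = np.random.RandomState(42)
y_true = (rng.rand(1000) < 0.05).astype(int)
y_scores = rng.normal(loc=y_true.astype(float), scale=1.0)  # positives score a bit higher

# With such a rare positive class, the ROC AUC typically looks much more
# flattering than the PR AUC (average precision) for the same scores.
print("ROC AUC:", roc_auc_score(y_true, y_scores))
print("PR AUC :", average_precision_score(y_true, y_scores))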

2) Why use accuracy for multiclass? Just to simplify the chapter a bit, but you're right, using PR in multiclass is doable and usually beneficial.

3) Use unknown class for multiclass? I have seen this a few times for semantic segmentation, where you may want a "background" class that covers many different types of possible background objects. However, I personally don't add an unknown class for classification tasks. I assume it really depends on the dataset: if there is a great diversity of unknown objects, it probably won't be very useful, but I may be wrong.

it explicitly covers (and makes traceable) the cases when there is malicious or garbage input (e.g. a smiley face instead of a numeric digit);

Indeed, that's one way to do it. However, the way I usually handle this is to look at the logit scores (before the softmax function): if they are too low, then (depending on the task) I may assume that the model is not sure enough.

it improves the accuracy of a classifier for the "known" classes (there is some literature out there about it);

Interesting. Do you have pointers to specific papers? I'd love to learn more about this. In particular, I would like to see this approach compared to the approach I described above (i.e., using a threshold for the logit scores).

it is a necessary part of many use cases (e.g. asking a user for check amount verification before making an ATM check deposit, instead of going with the best-effort amount recognition); we must have it if we want to balance precision versus recall in the multi-class use cases because we need a "drain" for the negative decisions on each class.

Yes, in many applications it's better to say "unsure" than to just pick the class with the highest estimated probability. I usually do this with the logit score threshold, but using an "unknown" class may work better, especially if there is a low diversity of unknown objects.
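For example, here is a minimal sketch of the single-threshold version (the threshold value and the logits below are made up; in practice you would tune the threshold on a validation set):

import numpy as np

threshold = 5.0  # made-up value; tune it on a validation set
y_logits = np.array([[9., 1., 2.], [2., 1., 3.]])  # example model outputs (logits)
y_pred = y_logits.argmax(axis=1)
y_pred[y_logits.max(axis=1) < threshold] = -1  # class -1 means "unsure"
print(y_pred)  # prints [ 0 -1]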

genemishchenko commented 5 years ago

Hi Aurelien. Thank you for a very informative reply.

Here's one post (it was actually in my bookmarks back from my NLP days) that quantifies the accuracy improvement when adding the "neutral" class in the text sentiment analysis use case: The importance of Neutral Class in Sentiment Analysis

Question: will the logit score threshold method that you suggest allow us to set the different confidence thresholds for each class individually based on the PR analysis?

The way I accomplished this is:

ageron commented 5 years ago

Hi @genemishchenko , That's very interesting, thanks for the link and your feedback. 👍

Question: will the logit score threshold method that you suggest allow us to set the different confidence thresholds for each class individually based on the PR analysis?

I've never used different thresholds for each class, but I suppose it could work. Here's a simple example, assuming there are 3 classes, using thresholds 10., 20., 30., respectively:

import numpy as np
thresholds = np.array([10., 20., 30.])
y_logits = np.array([[9., 1., 2.], [10., 5., 40.]]) # example output of model.predict(...)
y_max_logit = y_logits.max(axis=1)
y_pred = y_logits.argmax(axis=1)
y_threshold = thresholds[y_pred]
y_pred[y_max_logit < y_threshold] = -1  # class -1 means "unsure"
print(y_pred) # prints [-1  2]

One problem I see is what to do if the logits are [15., 16., 17.]. With the above code, class 2 would be selected since it has the max logit, then it would be eliminated because 17 is below its threshold of 30. This seems reasonable to me: if the most likely class is not confident enough, we should say we're unsure. However, one might argue that class 0 should be selected instead, since it is the only one above its threshold (15 >= 10).

Hope this helps.

genemishchenko commented 5 years ago

Right... So the way I implemented the multi-threshold selection in the multi-class use case is by first eliminating the scores that fall below their corresponding thresholds BEFORE trying to find the max score for each instance. Then, in the cases when ALL the "known" class scores are eliminated, the "unknown" class score is the only one that remains, because its threshold is always set artificially low.
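Roughly, a sketch of that masking step (the thresholds, the scores and the extra "unknown" column below are made up for illustration; the "unknown" class is the last column):

import numpy as np

# Per-class thresholds; the last one belongs to the "unknown" class and is set
# artificially low so that its score always survives the masking step.
thresholds = np.array([10., 20., 30., -1e9])
y_logits = np.array([[ 9.,  1.,  2., 0.],   # every "known" score is below its threshold
                     [10.,  5., 40., 0.]])  # class 2 clears its threshold
masked = np.where(y_logits >= thresholds, y_logits, -np.inf)  # eliminate sub-threshold scores first...
y_pred = masked.argmax(axis=1)                                # ...then take the max of what remains
print(y_pred)  # prints [3 2] -> instance 0 falls back to the "unknown" class (index 3)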

In summary on my original question 3 about the "unknown" class, it will probably still remain my personal favorite method because:

In summary on my original question 1 on PR vs ROC:

ROC AUC might be better, since it penalizes them both equally, while the PR AUC is more penalized by the false positives

This is very useful new information for me. Thank you. I don't think it's in the book, though.

In summary on my original question 2 on why stop at the accuracy evaluation for multiclass models:

Just to simplify the chapter a bit, but you're right, using PR in multiclass is doable and usually beneficial.

Completely understood. But I think you are contradicting yourself a bit: in the context of the binary use case you say that overall accuracy is not really a good evaluation method, and in the context of the multiclass use case you stop at it. The Scikit-Learn documentation has excellent information on how to perform PR and ROC evaluation in multi-label settings, and I think the references at least could be really beneficial: "Precision-Recall in multi-label settings" and "Receiver Operating Characteristic in multiclass settings".
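For reference, here is a minimal one-vs-rest sketch along the lines of those docs (the 3-class labels and scores below are made up for illustration):

import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import precision_recall_curve, average_precision_score, roc_auc_score

# Made-up 3-class problem, just to show the one-vs-rest mechanics.
rng = np.random.RandomState(42)
y_true = rng.randint(0, 3, size=200)
y_scores = rng.rand(200, 3) + np.eye(3)[y_true]  # noisy scores, higher for the true class

Y = label_binarize(y_true, classes=[0, 1, 2])  # one binary column per class (one-vs-rest)
for k in range(3):
    precisions, recalls, _ = precision_recall_curve(Y[:, k], y_scores[:, k])
    # precisions/recalls can be plotted per class; here we just print the summary scores
    print("class", k,
          "PR AUC:", round(average_precision_score(Y[:, k], y_scores[:, k]), 3),
          "ROC AUC:", round(roc_auc_score(Y[:, k], y_scores[:, k]), 3))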

ageron commented 5 years ago

Hi @genemishchenko ,

Thanks for your insights!

You make some very good points. My only worry about the "unknown" class is when there is a wide variety of unknowns. For example, suppose you want to classify mushrooms, but you only care about 3 of them (classes 1, 2, 3). The training set will contain many labeled pictures of the 3 mushrooms, plus a number of other mushrooms, classified as "unknown". Suppose the training set contains 10 unknown types of mushrooms, but in real life there are hundreds of other mushroom types, many of which are very different from the 10 unknown types in the training set (and some of which are even vaguely similar to the 3 known mushrooms).

Now suppose you show your model a really unknown mushroom (i.e., one whose type is not even in the training set): it will give low scores to all classes, including the unknown class. The "unknown" class is a bit more likely to get the best (low) score, but it really depends on the task. If the really unknown mushroom looks even a little bit like one of the known mushrooms, say class 1, then that class will get a higher score than the "unknown" class (and the other classes), and you may end up with a highly confident but wrong classification.

If instead you used the threshold approach, you would not need many unknown mushrooms in the training set, and in the example above you would get a low score for classes 1, 2, 3, hopefully all below the threshold. Does that make sense?

Regarding the choice of ROC vs PR, the book contains this tip: "Since the ROC curve is so similar to the precision/recall (or PR) curve, you may wonder how to decide which one to use. As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives". That said, I think I remember adding this tip after the 1st release of the book, perhaps around release 6 or 7 (I get to add or fix small things every time there is a reprint). You can see the release you have on the page immediately before the table of contents.

Regarding the use of accuracy, I think it's a perfectly valid metric as long as the dataset is not skewed, i.e., as long as all classes are roughly equally likely. In the binary classification example, the dataset was very skewed (10% positive, 90% negative), but in the multiclass case it's not. I will add a comment to make this clear, thanks for your feedback! :)
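To illustrate with a made-up skewed example (not from the book): a classifier that always predicts the majority class already reaches 90% accuracy, even though it misses every positive instance.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up skewed dataset: 10% positives, 90% negatives (features are irrelevant here).
X = np.zeros((1000, 1))
y = np.array([1] * 100 + [0] * 900)

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)
print(accuracy_score(y, y_pred))  # 0.9 -> looks good...
print(recall_score(y, y_pred))    # 0.0 -> ...but it never detects a positive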

Cheers, Aurélien

genemishchenko commented 5 years ago

Hi Aurélien,

That makes complete sense. I see now that having the "unknown/other" class is not a good idea, unless we're willing to actually get a reasonably representative training set for it from the real world. Otherwise, it's better to skip it.

Thank you! Gene.