ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.

Ch 04: Softmax Regressor vs Multiple OvR Logit Regressors #316

Open · genemishchenko opened this issue 3 years ago

genemishchenko commented 3 years ago

Hi.

While the book makes it very clear how to create a multinomial logistic (softmax) regressor instead of the default composite model built from several OvR binary logistic regressors in the multiclass classification scenario (LogisticRegression(multi_class="multinomial"); the two variants are sketched after this list), what is NOT clear is:

  1. which one is more accurate in general (in the scikit-learn example discussed below, the softmax regressor is clearly more accurate)
  2. which one has better performance
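
For context, a minimal sketch of the two variants in question (the solver choice is an assumption; lbfgs supports both modes):

```python
from sklearn.linear_model import LogisticRegression

# Softmax (multinomial) logistic regression: a single joint model over all classes
softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs")

# Composite model: one OvR binary logistic regressor per class
ovr_reg = LogisticRegression(multi_class="ovr", solver="lbfgs")
```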

It would be greatly appreciated if this clarification were added either to the book or to the Jupyter notebook.

Thanks. Gene.

Praful932 commented 3 years ago

Hi @genemishchenko, I haven't come across any such cases, but I would like to add: when we use OvR binary logistic regressors, we have to train a separate classifier for each class on the full training set. If the number of classes is large (say, 1000) and the dataset is humongous, it is preferable to use the softmax one. This is also one of the reasons why, if we were to create a classifier using an SVM, we would go for OvO rather than OvR; both strategies are sketched below.
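
A minimal sketch of the two strategies using scikit-learn's generic wrappers (the base estimators and their default parameters here are illustrative assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# OvR: one binary classifier per class, each trained on the FULL training set
ovr_logit = OneVsRestClassifier(LogisticRegression())

# OvO: one classifier per PAIR of classes (N * (N - 1) / 2 models), but each
# one is trained only on the instances of its two classes, which is why this
# strategy suits SVMs, whose training cost grows quickly with the set size
ovo_svm = OneVsOneClassifier(SVC())
```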

genemishchenko commented 3 years ago

Thank you, @Praful932 . So you can't comment on the accuracy of softmax versus OvR logistic?

Praful932 commented 3 years ago

@genemishchenko No, it eventually comes down to the use case: whichever performs better in practice when you consider all the relevant metrics (precision, recall, F1 score). You could try out both of these on your problem :) One way to compare them is sketched below.
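
For example, a minimal sketch of such a comparison via cross-validation, assuming X and y are your feature matrix and label vector:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict

for multi_class in ("multinomial", "ovr"):
    clf = LogisticRegression(multi_class=multi_class, solver="lbfgs")
    # Out-of-fold predictions give a fair view of each strategy's behavior
    y_pred = cross_val_predict(clf, X, y, cv=5)
    print(multi_class)
    print(classification_report(y, y_pred))  # per-class precision/recall/F1
```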

genemishchenko commented 3 years ago

I think that OvR Logit is much more limited than Softmax in how it can set the decision boundaries between multiple classes. This is not at all obvious to a beginner, and it should be stated in the book (sorry if I missed it, but I don't think I did).

For the "edge" classes (for which the typical feature values in combination make the instances stand out) the accuracy of Softmax and of OvR Logit is comparable. This is why in a binary classification setting Softmax has no advantage really. For any of the "middle" classes (for which the distinction just by the feature values is not really there) the OvR Logit accuracy plummets. And the more classes there are with fewer features - the more obvious this becomes (that is not to say that Softmax is limitless).

I took the example from the scikit-learn documentation that I mentioned earlier and produced the confusion matrices for the Softmax and the OvR Logit classifiers, in addition to the instance-level plotting and the average accuracy calculation that was already there. Below are (1) the scatter plot of the instances with the color-coded labels (just to show that I have the same data as in the example) and (2) the normalized confusion matrices for Softmax and for OvR Logit:

[images: scatter plot of the three color-coded classes; normalized confusion matrices for Softmax and for OvR Logit]
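
For reproducibility, a sketch of how those matrices can be computed. The data generation is meant to follow the scikit-learn multinomial-vs-OvR logistic regression example, but the exact blob centers, transformation, and solver here are from memory, so treat them as assumptions:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Three blobs, then a linear transformation so the classes are not axis-aligned
centers = [[-5, 0], [0, 1.5], [5, -1]]
X, y = make_blobs(n_samples=1000, centers=centers, random_state=40)
X = np.dot(X, [[0.4, 0.2], [-0.4, 1.2]])

for multi_class in ("multinomial", "ovr"):
    clf = LogisticRegression(multi_class=multi_class, solver="lbfgs").fit(X, y)
    cm = confusion_matrix(y, clf.predict(X))
    # Row-normalize so each row shows the recall of one true class
    print(multi_class)
    print(np.round(cm / cm.sum(axis=1, keepdims=True), 2))
```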

This is not a hard classification problem - the instances from different classes are barely mixed, so it's reasonable to expect very high accuracy for all the classes. But OvR Logit performs relatively poorly for the "middle" class. In real life it's called "stumbling on the even surface".

Going back to the book: if someone tried OvR Logit on the last example in Chapter 4 (the three-class classification of the Iris species), I am willing to bet any amount of money that the accuracy for Iris versicolor (the "middle" class) would not be good. A quick way to test that is sketched below.
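
A minimal sketch for checking this, mirroring the Chapter 4 setup (petal length and width as features; C=10 and random_state=42 match the book's softmax example, and reusing them for the OvR variant is an assumption made for a like-for-like comparison):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width, as in Chapter 4
y = iris.target

for multi_class in ("multinomial", "ovr"):
    clf = LogisticRegression(multi_class=multi_class, solver="lbfgs",
                             C=10, random_state=42).fit(X, y)
    cm = confusion_matrix(y, clf.predict(X))
    # Row 1 of the normalized matrix is Iris versicolor, the "middle" class
    print(multi_class)
    print(np.round(cm / cm.sum(axis=1, keepdims=True), 2))
```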