ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.

Theory Question: Classification #431

Closed. VirajVaitha123 closed this issue 2 years ago.

VirajVaitha123 commented 5 years ago

Hi,

I have become slightly confused about some of the different classification techniques.

I understand that some techniques, such as Random Forests or Naive Bayes, can handle multiclass classification directly, whereas Logistic Regression is meant for binary classification. However, Scikit-Learn can automatically use an OvA strategy when multiple classes are detected.

Alternatively, Softmax Regression can be used to handle multiple classes directly (without training several binary classifiers), but let's leave that one for now.

Now, SGDClassifier uses a different solver for classification (it still uses an OvA strategy, though it can be wrapped in an OvO classifier if a change of strategy is required). SGDClassifier is more useful for large datasets.
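
For reference, a minimal sketch of that OvO wrapping might look like the following (the iris dataset and the random_state value are just illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier

X, y = load_iris(return_X_y=True)

# Left on its own, SGDClassifier handles the 3 iris classes directly.
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X, y)

# To force a One-vs-One strategy instead, wrap it in OneVsOneClassifier,
# which trains one binary SGDClassifier per pair of classes.
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X, y)
print(len(ovo_clf.estimators_))  # 3 binary classifiers for 3 classes
```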

What is the solver for SGDClassifier vs. Logistic Regression? Can someone clarify the difference between the two? Do LogisticRegression and SGDClassifier use the same cost function but different solvers? If so, which solvers?

Thank you,

ageron commented 5 years ago

Hi @VirajVaitha123 ,

Thanks for your question. The SGDClassifier implements Softmax Regression when there are multiple classes: it just minimizes the cross-entropy loss using Stochastic Gradient Descent (it does not perform OvA or OvO). The only place where it uses OvA is in the predict_proba() and predict_log_proba() methods, to estimate the class probabilities (based on the decision scores returned by the decision_function() method). For more details, check out the predict_proba() method in the documentation for the SGDClassifier class, and the Zadrozny and Elkan paper it points to: "Transforming classifier scores into multiclass probability estimates", SIGKDD'02.
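
For example, something along these lines (using the iris dataset just for illustration; note that in recent Scikit-Learn versions the loss is spelled "log_loss" rather than "log"):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)

# loss="log" makes SGDClassifier minimize the log loss (cross-entropy)
# with Stochastic Gradient Descent; in Scikit-Learn >= 1.1 this loss is
# named "log_loss" instead.
sgd_log_clf = SGDClassifier(loss="log", random_state=42)
sgd_log_clf.fit(X, y)

print(sgd_log_clf.decision_function(X[:1]))  # raw per-class decision scores
print(sgd_log_clf.predict_proba(X[:1]))      # probabilities derived from them
```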

The LogisticRegression class currently defaults to OvR (== OvA), but in Scikit-Learn 0.22 it will default to minimizing the cross-entropy loss as well, using the lbfgs solver. To be precise, the default multi_class strategy is currently "ovr", but it will switch to "auto" in Scikit-Learn 0.22. When this happens, the multiclass strategy will be OvR if the solver is "liblinear", or "multinomial" otherwise. The default solver will also change in 0.22, from "liblinear" to "lbfgs". For more details, check out the multi_class and solver hyperparameters in the documentation for the LogisticRegression class.
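
Roughly, the two configurations look like this (the iris dataset is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# The pre-0.22 default: one binary logistic regression per class (OvR),
# trained with the liblinear solver.
ovr_log_reg = LogisticRegression(multi_class="ovr", solver="liblinear")
ovr_log_reg.fit(X, y)

# Softmax Regression: a single multinomial model minimizing the
# cross-entropy loss, using the lbfgs solver (the 0.22+ defaults).
softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs")
softmax_reg.fit(X, y)
```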

Hope this helps!

VirajVaitha123 commented 5 years ago

Thank you :)

VirajVaitha123 commented 5 years ago

I believe there is a small contradiction on page 96, where it says that when a binary classifier is used for multiclass classification, Scikit-Learn automatically runs OvA (apart from SVMs).

I checked the documentation and it seems these are the permutations that can be used:

- LogisticRegression can be set to either OvR, or to Softmax if "multinomial" is selected with a compatible solver (not SGD).
- SGDClassifier can be set to use the log loss (i.e. Logistic Regression) optimized with SGD, and it switches to OvA (OvR) if multiple classes are detected.

Generally, I'd experiment with all of these combinations in an ML project anyway, if necessary.
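
For example, a rough sketch for comparing the combinations above (the iris dataset, the 3-fold cross-validation, and the exact hyperparameters are just assumptions for illustration; in Scikit-Learn 1.1+ the loss is spelled "log_loss"):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# The combinations discussed above, kept in a dict so they can all be
# evaluated with the same cross-validation loop.
candidates = {
    "log_reg_ovr": LogisticRegression(multi_class="ovr", solver="liblinear"),
    "log_reg_softmax": LogisticRegression(multi_class="multinomial", solver="lbfgs"),
    "sgd_log_loss": SGDClassifier(loss="log", random_state=42),
}

for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")
    print(name, scores.mean())
```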

ageron commented 5 years ago

Thanks @VirajVaitha123, I'll double-check this section; perhaps the default hyperparameters changed since I wrote it? If so, I'll fix it. Thanks again.