dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

LogisticRegression's predict_proba(X) returns (n,) instead of (n,2) for binary class #386

Open vincentvdp opened 5 years ago

vincentvdp commented 5 years ago

Just like sklearn's predict_proba(X), dask-ml's documentation says it returns an "array-like, shape = [n_samples, n_classes]". However, for a binary classifier it actually returns an array of shape (n_samples,). E.g. run this:

import numpy as np
from sklearn.linear_model import LogisticRegression
from dask_ml.linear_model import LogisticRegression as DaskLogit

# Data
X = np.random.rand(200, 20)       # 200 samples, 20 features
y = np.random.rand(200).round(0)  # random binary labels

# Dask 
lrdask = DaskLogit()
lrdask.fit(X, y)
print("Logit Reg 'predict_proba' shape Dask\t", lrdask.predict_proba(X).shape)

# SKLearn
lrsklr = LogisticRegression()
lrsklr.fit(X, y)
print("Logit Reg 'predict_proba' shape SKLearn\t", lrsklr.predict_proba(X).shape)

Of course, no information is lost, but the shape mismatch does break other code that consumes the output. For example, in GridSearchCV's scorer I get:

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\scorer.py in __call__(self, clf, X, y, sample_weight)
    184 
    185                 if y_type == "binary":
--> 186                     y_pred = y_pred[:, 1]
    187                 elif isinstance(y_pred, list):
    188                     y_pred = np.vstack([p[:, -1] for p in y_pred]).T

IndexError: too many indices for array

predict_proba(X)'s return value is indeed implemented differently in scikit-learn and dask-ml:

scikit-learn:

def _predict_proba_lr(self, X):
    ...
    return np.vstack([1 - prob, prob]).T

dask-ml:

def predict_proba(self, X):
    ...
    return sigmoid(dot(X_, self._coef))

Is this intentional, and would changing it break code that currently expects dask-ml to return (n,) rather than (n, 2)? Or is this an oversight, and should it return (n, 2) as the documentation says?
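In the meantime, one possible user-side workaround for the scorer error above is a thin wrapper like the sketch below. The subclass name is made up, and the sketch assumes predict_proba currently returns P(y == 1) as a 1-D array:

import numpy as np
from dask_ml.linear_model import LogisticRegression as DaskLogit

class TwoColumnLogit(DaskLogit):
    # Hypothetical wrapper, not part of dask-ml: restack the (n_samples,)
    # probabilities into the (n_samples, 2) layout scikit-learn scorers expect.
    def predict_proba(self, X):
        prob = np.asarray(super().predict_proba(X))  # P(y == 1), shape (n_samples,)
        return np.stack([1 - prob, prob], axis=1)    # shape (n_samples, 2)

(For large dask arrays you would probably want dask.array.stack here instead of np.stack, to keep the result lazy.)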

TomAugspurger commented 5 years ago

You're correct. This should match scikit-learn's behavior.

atyamsriharsha commented 5 years ago

@TomAugspurger I want to work on this issue. Are there any inputs you want to provide before I tackle it? Thanks

TomAugspurger commented 5 years ago

Nope, nothing beyond we should match scikit-learn.

atyamsriharsha commented 5 years ago

@TomAugspurger As I go through the code, I see there is no implementation of multinomial logistic regression. Is someone working on it, or should I create a separate issue and work on that?

Collonville commented 5 years ago

Is there any update on this issue? I'm running into the same predict_proba behaviour difference between scikit-learn and dask-ml's logistic regression.

TomAugspurger commented 5 years ago

Still open, if you're interested in working on it.

IIUC, there are two issues:

  1. For binary logistic regression, reshape the (N,) output to (N, 2).
  2. Implement multinomial logistic regression.

The first sounds easier, so we might want to start with that. The relevant code is around https://github.com/dask/dask-ml/blob/master/dask_ml/linear_model/glm.py#L248
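For what it's worth, the change there could look roughly like the sketch below, mirroring scikit-learn's _predict_proba_lr. This is untested and elides the rest of the method just like the snippets above (da is dask.array, assuming the probabilities come back as dask arrays; np.vstack would do the same for NumPy inputs):

def predict_proba(self, X):
    ...
    prob = sigmoid(dot(X_, self._coef))   # current output, shape (n_samples,)
    # stack P(y == 0) and P(y == 1) to get (n_samples, 2), as scikit-learn does
    return da.vstack([1 - prob, prob]).T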

rikturr commented 3 years ago

I'm interested in working on this issue, and noticed there aren't any PRs associated with it currently (just for what's listed in the issue title, not the multinomial implementation). I'll move forward unless I hear otherwise or see a PR!