dask / dask-xgboost

Predict() Method Always Returns 1 (Binary Classification) #62

Open kylejn27 opened 4 years ago

kylejn27 commented 4 years ago

Calling dxgb.XGBClassifier's predict method always returns 1, regardless of the predict_proba (sigmoid) output. See the minimal example below, where every target is 0. The model learns that it should predict 0 (the probabilities are low), yet every prediction comes back as 1.

Note: you cannot pass a threshold parameter into .predict(), another notable gap.

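import dask_xgboost as dxgb
from dask.distributed import Client
import dask.array as da
import numpy as np

# Start a local Dask cluster
client = Client()

# Random integer features; every target is 0
X = np.random.randint(1, 5, (10, 2))
y = np.zeros(10)

X = da.from_array(X)
y = da.from_array(y)

# n_estimator is spelled as originally posted; xgboost's parameter is n_estimators
model = dxgb.XGBClassifier(n_estimator=5)
model.fit(X, y)

sigmoids = model.predict_proba(X).compute()  # class probabilities
preds = model.predict(X).compute()  # predicted labels

print(sigmoids, preds)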

Output: (First list is sigmoids, second list is predictions)

[0.10914253 0.10914253 0.10914253 0.10914253 0.10914253 0.10914253
 0.10914253 0.10914253 0.10914253 0.10914253] [1 1 1 1 1 1 1 1 1 1]

The behavior stems from line 537 of core.py:

            cidx = (class_probs > 0).astype(np.int64)

Here, any one-dimensional class probability greater than 0 is mapped to 1. Because a sigmoid output is always positive, every prediction comes out as 1. The fix is easy: add a threshold parameter that replaces the hard-coded 0 with a float, and default it to 0.5.
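For illustration, here is a minimal NumPy-only sketch of the proposed change. The helper name labels_from_probs and its threshold argument are hypothetical and not part of dask-xgboost; the probabilities are the sigmoid values reported above.

```python
import numpy as np


def labels_from_probs(class_probs, threshold=0.5):
    """Hypothetical helper: map binary class probabilities to 0/1 labels."""
    class_probs = np.asarray(class_probs)
    # Compare against the threshold instead of the hard-coded 0
    return (class_probs > threshold).astype(np.int64)


# Sigmoid outputs from the example in this issue
probs = np.full(10, 0.10914253)

# Comparison quoted from core.py: every positive probability becomes 1
buggy = (probs > 0).astype(np.int64)

# Thresholded comparison: low probabilities become 0
fixed = labels_from_probs(probs)

print(buggy, fixed)
```

Until something like this lands, the same comparison can probably be applied by hand to the output of predict_proba, for example (model.predict_proba(X) > 0.5).astype(int), since Dask arrays support elementwise comparison and astype.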

kylejn27 commented 4 years ago

@mmccarty