dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Errant pre-compute using `xgboost.dask.DaskXGBClassifier` and `RandomizedSearchCV` #758

Open gforsyth opened 3 years ago

gforsyth commented 3 years ago

With xgboost 1.2 and the current master of dask-ml, passing a DaskXGBClassifier() to RandomizedSearchCV fails because fit_and_score receives NumPy arrays instead of Dask arrays.

I'm still trying to track down where the errant compute happens. It may well be on the xgboost side; I'm raising it here for awareness.

Before fit_and_score is called, the graph is updated with the keys for X and y, but when the graph is executed, NumPy arrays are retrieved instead of Dask arrays (possibly because of an errant client.get on the xgboost side).

What happened: The notebook shows a series of "key has failed" messages, and the terminal repeats this error:

distributed.worker - WARNING -  Compute Failed
Function:  fit_and_score
args:      (DaskXGBClassifier(), <dask_ml.model_selection.methods.CVCache object at 0x7faab0335340>, array([[-2.27450463,  2.11465447,  1.22691127, ..., -0.88190765,
         0.09848558,  0.41462044],
       [ 0.43908491,  0.2100192 , -0.18272626, ..., -1.49498831,
         1.07761167,  0.20180108],
       [ 0.83336784,  2.68069819,  0.8442648 , ..., -0.77917656,
        -0.1129625 , -1.29956418],
       ...,
       [-0.08958046,  1.04839125,  0.62148611, ..., -1.2469329 ,
         0.94976797,  0.56911496],
       [ 1.08962267,  0.78895839,  1.16558679, ...,  0.71315066,
         1.04262422, -0.97562332],
       [-0.29380091, -1.05777127,  0.6632089 , ..., -1.2499148 ,
         0.1178788 ,  0.22684799]]), array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1,

kwargs:    {}
Exception: TypeError("Expecting <class 'dask.dataframe.core.DataFrame'> or <class 'dask.array.core.Array'>.  Got <class 'numpy.ndarray'>")

What you expected to happen: I expect to get back a trained booster.

Minimal Complete Verifiable Example:

from distributed.client import Client
from dask_ml.model_selection import RandomizedSearchCV
from dask_ml.datasets import make_classification
from xgboost.dask import DaskXGBClassifier
import xgboost
import dask_ml

print(f"{xgboost.__version__=}")
print(f"{dask_ml.__version__=}")

# Start a local distributed cluster
c = Client()

# Search space with a single parameter combination
param_distributions = {
    "max_depth": [5],
    "min_child_weight": [10],
    "learning_rate": [0.05],
}

# X and y are Dask arrays with 100 rows per chunk
X, y = make_classification(n_samples=1000, n_features=20, chunks=(100, 20))

estimator = DaskXGBClassifier()

estimator

clf = RandomizedSearchCV(estimator, param_distributions)

# Fails: fit_and_score receives NumPy arrays instead of Dask arrays
clf.fit(X, y)

xgboost.__version__='1.2.0'
dask_ml.__version__='1.7.1.dev3+gc55c1898.d20201119'
/Users/vjs275/miniforge3/envs/msrm2/lib/python3.8/site-packages/sklearn/model_selection/_search.py:278: UserWarning: The total space of parameters 1 is smaller than n_iter=10. Running 1 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
('daskxgbclassifier-fit-score-61fb6731d022ad077e150ddca561ff0e', 0, 0) has failed... retrying
('daskxgbclassifier-fit-score-61fb6731d022ad077e150ddca561ff0e', 0, 0) has failed... retrying
('daskxgbclassifier-fit-score-61fb6731d022ad077e150ddca561ff0e', 0, 0) has failed... retrying
('daskxgbclassifier-fit-score-61fb6731d022ad077e150ddca561ff0e', 0, 0) has failed... retrying
('daskxgbclassifier-fit-score-61fb6731d022ad077e150ddca561ff0e', 0, 2) has failed... retrying
('daskxgbclassifier-fit-score-61fb6731d022ad077e150ddca561ff0e', 0, 0) has failed... retrying

Anything else we need to know?: I first ran into this with xgboost 1.3 snapshots, then reran this MCVE with xgboost 1.2 to rule out a problem with the snapshot install.

Environment:

gforsyth commented 3 years ago

I think I got caught in an XY problem here. sklearn.model_selection.GridSearchCV works with this estimator, which I think resolves this particular issue, but I'll run some more tests.