dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Errors with sklearn RandomizedGridSearch and DaskXGBoost #798

Open hayesgb opened 3 years ago

hayesgb commented 3 years ago

What happened: When running a hyperparameter search with sklearn's RandomizedSearchCV and xgboost's DaskXGBClassifier, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/anaconda3/envs/daskml3/lib/python3.7/site-packages/sklearn/utils/validation.py in _num_samples(x)
    209     try:
--> 210         return len(x)
    211     except TypeError as type_error:

TypeError: 'float' object cannot be interpreted as an integer

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-8-b0953fbb1d6e> in <module>
----> 1 clf.fit(X, y)

~/anaconda3/envs/daskml3/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

~/anaconda3/envs/daskml3/lib/python3.7/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    757             refit_metric = self.refit
    758 
--> 759         X, y, groups = indexable(X, y, groups)
    760         fit_params = _check_fit_params(X, fit_params)
    761 

~/anaconda3/envs/daskml3/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
    297     """
    298     result = [_make_indexable(X) for X in iterables]
--> 299     check_consistent_length(*result)
    300     return result
    301 

~/anaconda3/envs/daskml3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    257     """
    258 
--> 259     lengths = [_num_samples(X) for X in arrays if X is not None]
    260     uniques = np.unique(lengths)
    261     if len(uniques) > 1:

~/anaconda3/envs/daskml3/lib/python3.7/site-packages/sklearn/utils/validation.py in <listcomp>(.0)
    257     """
    258 
--> 259     lengths = [_num_samples(X) for X in arrays if X is not None]
    260     uniques = np.unique(lengths)
    261     if len(uniques) > 1:

~/anaconda3/envs/daskml3/lib/python3.7/site-packages/sklearn/utils/validation.py in _num_samples(x)
    210         return len(x)
    211     except TypeError as type_error:
--> 212         raise TypeError(message) from type_error
    213 
    214 

TypeError: Expected sequence or array-like, got <class 'dask.array.core.Array'>

What you expected to happen: Return a "best_estimator" after completing the gridsearch.

An alternative approach would be to use dask_ml.model_selection.RandomizedSearchCV(), but that runs into the issue reported in #758. It's my understanding from the documentation that the DaskXGBClassifier passes the Dask Array to a distributed DMatrix for parallel training, so parallel search in conjunction with distributed training is a challenge to implement.

Minimal Complete Verifiable Example:

from dask.distributed import Client
import dask.array as da
import dask.dataframe as dd
import numpy as np
import xgboost as xgb

from dask_ml.datasets import make_classification
import sklearn.model_selection as ms
from scipy.stats import uniform

client = Client()
client

X, y = make_classification(chunks=100)

# Append an integer column by round-tripping through a dask DataFrame
X_col = dd.from_array(da.from_array(np.random.randint(0, 2, size=X.shape[0]))).to_frame()
X = dd.from_array(X)
X = dd.concat([X, X_col], axis=1)
X = X.to_dask_array()

param_dict = {
    'estimator__max_depth': uniform(0, 1000),
    'estimator__subsample': uniform(0, 100),
    'estimator__colsample_bytree': uniform(0, 10),
    'estimator__n_estimators': uniform(10, 1000),
    'estimator__reg_lambda': uniform(0, 10000),
    'estimator__ccp_alpha': uniform(0, 10000),
    'estimator__gamma': uniform(0, 20),
    'estimator__scale_pos_weight': uniform(0, 1000),
}

clf = ms.RandomizedSearchCV(xgb.dask.DaskXGBClassifier(), param_dict)
clf.fit(X, y)

Anything else we need to know?:

Environment:

hayesgb commented 3 years ago

If you call

X = X.compute_chunk_sizes()

before calling .fit(), then the search proceeds as expected.

This becomes particularly hard to troubleshoot when you use dask_ml.preprocessing.OneHotEncoder(), either alone or in a pipeline.