DaskXGBClassifier and dask-ml GridSearchCV throws TypeErrors with Dask Arrays.

mseflek commented 3 years ago

I believe this is related to #521. When using dask-ml with xgboost.dask.DaskXGBClassifier and dask_ml.model_selection.GridSearchCV, you run into TypeErrors suggesting that the estimator expects dask.array.core.Array types, even while using Dask Arrays.

MCVE:

import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask_ml.model_selection import GridSearchCV
from xgboost.dask import DaskXGBClassifier
from dask.distributed import Client
from sklearn.datasets import load_iris

if __name__ == '__main__':

    client = Client()

    data = load_iris()
    x = da.from_array(data.data)
    y = da.from_array(data.target)

    estimator = DaskXGBClassifier(objective='multi:softmax', num_class=4)
    grid_search = GridSearchCV(
        estimator,
        param_grid={
            'n_estimators': np.arange(15, 105, 15)
        },
    )

    grid_search.fit(x, y)
    results = pd.DataFrame(grid_search.cv_results_)
    print(results.to_string())

Error:

distributed.worker - WARNING - Compute Failed                                                                                                                   
Function:  fit_and_score                                                                                                                                        
args:      (DaskXGBClassifier(num_class=4, objective='multi:softmax'), <dask_ml.model_selection.methods.CVCache object at 0x7f0be9fa4f40>, array([[5.1, 3.5, 1.4
, 0.2],                                                                                                                                                         
       [4.9, 3. , 1.4, 0.2],                                                                                                                                    
       [4.7, 3.2, 1.3, 0.2],                                                                                                                                    
       [4.6, 3.1, 1.5, 0.2],                                                                                                                                    
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
...
                                                                                                                                                  kwargs:    {}                                                                                                                                                   
Exception: TypeError("Expecting <class 'dask.dataframe.core.DataFrame'> or <class 'dask.array.core.Array'>.  Got <class 'numpy.ndarray'>")

Environment:

Dask version: 2021.04.0
Dask-ml version: 1.9.0
Python version: 3.8.5
Operating System: Ubuntu 20.04
Install method (conda, pip, source): pip

jakirkham commented 3 years ago

I think the expectation is that param_grid is given a Dask Array as well

mseflek commented 3 years ago

Unfortunately that doesn't seem to be the case. I changed it to:

param_grid={
            'n_estimators': da.from_array(np.arange(15, 105, 15))
        },

and got this:

ValueError: Parameter grid for parameter (n_estimators) needs to be a list or numpy array, but got (<class 'dask.array.core.Array'>). Single values need to be wrapped in a list with one element.

jakirkham commented 3 years ago

cc @hcho3 @trivialfis (in case either of you have thoughts 🙂)

trivialfis commented 3 years ago

TL;DR: Use xgb.XGBClassifier with dask_ml.model_selection.GridSearchCV.
Longer story: The GridSearchCV is designed to extend single node algorithms, like estimators in scikit-learn that accepts numpy array. However, xgb.dask.DaskXGBClassifier itself is distributed and accepts dask collections only. The GridSearchCV divides data and runs the estimators with different hyper-parameter on some local data partitions, instead of running them on the distributed data structure. To support a distributed GridSearchCV, we need to implement a new version of this function that can pass the dask collection directly into estimators.

mseflek commented 3 years ago

I see. Thanks for the insight @trivialfis and @jakirkham. Not sure if this is already in the documentation but would be great if that could be clarified.

I'll run some benchmarks on my end as well, but want to ask if there's a best practice here. What would you suggest if you are in a single node, multi-GPU situation and want to use GridSearchCV and XGBoost for hyperparameter tuning? Just to get it to work I'm using xgb.dask.DaskXGBClassifier and Scikit's GridSearchCV but wondering if a different strategy would be more efficient.

Also should I convert this to a feature request so we can keep track of your suggested revision?

Thanks!

jakirkham commented 3 years ago

Yeah we were working on getting some documentation moved over as well. Maybe it is worth looking at PR ( https://github.com/dmlc/xgboost/pull/6970 ) to see if this answers your question or if there are things we should be capturing there?

@trivialfis should be able to advise better on usage details. So will let him follow up on those

mseflek commented 3 years ago

Answering my own benchmarking question: on 4 x Nvidia V100 GPUs:

`xgboost.XGBClassifier` with `dask_ml.model_selection.GridSearchCV`

import dask.array as da
import dask.dataframe as dd
import numpy as np
from sklearn.datasets import load_iris

from dask_cuda import LocalCUDACluster
from dask_ml.model_selection import GridSearchCV
from xgboost import XGBClassifier
from dask.distributed import Client
import time

cluster = LocalCUDACluster()
client =  Client(cluster)

data = load_iris()
x = da.from_array(data.data)
y = da.from_array(data.target)

start = time.time()
estimator = XGBClassifier(objective='multi:softmax', num_class=4, tree_method='gpu_hist')

grid_search = GridSearchCV(
    estimator,
    param_grid={
        'n_estimators': np.arange(15, 105, 15)
    },
)

grid_search.fit(x, y)
print(time.time() - start)

Output: 32 seconds

`xgboost.dask.DaskXGBClassifier` with `sklearn.model_selection.GridSearchCV`

import dask.array as da
import dask.dataframe as dd
import numpy as np
from sklearn.datasets import load_iris

from sklearn.model_selection import GridSearchCV as skGridCV
from xgboost.dask import DaskXGBClassifier
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import time

cluster = LocalCUDACluster()
client =  Client(cluster)

data = load_iris()
x = da.from_array(data.data)
y = da.from_array(data.target)

start = time.time()
estimator = DaskXGBClassifier(objective='multi:softmax', num_class=4, tree_method='gpu_hist')
estimator.client = client

grid_search = skGridCV(
    estimator,
    param_grid={
        'n_estimators': np.arange(15, 105, 15)
    },
)

grid_search.fit(x, y)
print(time.time() - start)

Output: 11 seconds.

trivialfis commented 3 years ago

I think using the sklearn grid search will pull the data to workers arbitrarily.

Also should I convert this to a feature request so we can keep track of your suggested revision?

Yup, that would be great.

mseflek commented 3 years ago

I think using the sklearn grid search will pull the data to workers arbitrarily.

Meaning that the results from the GridsearchCV will be invalid?

trivialfis commented 3 years ago

The result should be correct. Just quite inefficient.

dask / dask-ml