dask-contrib / dask-sql

Distributed SQL Engine in Python using Dask
https://dask-sql.readthedocs.io/
MIT License

[BUG] `experiment_class` fails on GPU #943

Open sarahyurick opened 1 year ago

sarahyurick commented 1 year ago

In #886, we removed all dependencies on Dask-ML in favor of scikit-learn, cuML, and our own classes (ParallelPostFit and Incremental). Previously, when creating an experiment, experiment_class was expected to be a path to a dask_ml class, although sklearn classes were also found to be compatible. However, I couldn't get it to work with cuML classes such as cuml.model_selection.GridSearchCV. For example:

c.sql(
    """
    CREATE EXPERIMENT my_exp WITH (
    model_class = 'sklearn.ensemble.GradientBoostingClassifier',
    experiment_class = 'cuml.model_selection.GridSearchCV',
    tune_parameters = (n_estimators = ARRAY [16, 32, 2],learning_rate = ARRAY [0.1,0.01,0.001],
                       max_depth = ARRAY [3,4,5,10]),
    target_column = 'target'
) AS (
        SELECT x, y, x*y > 0 AS target
        FROM timeseries
        LIMIT 100
    )
    """
)

errors with:

INFO:dask_sql.physical.rel.custom.create_experiment:{'n_estimators': [16, 32, 2], 'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 4, 5, 10]}
INFO:dask_sql.physical.rel.custom.create_experiment:{}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [8], line 1
----> 1 c.sql(
      2     """
      3     CREATE EXPERIMENT my_exp WITH (
      4     model_class = 'sklearn.ensemble.GradientBoostingClassifier',
      5     experiment_class = 'cuml.model_selection.GridSearchCV',
      6     tune_parameters = (n_estimators = ARRAY [16, 32, 2],learning_rate = ARRAY [0.1,0.01,0.001],
      7                        max_depth = ARRAY [3,4,5,10]),
      8     target_column = 'target'
      9 ) AS (
     10         SELECT x, y, x*y > 0 AS target
     11         FROM timeseries
     12         LIMIT 100
     13     )
     14     """
     15 )

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/context.py:501, in Context.sql(self, sql, return_futures, dataframes, gpu, config_options)
    496 else:
    497     raise RuntimeError(
    498         f"Encountered unsupported `LogicalPlan` sql type: {type(sql)}"
    499     )
--> 501 return self._compute_table_from_rel(rel, return_futures)

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/context.py:830, in Context._compute_table_from_rel(self, rel, return_futures)
    829 def _compute_table_from_rel(self, rel: "LogicalPlan", return_futures: bool = True):
--> 830     dc = RelConverter.convert(rel, context=self)
    832     # Optimization might remove some alias projects. Make sure to keep them here.
    833     select_names = [field for field in rel.getRowType().getFieldList()]

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/physical/rel/convert.py:61, in RelConverter.convert(cls, rel, context)
     55     raise NotImplementedError(
     56         f"No relational conversion for node type {node_type} available (yet)."
     57     )
     58 logger.debug(
     59     f"Processing REL {rel} using {plugin_instance.__class__.__name__}..."
     60 )
---> 61 df = plugin_instance.convert(rel, context=context)
     62 logger.debug(f"Processed REL {rel} into {LoggableDataFrame(df)}")
     63 return df

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/physical/rel/custom/create_experiment.py:169, in CreateExperimentPlugin.convert(self, rel, context)
    167 search = ExperimentClass(model, {**parameters}, **experiment_kwargs)
    168 logger.info(tune_fit_kwargs)
--> 169 search.fit(
    170     X.to_dask_array(lengths=True),
    171     y.to_dask_array(lengths=True),
    172     **tune_fit_kwargs,
    173 )
    174 df = pd.DataFrame(search.cv_results_)
    175 df["model_class"] = model_class

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/model_selection/_search.py:786, in BaseSearchCV.fit(self, X, y, groups, **fit_params)
    783 X, y, groups = indexable(X, y, groups)
    784 fit_params = _check_fit_params(X, fit_params)
--> 786 cv_orig = check_cv(self.cv, y, classifier=is_classifier(estimator))
    787 n_splits = cv_orig.get_n_splits(X, y, groups)
    789 base_estimator = clone(self.estimator)

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/model_selection/_split.py:2331, in check_cv(cv, y, classifier)
   2326 cv = 5 if cv is None else cv
   2327 if isinstance(cv, numbers.Integral):
   2328     if (
   2329         classifier
   2330         and (y is not None)
-> 2331         and (type_of_target(y, input_name="y") in ("binary", "multiclass"))
   2332     ):
   2333         return StratifiedKFold(cv)
   2334     else:

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/utils/multiclass.py:286, in type_of_target(y, input_name)
    283 if sparse_pandas:
    284     raise ValueError("y cannot be class 'SparseSeries' or 'SparseArray'")
--> 286 if is_multilabel(y):
    287     return "multilabel-indicator"
    289 # DeprecationWarning will be replaced by ValueError, see NEP 34
    290 # https://numpy.org/neps/nep-0034-infer-dtype-is-object.html

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/utils/multiclass.py:152, in is_multilabel(y)
    150 warnings.simplefilter("error", np.VisibleDeprecationWarning)
    151 try:
--> 152     y = np.asarray(y)
    153 except (np.VisibleDeprecationWarning, ValueError):
    154     # dtype=object should be provided explicitly for ragged arrays,
    155     # see NEP 34
    156     y = np.array(y, dtype=object)

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask/array/core.py:1704, in Array.__array__(self, dtype, **kwargs)
   1702     x = x.astype(dtype)
   1703 if not isinstance(x, np.ndarray):
-> 1704     x = np.array(x)
   1705 return x

File cupy/_core/core.pyx:1473, in cupy._core.core._ndarray_base.__array__()

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
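
The last frames show scikit-learn calling np.asarray(y) on a Dask array whose chunks live on the GPU, and CuPy refusing the implicit conversion. As a minimal sketch of the explicit conversion that the message asks for, the following uses a stand-in class so that it runs without a GPU. FakeDeviceArray and to_host are hypothetical names for illustration only; they are not part of CuPy, Dask, or Dask-SQL.

```python
class FakeDeviceArray:
    """Stand-in for a GPU array (hypothetical, for illustration only).

    Like a CuPy array, it refuses implicit conversion through __array__
    and exposes an explicit .get() that returns a host-side copy.
    """

    # CuPy arrays expose this attribute; here it is only a marker.
    __cuda_array_interface__ = {}

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None):
        raise TypeError(
            "Implicit conversion to a NumPy array is not allowed. "
            "Please use `.get()` to construct a NumPy array explicitly."
        )

    def get(self):
        # Explicit device-to-host copy (here: just a plain list).
        return list(self._values)


def to_host(chunk):
    """Return a host copy of device-backed chunks; pass others through."""
    if hasattr(chunk, "__cuda_array_interface__"):
        return chunk.get()
    return chunk


device_y = FakeDeviceArray([1, 0, 1])
host_y = to_host(device_y)      # explicit conversion succeeds
unchanged = to_host([1, 0, 1])  # host data is returned as-is
```

With real data, the analogous step would presumably have to be applied per chunk before y reaches scikit-learn, at the cost of moving the data off the GPU. Whether that alone is sufficient is not verified here.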

Using model_class = 'xgboost.XGBClassifier' or model_class = 'xgboost.dask.DaskXGBClassifier' results in the same error as above.

When I try it with a model_class from cuML, more errors arise. For example, if I try it with model_class = 'cuml.dask.ensemble.RandomForestClassifier' (cuML has no GradientBoostingClassifier), scikit-learn raises:

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator <cuml.dask.ensemble.randomforestclassifier.RandomForestClassifier object at 0x7f0c5f692820> does not.
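
The check behind this message can be sketched in plain Python. This is a simplified stand-in, not scikit-learn's actual code: resolve_scoring and the two toy estimator classes are hypothetical names. It shows that the error only fires when no scoring is given and the estimator has no score method; whether supplying a scoring would be enough for the cuML estimator to work end to end is an open question.

```python
def resolve_scoring(estimator, scoring=None):
    """Simplified stand-in for the scoring check (hypothetical helper)."""
    if scoring is not None:
        # An explicit scoring bypasses the need for estimator.score.
        return scoring
    if hasattr(estimator, "score"):
        return estimator.score
    raise TypeError(
        "If no scoring is specified, the estimator passed should "
        "have a 'score' method. The estimator %r does not." % (estimator,)
    )


class NoScoreEstimator:
    """Toy estimator with fit/predict but no score method."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]


class WithScoreEstimator(NoScoreEstimator):
    """Same toy estimator plus an accuracy-style score method."""

    def score(self, X, y):
        predictions = self.predict(X)
        hits = sum(p == t for p, t in zip(predictions, y))
        return hits / len(y)


try:
    resolve_scoring(NoScoreEstimator())
    check_failed = False
except TypeError:
    check_failed = True  # mirrors the error reported above

explicit = resolve_scoring(NoScoreEstimator(), scoring="accuracy")
fallback = resolve_scoring(WithScoreEstimator())
```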

I tried a couple of different changes on the Dask-SQL side but have yet to find a solution. It's possible that this will require changes on the Dask and/or cuML side of things.

sarahyurick commented 1 year ago

We should also look into whether the model_class failures on the GPU with xgboost.XGBClassifier and xgboost.dask.DaskXGBClassifier are related to this issue.

Update: Opened https://github.com/dask-contrib/dask-sql/issues/1020

sarahyurick commented 1 year ago

After some investigation, it seems like the issue runs pretty deep. Assuming that we can make the necessary changes on the scikit-learn side, quite a few errors still pop up on the Dask and cuML sides as well.