dask / dask-ml

Scalable Machine Learning with Dask

Feature Request: sklearn.utils.estimator_checks, but with Dask collections #796

Open jameslamb opened 3 years ago

jameslamb commented 3 years ago

Feature Request

For every check in sklearn.utils.estimator_checks that generates pandas DataFrames, scipy sparse arrays, or numpy arrays, implement an equivalent check in dask-ml that generates small Dask collections instead.

How this might improve dask-ml

Adding this feature might be one piece of providing a standardized path for writing scikit-learn compatible estimators that use Dask and take in data as Dask collections, like a Dask equivalent to "Developing scikit-learn estimators". This might give projects like xgboost, lightgbm, cuml, and others a target to hit and encourage a greater degree of consistency between them.


scikit-learn encourages the development of estimators that follow a specific API. This specification is described in detail in "Developing scikit-learn estimators".

To help projects that maintain scikit-learn-compatible estimators detect incompatibilities with different scikit-learn versions, scikit-learn provides a collection of checks that can be run in unit tests. These are in the submodule sklearn.utils.estimator_checks: https://github.com/scikit-learn/scikit-learn/blob/31b34b560de57a049dd435dccc55112271322370/sklearn/utils/estimator_checks.py#L2194.

LightGBM's unit tests are an example of how a project might use sklearn.utils.estimator_checks: https://github.com/microsoft/LightGBM/blob/eda1effb52b38cbad8f9cf7c28952f1077fc3c76/tests/python_package_test/test_sklearn.py#L1150-L1193

Unfortunately, many of these tests cannot be run for the scikit-learn-compatible estimators in xgboost.dask and lightgbm.dask (and their predecessors, dask-xgboost and dask-lightgbm). Many of those checks generate small test numpy arrays for training or validation data. Since xgboost.dask and lightgbm.dask scikit-learn estimators only accept data in Dask collections, it's not possible to use a pattern like this:

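import lightgbm as lgb
from sklearn.utils.estimator_checks import parametrize_with_checks

def _tested_estimators():
    for Estimator in [lgb.DaskLGBMClassifier, lgb.DaskLGBMRegressor]:
        yield Estimator()

# Generate one test case per (estimator, scikit-learn check) pair.
@parametrize_with_checks(list(_tested_estimators()))
def test_sklearn_integration(estimator, check, request):
    estimator.set_params(min_child_samples=1, min_data_in_bin=1)
    check(estimator)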

Most of the checks fail with errors like the following (this one is from LightGBM, but xgboost.dask behaves similarly):

E TypeError: Data must be either Dask Array or Dask DataFrame. Got <class 'numpy.ndarray'>.
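
To make the request more concrete, here is a minimal sketch of what a Dask-flavored check could look like. Everything in it is hypothetical: make_small_dask_data, check_fit_returns_self_dask, and DaskOnlyMajorityClassifier are illustrative names, not existing dask-ml or scikit-learn APIs, and the toy estimator only stands in for something like lgb.DaskLGBMClassifier. The check follows the (name, estimator) calling convention used by the checks in sklearn.utils.estimator_checks, but generates Dask Arrays instead of numpy arrays.

```python
import dask.array as da
import numpy as np


def make_small_dask_data(n_samples=40, n_features=4, chunk_rows=10, seed=0):
    """Hypothetical helper: a small binary-classification dataset as Dask Arrays."""
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(n_samples, n_features))
    y = (X[:, 0] > 0).astype(int)
    # Chunk along rows only, so that the partitions of X and y line up.
    dX = da.from_array(X, chunks=(chunk_rows, n_features))
    dy = da.from_array(y, chunks=chunk_rows)
    return dX, dy


def check_fit_returns_self_dask(name, estimator):
    """Hypothetical check: fit() on Dask collections should return the estimator."""
    X, y = make_small_dask_data()
    fitted = estimator.fit(X, y)
    assert fitted is estimator, f"{name}.fit() did not return self"


class DaskOnlyMajorityClassifier:
    """Toy stand-in (an assumption) for an estimator that only accepts Dask input."""

    def fit(self, X, y):
        if not isinstance(X, da.Array):
            raise TypeError(
                f"Data must be either Dask Array or Dask DataFrame. Got {type(X)}."
            )
        # Majority class of the binary labels, materialized from the lazy mean.
        self.majority_ = int(y.mean().compute() >= 0.5)
        return self


# The Dask-based check passes where a numpy-based check would raise TypeError.
check_fit_returns_self_dask(
    "DaskOnlyMajorityClassifier", DaskOnlyMajorityClassifier()
)
```

A real implementation would presumably need one such variant for each scikit-learn check that builds its own data, plus DataFrame-based versions. This sketch covers only a single Dask Array case.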

Notes for Reviewers

Sorry this is so open-ended. If there's anything I can do to make the request more specific or clearer, please let me know.

If there is interest in this, I'd be happy to help contribute pieces of it!

Thanks for your time and consideration.

TomAugspurger commented 3 years ago

+1 to the general concept, and I think that dask-ml is a fine home for this utility to live.