Feature Request

For any checks in sklearn.utils.estimator_checks that generate pandas DataFrames, scipy sparse arrays, or numpy arrays, implement equivalent checks in dask-ml that instead generate small Dask collections.
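As a rough illustration of the kind of check being proposed, here is a sketch modeled on the (name, estimator_orig) convention used by sklearn.utils.estimator_checks. The function name, its placement in dask-ml, and the array sizes/chunking are all assumptions for illustration, not existing API.

    import dask.array as da
    import numpy as np
    from sklearn.base import clone


    def check_fit_accepts_dask_array(name, estimator_orig):
        # Build a small, two-chunk Dask Array instead of the numpy array
        # that the corresponding scikit-learn check would create.
        rng = np.random.RandomState(0)
        X = da.from_array(rng.uniform(size=(20, 3)), chunks=(10, 3))
        y = da.from_array(rng.randint(0, 2, size=20), chunks=10)

        estimator = clone(estimator_orig)
        # fit should accept Dask collections and return self.
        assert estimator.fit(X, y) is estimator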
How this might improve dask-ml
Adding this feature might be one piece of providing a standardized path for writing scikit-learn compatible estimators that use Dask and take in data as Dask collections, like a Dask equivalent to "Developing scikit-learn estimators". This might give projects like xgboost, lightgbm, cuml, and others a target to hit and encourage a greater degree of consistency between them.
Background
scikit-learn encourages the development of estimators that follow a specific API. This specification is described in detail in "Developing scikit-learn estimators".

To help projects that maintain scikit-learn-compatible estimators detect incompatibilities with different scikit-learn versions, the project provides a collection of checks that can be run in unit tests. These live in the submodule sklearn.utils.estimator_checks: https://github.com/scikit-learn/scikit-learn/blob/31b34b560de57a049dd435dccc55112271322370/sklearn/utils/estimator_checks.py#L2194. You can see LightGBM's unit tests for an example of how a project might use things from sklearn.utils.estimator_checks: https://github.com/microsoft/LightGBM/blob/eda1effb52b38cbad8f9cf7c28952f1077fc3c76/tests/python_package_test/test_sklearn.py#L1150-L1193.

Unfortunately, many of these checks cannot be run for the scikit-learn-compatible estimators in xgboost.dask and lightgbm.dask (and their predecessors, dask-xgboost and dask-lightgbm). Many of those checks generate small numpy arrays for training or validation data. Since the xgboost.dask and lightgbm.dask scikit-learn estimators only accept data as Dask collections, it's not possible to use a pattern like this:
    import lightgbm as lgb
    from sklearn.utils.estimator_checks import parametrize_with_checks


    def _tested_estimators():
        for Estimator in [lgb.DaskLGBMClassifier, lgb.DaskLGBMRegressor]:
            yield Estimator()


    # runs every applicable scikit-learn estimator check against each estimator
    @parametrize_with_checks(list(_tested_estimators()))
    def test_sklearn_integration(estimator, check, request):
        estimator.set_params(min_child_samples=1, min_data_in_bin=1)
        check(estimator)
Most of the checks will fail with errors like the following (this one is from LightGBM, but xgboost.dask has similar behavior).
E TypeError: Data must be either Dask Array or Dask DataFrame. Got <class 'numpy.ndarray'>.
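For context, a minimal reproduction of that failure might look like the sketch below. This is an assumption-laden illustration: the cluster setup is arbitrary and the exact message text may vary across LightGBM releases.

    import lightgbm as lgb
    import numpy as np
    from dask.distributed import Client, LocalCluster

    # DaskLGBM* estimators require a running Dask client and Dask collections.
    client = Client(LocalCluster(n_workers=1))

    X = np.random.uniform(size=(100, 5))
    y = np.random.randint(0, 2, size=100)

    clf = lgb.DaskLGBMClassifier(min_child_samples=1)
    # Passing numpy arrays, as the scikit-learn checks do internally, raises:
    # TypeError: Data must be either Dask Array or Dask DataFrame. Got <class 'numpy.ndarray'>.
    clf.fit(X, y)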
Notes for Reviewers
Sorry this is so open-ended. If there's anything I can do to make the request more specific or clearer, please let me know.
If there is interest in this, I'd be happy to help contribute pieces of it!
Thanks for your time and consideration.