dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License
893 stars 255 forks

RandomizedSearchCV crashes with XGBoost #735

Open grzegorzkuprewicz-kinesso opened 4 years ago

grzegorzkuprewicz-kinesso commented 4 years ago

What happened: When using RandomizedSearchCV (either from dask-ml or from sklearn with dask's backend) with xgboost (version 1.2.0), the script crashes in most runs. Occasionally it completes successfully with the same code and data, which makes the issue harder to diagnose. The lack of information also makes debugging hard: the kernel dies, sometimes with the Windows error “Instruction at Referenced Memory Could Not Be Read”.

Runs:
- Sklearn RandomizedSearchCV + xgboost: successful
- Sklearn RandomizedSearchCV with dask backend + xgboost: crash (sometimes successful)
- Dask RandomizedSearchCV with dask backend + xgboost: crash (sometimes successful)
- Dask RandomizedSearchCV with dask-xgboost: crash
- dask-xgboost: error that numpy.array does not have a "to_delayed" method, although only dask DataFrames or dask Arrays were given

What you expected to happen: To be able to use RandomizedSearchCV from dask with xgboost.

Minimal Complete Verifiable Example:

I've attached the PoC Jupyter notebook I used during testing. The folder also contains some screenshots. https://www.webcargo.net/l/17cuoKPByt/

Anything else we need to know?: I wasn't able to use xgboost==0.90 because RandomizedSearchCV fails with the error "XGBoostError: need to call fit or load_model beforehand".

Environment:
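Version details for a report like this can be gathered with a short script. The following is a minimal sketch using only the Python standard library; the package list in `PACKAGES` and the helper name `collect_environment` are assumptions chosen for illustration, not part of dask-ml or of the original report.

```python
import platform
from importlib import metadata

# Assumed list of packages relevant to this report; adjust as needed.
PACKAGES = ("dask", "distributed", "dask-ml", "scikit-learn", "xgboost", "joblib", "numpy")


def collect_environment(packages=PACKAGES):
    """Return a dict with Python/OS details and installed package versions."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages explicitly instead of failing.
            info[name] = "not installed"
    return info


# Print one "name: version" line per entry, ready to paste into an issue.
for key, value in collect_environment().items():
    print(f"{key}: {value}")
```

Pasting the printed lines under the Environment heading gives maintainers the exact versions involved.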

TomAugspurger commented 4 years ago

It's hard to say what's going on here. Anything you can do to minimize the problem would be welcome (http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports)
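A reduced reproducer along these lines might look like the sketch below. It is not the reporter's notebook: the synthetic data, the parameter ranges, the in-process `Client(processes=False)`, and the helper names `make_data`, `run_search`, and `reproduce_with_dask` are all assumptions made for illustration. The search is written once and parameterized by joblib backend, so the same code can be run with a local backend (the case reported as working) and with the "dask" backend (the case reported as crashing).

```python
import joblib
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV


def make_data(n_samples=300, n_features=10, seed=0):
    """Small synthetic classification problem (stand-in for the real data)."""
    return make_classification(
        n_samples=n_samples, n_features=n_features, random_state=seed
    )


def run_search(estimator, param_distributions, X, y, backend="loky", n_iter=5, seed=0):
    """Fit a RandomizedSearchCV under the given joblib backend and return it."""
    search = RandomizedSearchCV(
        estimator,
        param_distributions,
        n_iter=n_iter,
        cv=3,
        n_jobs=-1,
        random_state=seed,
    )
    with joblib.parallel_backend(backend):
        search.fit(X, y)
    return search


def reproduce_with_dask(n_iter=5):
    """Run the reported combination: xgboost + sklearn search on the dask backend."""
    # Imported lazily so the helpers above work without dask or xgboost installed.
    from dask.distributed import Client
    from xgboost import XGBClassifier

    # Hypothetical search space; the original notebook's ranges are unknown.
    params = {
        "max_depth": randint(2, 8),
        "learning_rate": uniform(0.01, 0.3),
        "n_estimators": randint(20, 100),
    }
    X, y = make_data()
    # The "dask" joblib backend needs an active Client; this one runs in-process.
    with Client(processes=False):
        return run_search(
            XGBClassifier(n_jobs=1), params, X, y, backend="dask", n_iter=n_iter
        )
```

Calling `reproduce_with_dask()` several times in a fresh interpreter, and comparing against `run_search(..., backend="loky")` with the same estimator, would show whether the intermittent failure depends on the backend alone. If the crash only appears with separate worker processes, switching the `Client` arguments is the next variable to try.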