dask / dask-xgboost

BSD 3-Clause "New" or "Revised" License

Early stopping eval_set is array in memory this can be problematic for large datasets #63

Open kylejn27 opened 4 years ago

kylejn27 commented 4 years ago

When using early stopping, the eval set must be a numpy array, which is then duplicated across workers. This is fine for small eval sets, but larger ones can easily push workers over their memory caps.

The DaskDMatrix concept from dmlc/xgboost/dask.py seems like a good way to handle this. Maybe something that mimics that functionality could be implemented in this library.

I'd be happy to take a crack at this. Rather than reworking the whole library to use a DaskDMatrix, it's probably simpler to do this just for the eval_set data: https://github.cloud.capitalone.com/dask/dask-xgboost/blob/master/dask_xgboost/core.py#L167-L203
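The idea could be sketched roughly like this (plain Python, no dask dependency; every name below is an illustrative stand-in, not dask-xgboost or xgboost API): keep the eval set as partition references and let each worker materialize only the chunks assigned to it, instead of broadcasting one concatenated array to all workers.

```python
# Hypothetical sketch of partition-local eval_set handling.
# None of these functions exist in dask-xgboost; they only
# illustrate the proposed memory behavior.

def partitioned(data, n_parts):
    """Split data into n_parts roughly equal chunks (stand-in for dask chunks)."""
    size = len(data) // n_parts
    return [data[i * size:(i + 1) * size] for i in range(n_parts)]

def assign_round_robin(partitions, n_workers):
    """Assign partition *references* to workers without copying the data."""
    assignment = {w: [] for w in range(n_workers)}
    for i, part in enumerate(partitions):
        assignment[i % n_workers].append(part)
    return assignment

def local_eval_set(parts):
    """Each worker concatenates only its own partitions."""
    return [x for part in parts for x in part]

eval_data = list(range(1000))
parts = partitioned(eval_data, 10)
per_worker = assign_round_robin(parts, 4)

# Every element lives on exactly one worker, so peak per-worker memory
# is a fraction of the full eval set rather than a full copy of it.
total = sum(len(local_eval_set(p)) for p in per_worker.values())
largest = max(len(local_eval_set(p)) for p in per_worker.values())
```

Under this scheme the largest per-worker share is bounded by its assigned partitions (here 300 of 1000 elements), whereas the current behavior gives every worker all 1000.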

Example

import dask_xgboost as dxgb
from dask.distributed import LocalCluster, Client
from dask_ml.datasets import make_regression

client = Client(LocalCluster(dashboard_address=":8887", memory_limit="100Mb"))

regress_kwargs = dict(n_features=60, chunks=100, random_state=0)
X_train, y_train = make_regression(n_samples=400000, **regress_kwargs)
# this produces data to push memory limits
X_test, y_test = make_regression(n_samples=180000, **regress_kwargs)

xgb_options = {'seed': 0,
               'tree_method': 'hist',
               'obj': 'rmse',
               'verbose': True}

model = dxgb.XGBRegressor(**xgb_options)

model.fit(X_train,
          y_train,
          eval_set=[(X_test.compute(), y_test.compute())],
          early_stopping_rounds=5,
          eval_metric='rmse')

After running this, you should see KilledWorker exceptions.

kylejn27 commented 4 years ago

@mmccarty