When using early stopping, the eval set must be a numpy array, which is then duplicated across all workers. This causes no problem with small eval sets, but when larger eval sets are needed, workers can easily get pushed past their memory limits.
The DaskDMatrix concept from dmlc/xgboost/dask.py seems like a great way to handle this. Maybe something that mimics that functionality could be implemented in this library.
I'd be happy to take a crack at this. Rather than reworking the whole library to work with a DaskDMatrix, it's probably simpler to do this with just the eval_set data, following what https://github.cloud.capitalone.com/dask/dask-xgboost/blob/master/dask_xgboost/core.py#L167-L203 already does for the training data (sketch below).
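Something along these lines could route the eval set the same way those lines route the training parts, so each worker only ever holds its own slice of the eval data. This is just a sketch under the assumption that the eval set arrives as dask collections; the helper name and grouping details are made up, not existing dask-xgboost code:

```python
from collections import defaultdict

from dask.delayed import delayed
from dask.distributed import wait


def eval_parts_by_worker(client, X_eval, y_eval):
    """Map worker address -> futures of co-located (X_chunk, y_chunk) eval pairs."""
    # Break the dask collections into aligned (X, y) chunk pairs
    x_parts = X_eval.to_delayed().flatten().tolist()
    y_parts = y_eval.to_delayed().flatten().tolist()
    parts = [delayed(tuple)(pair) for pair in zip(x_parts, y_parts)]

    # Materialize the chunks on the cluster, never on the client
    parts = client.compute(parts)
    wait(parts)

    # Ask the scheduler where each chunk landed and group by worker, so a
    # worker only sees the portion of the eval set it already stores
    key_to_part = {part.key: part for part in parts}
    who_has = client.who_has(parts)
    worker_map = defaultdict(list)
    for key, workers in who_has.items():
        worker_map[workers[0]].append(key_to_part[key])
    return worker_map
```

Each worker could then concatenate its own pairs into a DMatrix for evals inside its training call, instead of receiving a full copy of the eval set.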
Example
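A minimal reproduction would be along these lines (the cluster settings, array sizes, and keyword arguments here are illustrative assumptions; the point is that the numpy eval_set gets copied whole to every 1 GB worker):

```python
import numpy as np
import dask.array as da
import dask_xgboost as dxgb
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="1GB")
client = Client(cluster)

# Training data stays distributed as dask arrays
X = da.random.random((1_000_000, 100), chunks=(10_000, 100))
y = da.random.random(1_000_000, chunks=10_000)

# Early stopping requires numpy here, so this ~2.4 GB eval set is serialized
# and shipped in full to each worker
X_val = np.random.random((3_000_000, 100))
y_val = np.random.random(3_000_000)

bst = dxgb.train(
    client,
    {"objective": "reg:squarederror", "eval_metric": "rmse"},
    X, y,
    eval_set=[(X_val, y_val)],
    early_stopping_rounds=5,
)
```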
After running this you should see `KilledWorker` exceptions.