dask / dask-xgboost

BSD 3-Clause "New" or "Revised" License

[Feature Request] Add support for `evals` in the `train` function for early stopping #38

Closed DigitalPig closed 4 years ago

DigitalPig commented 5 years ago

Hi,

Currently the dask-xgboost package does not support early_stopping.

I'm not super familiar with dask or xgboost internals, but my rough thinking is that it should just be a matter of passing the validation-set DMatrix object to the training function here, and that locating the validation set on each worker should work the same way as it currently does for the training dataset.

Please let me know if I am on the right track. I would love to try to get this eval functionality working for dask-xgboost, but I'd like to avoid any obvious traps since I am not very familiar with the xgboost/dask backends. Thank you!

mrocklin commented 5 years ago

Hi @DigitalPig ,

This package just sets up XGBoost and hands it data and keyword arguments. The normal XGBoost library then takes control and handles everything.

So for features like early stopping, you should probably ask in the XGBoost issue tracker rather than here. We can't control features like this.

DigitalPig commented 5 years ago

Hi @mrocklin, thank you for the reply! I totally understand that dask-xgboost is a "wrapper" here. The reason I am asking is that xgboost's documentation says its train function can handle early stopping as long as you pass it a DMatrix object, so I wanted to check whether there is a specific reason it isn't implemented here. If you have already considered this and hit a significant technical issue, then I would stop here too. :)

mmccarty commented 5 years ago

@TomAugspurger may be able to provide insight on why this is not supported.

TomAugspurger commented 5 years ago

I don't recall, sorry.

mmccarty commented 5 years ago

@TomAugspurger - No problem. I think we figured it out. Thank you for checking!

@DigitalPig - The native Dask integration should support early stopping. Here is an example. Notice the xgb.train call.

mrocklin commented 5 years ago

The reason I am asking is because xgboost's documentation says its train can handle early stopping as long as you pass a DMatrix object into it. That is the reason I am asking here to see if there is any specific reason why it is not implemented here. Because if you already thought about it before and had a significant technical issue then I would stop at here too. :)

Is it not implemented here? How would you ask XGBoost to do early stopping? Perhaps it is a keyword argument? Perhaps this library passes that keyword argument on to XGBoost?
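
The question here is purely one of keyword-argument pass-through. A minimal sketch of that pattern (the function names are illustrative stand-ins, not dask-xgboost's actual internals):

```python
# Sketch of keyword-argument pass-through in a thin wrapper.
# `underlying_train` stands in for xgb.train; names are hypothetical.
def underlying_train(params, data, **kwargs):
    # a real library would consume early_stopping_rounds here
    return kwargs

def wrapper_train(params, data, **kwargs):
    # if the wrapper forwards **kwargs, early stopping "just works"
    return underlying_train(params, data, **kwargs)

forwarded = wrapper_train({"max_depth": 3}, None, early_stopping_rounds=10)
print(forwarded)  # → {'early_stopping_rounds': 10}
```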

datametrician commented 5 years ago

As I shared with @mmccarty, I believe XGBoost's native Dask integration supports this: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.dask. I know the two are slightly different, but it's a good time to plug that XGBoost now supports Dask natively. Hopefully, in the future, we can Highlander this and there will be only one option.

mrocklin commented 5 years ago

Understood and agreed. I'm still curious about in what way this implementation doesn't support early stopping.

mmccarty commented 5 years ago

The call to xgb.train (https://github.com/dask/dask-xgboost/blob/4661c8a1d3a7f6b63ff994b944b6a6231e7c9f31/dask_xgboost/core.py#L90) passes the kwargs from the public method, so I agree this should support early stopping for train.

However, fit (https://github.com/dask/dask-xgboost/blob/4661c8a1d3a7f6b63ff994b944b6a6231e7c9f31/dask_xgboost/core.py#L281) does not pass kwargs to train, so it would not support early stopping. I'm not sure why this is the case.

mrocklin commented 5 years ago

I wonder if this is handled by the get_xgb_params line just above?

mrocklin commented 5 years ago

Ah, it looks like the XGBoost code actually puts kwargs into the fit call, which is a break from Scikit-Learn convention. (I think? Is this right @TomAugspurger ? )

https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor.fit

We can pass through **kwargs here, but if this is genuinely a break from convention then we should probably also raise an issue upstream with XGBoost.
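
For reference, the scikit-learn convention at issue: configuration lives in `__init__`, and `fit` takes only data-dependent arguments. A toy illustration (not xgboost code):

```python
# Toy estimator illustrating the scikit-learn convention:
# hyperparameters go to __init__, only data-dependent inputs go to fit.
class ConventionalEstimator:
    def __init__(self, early_stopping_rounds=None):
        # under the convention, a setting like this belongs here...
        self.early_stopping_rounds = early_stopping_rounds

    def fit(self, X, y, sample_weight=None):
        # ...while fit takes only per-call, data-dependent arguments
        # (XGBRegressor.fit breaks this by also accepting settings)
        self.n_features_ = len(X[0])
        return self

est = ConventionalEstimator(early_stopping_rounds=5).fit([[1, 2]], [0])
print(est.n_features_)  # → 2
```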

TomAugspurger commented 5 years ago

It's a break with convention (but we could probably mirror the break, I'm not sure).

Something like sample_weights has to be passed here since it's data dependent.

I think supporting early_stopping_rounds requires supporting eval_set. I'm not sure if there are potential issues there with eval_set being a distributed dataframe or array.
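
The dependency described above can be seen in a stripped-down early-stopping loop: without a held-out eval set there is nothing to monitor. A schematic sketch of the logic, not xgboost internals:

```python
# Schematic early stopping: track the eval-set score each boosting
# round and stop after `patience` rounds with no improvement.
# `scores_per_round` stands in for per-round validation losses.
def train_with_early_stopping(scores_per_round, patience):
    best_score, best_round = float("inf"), 0
    for round_num, score in enumerate(scores_per_round):
        if score < best_score:
            best_score, best_round = score, round_num
        elif round_num - best_round >= patience:
            break  # no improvement for `patience` rounds: stop
    return best_round, best_score

# validation loss improves, then plateaus; training stops early
best_round, best_score = train_with_early_stopping(
    [0.9, 0.7, 0.6, 0.65, 0.66, 0.64, 0.63, 0.7], patience=3
)
print(best_round, best_score)  # → 2 0.6
```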

mmccarty commented 5 years ago

I wonder if this is handled by the get_xgb_params line just above?

I don't think so. Walking through the code, those are only the params dict passed into xgb.train, whereas early stopping requires early_stopping_rounds as a separate keyword argument.

mmccarty commented 5 years ago

I think supporting early_stopping_rounds requires supporting eval_set. I'm not sure if there are potential issues there with eval_set being a distributed dataframe or array.

Perhaps that is why it wasn't supported originally, judging by the docstring.

Ah, it looks like the XGBoost code actually puts kwargs into the fit call, which is a break from Scikit-Learn convention.

Interesting. If that's the convention, it seems that xgb.sklearn.XGBModel, which is the base class for XGBRegressor, also breaks it.

mmccarty commented 5 years ago

I've got an initial PR for this that I will push up for feedback as soon as I can.