dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Distributed hyperparameter optimization for dask. #6525

Open trivialfis opened 3 years ago

trivialfis commented 3 years ago

Right now, integration between dask and various single-node machine learning libraries is implemented as standalone dask extensions like dask-ml and dask-optuna. These can be used with xgboost when xgboost is performing single-node training, that is, by using XGBRegressor and friends with them instead of xgboost.dask.DaskXGBRegressor. If users want to train one model on the entire distributed dataset, the dask interface is required. The underlying issue is that xgboost is itself a distributed learning library employing an MPI-like communication framework, while those extensions are designed to extend single-node libraries. To resolve this, we need to design Python wrappers that can glue them together.
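For illustration only, a minimal sketch of the two usage patterns described above; the toy data, cluster setup and hyperparameter grid are assumptions, not part of this issue:

```python
import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster
from dask_ml.model_selection import GridSearchCV

client = Client(LocalCluster(n_workers=2))

X = da.random.random((10_000, 20), chunks=(1_000, 20))
y = da.random.random((10_000,), chunks=(1_000,))

# 1) Single-node estimator: dask-ml parallelises the *search*, but each
#    candidate model is trained on one worker from in-memory data.
search = GridSearchCV(
    xgb.XGBRegressor(tree_method="hist"),
    {"max_depth": [3, 6], "learning_rate": [0.1, 0.3]},
)
search.fit(X.compute(), y.compute())

# 2) Distributed estimator: one model is trained across all workers, but it
#    does not plug into dask-ml's searchers out of the box.
dist = xgb.dask.DaskXGBRegressor(tree_method="hist")
dist.client = client
dist.fit(X, y)
```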

Optuna is an exception, since it works through xgboost's callback functions, so the xgboost.dask interface can be adapted to optuna. I will submit some changes with demos later. Others, like grid search, are more difficult to implement.
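For context, a hedged sketch of what that adaptation could look like: optuna drives the hyperparameter search while each trial trains one distributed model through xgboost.dask. The pruning callback is omitted here, and the `client`, `X` and `y` objects are assumed to exist already:

```python
import optuna
import xgboost as xgb

def objective(trial):
    params = {
        "objective": "reg:squarederror",
        "tree_method": "hist",
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "eta": trial.suggest_float("eta", 1e-3, 0.3, log=True),
    }
    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    output = xgb.dask.train(
        client,
        params,
        dtrain,
        num_boost_round=100,
        evals=[(dtrain, "train")],
    )
    # output["history"] holds the evaluation log gathered from the workers.
    return output["history"]["train"]["rmse"][-1]

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
```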

Related: https://github.com/dmlc/xgboost/issues/5347

cc @pseudotensor @sandys

sandys commented 3 years ago

Thank you for taking this forward.

We are also seriously looking at Ray Distributed, because of its native integration with kubernetes ( https://docs.ray.io/en/master/cluster/kubernetes.html ) and its success at Ant Financial in China.

My thought is that you won't be able to make it generic enough to fit all frameworks, and it's probably engineering overkill to try.

The one thing I would strongly request you to align towards is kubernetes compatibility, because it's the management framework with a massive amount of support.

It is for this reason that Torch chose to write a thin wrapper over kubernetes for its distributed training library - https://pytorch.org/elastic/0.2.1/index.html

https://aws.amazon.com/blogs/containers/fault-tolerant-distributed-machine-learning-training-with-the-torchelastic-controller-for-kubernetes/

So I would suggest approaching this problem from a kubernetes-first standpoint. Compatibility with every framework is going to take up more of your time than you might be willing to spend.

Sincerely Sandeep


trivialfis commented 3 years ago

@sandys k8s support is in 1.3.

trivialfis commented 3 years ago

@sandys Use https://kubernetes.dask.org/en/latest/ to create the cluster, and train xgboost dask as usual.
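For concreteness, a minimal sketch of that workflow, assuming dask-kubernetes' classic KubeCluster API and a worker image that has xgboost installed; the image name and resource sizes are placeholders:

```python
import dask.array as da
import xgboost as xgb
from dask.distributed import Client
from dask_kubernetes import KubeCluster, make_pod_spec

pod_spec = make_pod_spec(
    image="daskdev/dask:latest",   # must have xgboost installed
    memory_limit="4G",
    memory_request="4G",
    cpu_limit=1,
    cpu_request=1,
)
cluster = KubeCluster(pod_spec)
cluster.scale(4)                   # four worker pods on the k8s cluster
client = Client(cluster)

# From here on, training is the usual xgboost.dask workflow.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.random((100_000,), chunks=(10_000,))

model = xgb.dask.DaskXGBRegressor(tree_method="hist", n_estimators=100)
model.client = client
model.fit(X, y)
```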

sandys commented 3 years ago

Yes, although dask has some independent issues with kubernetes on the scaling side.

Ray uses the kubernetes scheduler. Dask brings a centralised scheduler of its own.

In any case, I would strongly suggest xgboost pick one framework and play well with kubernetes, rather than expose bindings for many frameworks to work with xgboost.

Just my $0.02.


trivialfis commented 3 years ago

@sandys Thanks for your feedback. k8s is definitely important to us.

CarterFendley commented 1 year ago

Hi @trivialfis, thanks for your work on this. I'm just reading these threads now.

I see in the other issue you recommend using sklearn optimizers for out-of-core datasets and dask_ml optimizers for datasets which can fit in memory. I am wondering if you can expand on why xgboost.dask is incompatible with dask_ml at the moment.

What would it take to, in your words, "design python wrappers that can glue them together."?

trivialfis commented 1 year ago

Hi @CarterFendley, there is ongoing work on HPO. Could you please take a look at https://github.com/coiled/dask-xgboost-nyctaxi and see if it helps?