dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Integration of Distributed XGBoost on Modin #7094

Open prutskov opened 3 years ago

prutskov commented 3 years ago

Hi XGBoost!

I am from the Modin team. Modin provides efficient distributed DataFrames and has a distributed implementation of XGBoost.

XGBoost already supports Modin DataFrames, but currently the partitions of a Modin DataFrame are simply converted to numpy arrays and concatenated into one: https://github.com/dmlc/xgboost/blob/77f6cf2d134fb7b1fd9af7d80fecb300876015c8/python-package/xgboost/data.py#L244, so the potential parallelism across partitions goes unused.
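A minimal sketch of the behavior described above, with plain numpy arrays standing in for Modin partitions (the names here are illustrative, not XGBoost internals):

```python
import numpy as np

# Stand-ins for the partitions of a distributed Modin DataFrame.
partitions = [np.random.rand(4, 3) for _ in range(3)]

# The current behavior, roughly: materialize every partition to a
# numpy array and concatenate into a single array before building
# the DMatrix, so partitions cannot be consumed in parallel.
arrays = [np.asarray(p) for p in partitions]
full = np.concatenate(arrays, axis=0)
print(full.shape)  # (12, 3)
```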

Modin XGBoost is implemented with Ray as the distribution engine under the hood, but support for the other execution engines used in Modin (e.g. Dask) will be added as well. Training and inference happen in parallel across the partitions of a Modin DataFrame.
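The per-partition parallelism can be sketched with plain Python, using a thread pool as a stand-in for Ray actors; `train_on_partition` is a hypothetical placeholder, not Modin's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def train_on_partition(part):
    # Placeholder for per-partition boosting work; a real
    # implementation would run an XGBoost worker on each
    # partition, coordinated via the tracker/Rabit layer.
    X, y = part
    return X.shape[0]  # rows seen by this worker

rng = np.random.default_rng(0)
parts = [(rng.random((5, 2)), rng.random(5)) for _ in range(4)]

# Each partition is processed by its own worker in parallel.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(train_on_partition, parts))

print(sum(results))  # 20 rows processed across 4 workers
```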

The Modin team would like to start integrating Modin XGBoost into your repo, so that distributed Modin DataFrames are supported in the main xgboost package.

The high-level Modin XGBoost documentation can be found here. The developer's documentation with implementation details is here.

Are there any requirements for starting the integration?

trivialfis commented 3 years ago

other execution engines used in Modin (Dask e.g.) will be added as well

Can I assume this issue is about dask, since dask support is maintained in xgboost's source tree?

trivialfis commented 3 years ago

I will look into the example code you linked. Sorry for the premature reply.