Open prutskov opened 3 years ago
other execution engines used in Modin (Dask e.g.) will be added as well
Can I assume this issue is about dask? Since dask is maintained in xgboost's source tree.
I will look into the example code you linked. Sorry for the early reply.
Hi XGBoost!
I am from Modin team. Modin provides an efficient distributed DataFrames and has a distributed implementation of XGBoost.
XGBoost already has support of Modin DataFrames, but currently partitions of Modin DataFrame are just transformed to
numpy.array
-s and concatenated to one: https://github.com/dmlc/xgboost/blob/77f6cf2d134fb7b1fd9af7d80fecb300876015c8/python-package/xgboost/data.py#L244 and possible parallelization between partitions isn't used.Modin XGBoost is implemented with Ray distribution technology under the hood but support of the other execution engines used in Modin (Dask e.g.) will be added as well. Training and inference happens in parallel between partitions of Modin DataFrame.
Modin team wants to start integration of Modin XGBoost in your repo to have support of distributed Modin DataFrames in the main xgboost package.
The high-level Modin XGBoost documentation can be found here. The developer's documentation with implementation details is here.
Are there any requirements for starting the integration?