CDonnerer / xgboost-distribution

Probabilistic prediction with XGBoost.

Dask support #78

Open hugocool opened 1 year ago

hugocool commented 1 year ago

I love the xgboost-distribution package and what it enables. However, when dealing with datasets or trees that do not fit into memory, one needs to scale the task with a distributed framework like Dask.

Dask already supports xgboost natively via the sklearn API, and since xgboost-distribution relies on the original xgboost, I thought it would be quite easy to swap the underlying booster for a distributed one, as the API would be almost identical:

from distributed import LocalCluster, Client
import xgboost as xgb

def main(client: Client) -> None:
    X, y = load_data()  # placeholder: should return dask arrays/dataframes
    regr = xgb.dask.DaskXGBRegressor(n_estimators=100, tree_method="gpu_hist")
    regr.client = client  # assign the dask client to the estimator
    regr.fit(X, y, eval_set=[(X, y)])
    preds = regr.predict(X)

if __name__ == "__main__":
    with LocalCluster() as cluster, Client(cluster) as client:
        main(client)

This problem also pops up when you want to use federated learning, in which case one would like to use a federated booster.

So my question is, would it be possible to swap the underlying xgboost booster in xgboost-distribution for the aforementioned xgb.dask.DaskXGBRegressor?
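For what it's worth, xgboost's native Dask interface (xgb.dask.train / xgb.dask.DaskDMatrix) also accepts a custom objective, which is presumably the hook a Dask-backed XGBDistribution would need in order to plug in its distributional gradients. The sketch below only illustrates that mechanism with a plain squared-error objective on random placeholder data; it is not the objective xgboost-distribution actually uses, and it assumes the obj argument of xgb.dask.train behaves like its single-node counterpart.

import numpy as np
import dask.array as da
import xgboost as xgb
from distributed import LocalCluster, Client


def squared_error_obj(predt, dtrain):
    # Placeholder objective: gradient and hessian of squared error.
    # A distributional objective would return its (natural) gradients here instead.
    y = dtrain.get_label()
    grad = predt - y
    hess = np.ones_like(predt)
    return grad, hess


def main(client: Client) -> None:
    # Toy chunked data standing in for an out-of-core dataset.
    X = da.random.random((100_000, 10), chunks=(10_000, 10))
    y = da.random.random(100_000, chunks=10_000)

    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    output = xgb.dask.train(
        client,
        {"tree_method": "hist"},
        dtrain,
        num_boost_round=100,
        obj=squared_error_obj,  # custom objective evaluated on each worker's partition
    )
    preds = xgb.dask.predict(client, output["booster"], X)
    print(preds.compute()[:5])


if __name__ == "__main__":
    with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
        main(client)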

CDonnerer commented 1 year ago

Hi, thanks for raising this. Just to understand the use case: you would like to train xgboost-distribution on datasets that do not fit in memory?

I'll take a look into how feasible this is.

hugocool commented 1 year ago

yes, exactly!