dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.26k stars 8.72k forks

XGBoostError: rabit/internal/utils.h:90: Allreduce failed - Error while attempting XGboost on Dask Fargate Cluster in AWS #7868

Open Hasna1994 opened 2 years ago

Hasna1994 commented 2 years ago

Overview: I'm trying to run an XGBoost model with Dask on a bunch of parquet files sitting in S3, by setting up a Fargate cluster in AWS and connecting a Dask client to it.

The dataframe totals about 140 GB. I scaled up a Fargate cluster with the following properties:

- Workers: 39
- Total threads: 156
- Total memory: 371.93 GiB

So there should be enough memory to hold the data. Each worker has 9+ GiB of memory and 4 threads. I do some very basic preprocessing and then create a DaskDMatrix, which pushes the task bytes per worker a little high, but never above the threshold where it would fail.
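As a rough sanity check on the numbers above (a back-of-the-envelope sketch; the 140 GB figure is the approximate dataframe size from this post):

```python
# Back-of-the-envelope memory check for the cluster described above.
workers = 39
total_threads = 156
total_memory_gib = 371.93
dataframe_gb = 140  # approximate total size of the parquet data

per_worker_gib = total_memory_gib / workers
threads_per_worker = total_threads // workers
headroom = total_memory_gib / dataframe_gb

print(f"{per_worker_gib:.1f} GiB and {threads_per_worker} threads per worker")
print(f"cluster memory is {headroom:.1f}x the dataframe size")
```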

Next I run xgb.dask.train, which uses the xgboost package, not the dask_ml.xgboost package. Very quickly the workers die and I get the error XGBoostError: rabit/internal/utils.h:90: Allreduce failed. When I attempted this with a single file of only 17 MB, I still got this error, though only a couple of workers died. Does anyone know why this happens, given that I have more than double the memory of the dataframe?

```python
X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

output = xgb.dask.train(
    client,
    {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train")],
)
```

I can't provide a reproducible example because this all runs on AWS Fargate.

trivialfis commented 2 years ago

Could you please share the xgboost version and attach the worker log?
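For what it's worth, one quick way to gather the versions requested here (a sketch using importlib.metadata; the exact package list is an assumption about what is relevant to this issue):

```python
from importlib import metadata

# Report versions of the packages most relevant to this issue.
for pkg in ["xgboost", "dask", "distributed"]:
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```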

Hasna1994 commented 2 years ago

log-events-viewer-result.csv

In the log I see 'dask-worker Compute failed' at line 64, but then it seems to keep computing before eventually failing... My package dependencies are:

Hasna1994 commented 2 years ago

Any update on this?

trivialfis commented 2 years ago

Not yet. We will try to reproduce the error later, which can take some time.