Hasna1994 opened this issue 2 years ago
Could you please share the xgboost version and attach the worker log?
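Something like the following, run from the client, should be enough to gather both (this assumes `client` is the `dask.distributed.Client` connected to your cluster; the version check and worker-log retrieval are standard distributed APIs):

```python
import xgboost, dask, distributed

# xgboost/dask/distributed versions on the client side.
print(xgboost.__version__, dask.__version__, distributed.__version__)

# Compare package versions across client, scheduler, and workers
# (raises on mismatch when check=True).
client.get_versions(check=True)

# Recent log lines from every worker, to attach to the issue.
logs = client.get_worker_logs()
```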
At line 64 of the worker log I see 'dask-worker Compute failed', but it seems to compute for a while and only then fails... My package dependencies are:
Any update on this?
Not yet. We will try to reproduce the error later, which can take some time.
Overview: I'm trying to run an XGBoost model on a set of parquet files sitting in S3 using Dask, by setting up a Fargate cluster and connecting a Dask client to it.
The dataframe totals about 140 GB of data. I scaled up a Fargate cluster with the following properties:
- Workers: 39
- Total threads: 156
- Total memory: 371.93 GiB

So there should be enough memory to hold the data; each worker has 9+ GiB with 4 threads. I do some very basic preprocessing and then create a DaskDMatrix, which does push the task bytes per worker a little high, but never above the threshold where it would fail.
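A rough sketch of the setup (the FargateCluster arguments, the use of dask_cloudprovider, and the S3 path below are placeholders, not my exact configuration):

```python
from dask.distributed import Client
from dask_cloudprovider.aws import FargateCluster  # placeholder: assumes dask_cloudprovider
import dask.dataframe as dd

# Placeholder resource values chosen to match the numbers above:
# 39 workers x 4 threads x ~9.5 GiB each.
cluster = FargateCluster(
    n_workers=39,
    worker_cpu=4096,   # 4 vCPUs per worker (Fargate CPU units)
    worker_mem=9728,   # ~9.5 GiB per worker, in MiB
)
client = Client(cluster)

# Read the parquet files directly from S3 (path is a placeholder).
df = dd.read_parquet("s3://my-bucket/path/to/data/")
```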
Next I run xgb.dask.train, which uses the xgboost package, not dask_ml.xgboost. Very quickly the workers die and I get the error XGBoostError: rabit/internal/utils.h:90: Allreduce failed. When I attempted this with a single file of only 17 MB, I still got the error, but only a couple of workers died. Does anyone know why this happens, given that I have more than double the memory needed for the dataframe?
```python
import xgboost as xgb

# Convert the (already split) dask collections to dask arrays.
X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()

# Build the distributed DMatrix from the training data.
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

# Train using the dask interface of the xgboost package (not dask_ml.xgboost).
output = xgb.dask.train(
    client,
    {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train")],
)
```
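For reference, the dict returned by xgb.dask.train contains the trained booster and the evaluation history, so on a successful run the result would be pulled out like this:

```python
booster = output["booster"]   # the trained xgboost.Booster
history = output["history"]   # evaluation metrics per dataset, e.g. history["train"]["rmse"]
```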
I can't provide a reproducible example because I ran all of this on AWS Fargate.