dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.26k stars 8.72k forks

XGBoostError: rabit/internal/utils.h:90: Allreduce failed - Error while attempting XGboost on Dask Fargate Cluster in AWS #7868

Open Hasna1994 opened 2 years ago

Hasna1994 commented 2 years ago

Overview: I'm trying to run an XGBoost model with Dask on a bunch of parquet files sitting in S3, by setting up a Fargate cluster in AWS and connecting a Dask client to it.

The dataframe totals about 140 GB. I scaled up a Fargate cluster with the following properties:

- Workers: 39
- Total threads: 156
- Total memory: 371.93 GiB

So there should be enough memory to hold the data. Each worker has 9+ GiB of memory and 4 threads. I do some very basic preprocessing and then create a DaskDMatrix, which pushes the task bytes per worker a little high, but never above the threshold where it would fail.
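As a rough sanity check on the numbers above (a back-of-the-envelope sketch; the 140 GB figure is the approximate dataframe size from this post):

```python
# Back-of-the-envelope memory check for the cluster described above.
workers = 39
total_threads = 156
total_memory_gib = 371.93
dataframe_gb = 140  # approximate total size of the parquet data

per_worker_gib = total_memory_gib / workers
threads_per_worker = total_threads // workers
headroom = total_memory_gib / dataframe_gb

print(f"{per_worker_gib:.1f} GiB and {threads_per_worker} threads per worker")
print(f"cluster memory is {headroom:.1f}x the dataframe size")
```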

Next I run xgb.dask.train, which uses the xgboost package, not the dask_ml.xgboost package. Very quickly the workers die and I get the error XGBoostError: rabit/internal/utils.h:90: Allreduce failed. When I attempted this with a single file of only 17 MB, I still got this error, though only a couple of workers died. Does anyone know why this happens, given that I have more than double the memory of the dataframe?

```python
X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

output = xgb.dask.train(
    client,
    {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train")],
)
```

I can't provide a reproducible example because this all runs on AWS Fargate.

trivialfis commented 2 years ago

Could you please share the xgboost version and attach the worker log?
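For what it's worth, one quick way to gather the versions requested here (a sketch using importlib.metadata; the exact package list is an assumption about what is relevant to this issue):

```python
from importlib import metadata

# Report versions of the packages most relevant to this issue.
for pkg in ["xgboost", "dask", "distributed"]:
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```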

Hasna1994 commented 2 years ago

log-events-viewer-result.csv

In the log I see 'dask-worker Compute failed' at line 64, but then it seems to keep computing before eventually failing... My package dependencies are:

Hasna1994 commented 2 years ago

Any update on this?

trivialfis commented 2 years ago

Not yet. We will try to reproduce the error later, which can take some time.