dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

XGBoost Dask Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory - after Optuna optimize 240 trials #9793

Open pangjac opened 10 months ago

pangjac commented 10 months ago

I am using XGBoost Dask to train a regression model.

I use Optuna to tune the hyperparameters. Once the objective function is defined, the snippet further below shows the typical Optuna tuning structure I use to find the parameters. I noticed that the maximum possible value of n_trials is 240: once Optuna reaches the 240th trial, I get a cudaErrorMemoryAllocation error:

xgboost.core.XGBoostError: [20:49:44] /workspace/src/tree/updater_gpu_hist.cu:781: Exception in gpu_hist: [20:49:44] /workspace/src/c_api/../data/../common/device_helpers.cuh:431: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory

  • Free memory: 1114832896
  • Requested memory: 1353570016

import optuna
import xgboost as xgb

# client, dvalid, y_test_dd and customized_mode_score_func are defined elsewhere.
def objective(trial):
    # init xgboost parameter
    output = xgb.dask.train(...)
    # The trained model
    bst = output['booster']
    preds = xgb.dask.predict(client, bst, dvalid)  # dask array object

    y_true = y_test_dd.to_dask_array(lengths=True)
    score = customized_mode_score_func(y_true, preds)
    return score

study = optuna.create_study(directions=["maximize"])
study.optimize(objective, n_trials=50, timeout=None, gc_after_trial=True, callbacks=[print_best_trial_so_far])
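
To put the figures in the error message quoted above into perspective, here is a minimal sketch that parses the "Free memory" and "Requested memory" values and converts them to GiB. The helper name parse_oom is hypothetical (it is not part of XGBoost), and the message text is copied from this report. It only does arithmetic on the reported numbers; it does not query the GPU.

```python
import re

# Error text copied from the report above (trimmed to the relevant lines).
ERROR_TEXT = (
    "Memory allocation error on worker 0: std::bad_alloc: "
    "cudaErrorMemoryAllocation: out of memory\n"
    "- Free memory: 1114832896\n"
    "- Requested memory: 1353570016\n"
)

GIB = 1024 ** 3  # bytes per GiB


def parse_oom(message):
    """Hypothetical helper: return (free_bytes, requested_bytes) from the message."""
    free = re.search(r"Free memory:\s*(\d+)", message)
    requested = re.search(r"Requested memory:\s*(\d+)", message)
    if free is None or requested is None:
        raise ValueError("message does not contain free/requested memory figures")
    return int(free.group(1)), int(requested.group(1))


free_bytes, requested_bytes = parse_oom(ERROR_TEXT)
shortfall = requested_bytes - free_bytes

print(f"free:      {free_bytes / GIB:.2f} GiB")
print(f"requested: {requested_bytes / GIB:.2f} GiB")
print(f"shortfall: {shortfall / GIB:.2f} GiB")
```

Under these figures, the failing allocation asked for about 1.26 GiB while only about 1.04 GiB was free on worker 0's GPU, so the request itself is small relative to the device and most of the GPU memory appears to have been in use already when the trial started.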

[Training Env] I am using an AWS EC2 g5.48xlarge instance, which is a multi-GPU machine (8 GPUs, 192 GB GPU memory, 192 vCPUs, 768 GB memory). When setting up the Dask client, I have

import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2,3,4,5,6,7'  # set CUDA_VISIBLE_DEVICES to the list of GPU IDs to use
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
# One worker per GPU
cluster = LocalCUDACluster(n_workers=8, threads_per_worker=4, CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7')
client = Client(cluster)  # the client passed to xgb.dask.train / xgb.dask.predict

The full error log from the Jupyter console is below.

2023-11-17 20:49:35,095 | INFO | 244
INFO:__main__:244
2023-11-17 20:49:35,097 | INFO | params: {'objective': 'reg:absoluteerror', 'tree_method': 'hist', 'device': 'cuda', 'booster': 'gbtree', 'lambda': 3.8906809944664755, 'alpha': 0.10710602818277858, 'subsample': 0.7890954253446751, 'colsample_bytree': 0.7907761494054936, 'eta': 0.27622915791457714, 'gamma': 0.02484356072037374, 'max_depth': 4, 'min_child_weight': 100, 'grow_policy': 'depthwise'}
INFO:__main__:params: {'objective': 'reg:absoluteerror', 'tree_method': 'hist', 'device': 'cuda', 'booster': 'gbtree', 'lambda': 3.8906809944664755, 'alpha': 0.10710602818277858, 'subsample': 0.7890954253446751, 'colsample_bytree': 0.7907761494054936, 'eta': 0.27622915791457714, 'gamma': 0.02484356072037374, 'max_depth': 4, 'min_child_weight': 100, 'grow_policy': 'depthwise'}
INFO:distributed.worker:Run out-of-band function '_start_tracker'
[20:49:35] task [xgboost.dask-0]:tcp://127.0.0.1:45261 got new rank 0
[20:49:35] task [xgboost.dask-1]:tcp://127.0.0.1:37833 got new rank 1
[20:49:35] task [xgboost.dask-2]:tcp://127.0.0.1:41457 got new rank 2
[20:49:35] task [xgboost.dask-3]:tcp://127.0.0.1:36157 got new rank 3
[20:49:35] task [xgboost.dask-4]:tcp://127.0.0.1:37061 got new rank 4
[20:49:35] task [xgboost.dask-5]:tcp://127.0.0.1:36381 got new rank 5
[20:49:35] task [xgboost.dask-6]:tcp://127.0.0.1:35999 got new rank 6
[20:49:35] task [xgboost.dask-7]:tcp://127.0.0.1:44107 got new rank 7
2023-11-17 20:49:44,269 - distributed.worker - WARNING - Compute Failed
Key:       dispatched_train-acb69aa6-b4d0-4445-af0c-a2f1ff4b43a5
Function:  dispatched_train
args:      ({'objective': 'reg:absoluteerror', 'tree_method': 'hist', 'device': 'cuda', 'booster': 'gbtree', 'lambda': 3.8906809944664755, 'alpha': 0.10710602818277858, 'subsample': 0.7890954253446751, 'colsample_bytree': 0.7907761494054936, 'eta': 0.27622915791457714, 'gamma': 0.02484356072037374, 'max_depth': 4, 'min_child_weight': 100, 'grow_policy': 'depthwise'}, {'DMLC_NUM_WORKER': 8, 'DMLC_TRACKER_URI': '100.74.118.28', 'DMLC_TRACKER_PORT': 44379}, 140372775166688, ['train', 'validation'], [140372775166688, 140372775166592], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'missing': -999999999, 'enable_categorical': False, 'parts': [{'data':          dr_extra_pay_sum_avg3m_all  ...  DR_BEHAVIOR_SEG
2367832                  283.656667  ...              1.0
2367833                  283.656667  ...              1.0
2367834                  283.656667  ...              1.0
2367835                  283.656667  ...              1.0
2367836                  283.656667  ... 
kwargs:    {}
Exception: "XGBoostError('[20:49:44] /workspace/src/tree/updater_gpu_hist.cu:781: Exception in gpu_hist: [20:49:44] /workspace/src/c_api/../data/../common/device_helpers.cuh:431: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory\\n- Free memory: 1114832896\\n- Requested memory: 1353570016\\n\\nStack trace:\\n  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x77f79a) [0x7fdf58b6379a]\\n  [bt] (1) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x783994) [0x7fdf58b67994]\\n  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x121f6c) [0x7fdf58505f6c]\\n  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x83f8f1) [0x7fdf58c238f1]\\n  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x83fef2) [0x7fdf58c23ef2]\\n  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x41589e) [0x7fdf587f989e]\\n  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb08e8c) [0x7fdf58eece8c]\\n  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb085c3) [0x7fdf58eec5c3]\\n  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb40297) [0x7fdf58f24297]\\n\\n\\n\\nStack trace:\\n  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb27f2a) [0x7fdf58f0bf2a]\\n  [bt] (1) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb485c9) [0x7fdf58f2c5c9]\\n  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x460c79) [0x7fdf58844c79]\\n  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x46176c) 
[0x7fdf5884576c]\\n  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4c54f7) [0x7fdf588a94f7]\\n  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7fdf58545ef0]\\n  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fe10040c9dd]\\n  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fe10040c067]\\n  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fe100424d39]\\n\\n')"

[W 2023-11-17 20:49:44,281] Trial 244 failed with parameters: {'booster': 'gbtree', 'lambda': 3.8906809944664755, 'alpha': 0.10710602818277858, 'subsample': 0.7890954253446751, 'colsample_bytree': 0.7907761494054936, 'eta': 0.27622915791457714, 'gamma': 0.02484356072037374, 'max_depth': 4, 'min_child_weight': 100, 'grow_policy': 'depthwise'} because of the following error: XGBoostError('[20:49:44] /workspace/src/tree/updater_gpu_hist.cu:781: Exception in gpu_hist: [20:49:44] /workspace/src/c_api/../data/../common/device_helpers.cuh:431: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory\n- Free memory: 1114832896\n- Requested memory: 1353570016\n\nStack trace:\n  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x77f79a) [0x7fdf58b6379a]\n  [bt] (1) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x783994) [0x7fdf58b67994]\n  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x121f6c) [0x7fdf58505f6c]\n  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x83f8f1) [0x7fdf58c238f1]\n  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x83fef2) [0x7fdf58c23ef2]\n  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x41589e) [0x7fdf587f989e]\n  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb08e8c) [0x7fdf58eece8c]\n  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb085c3) [0x7fdf58eec5c3]\n  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb40297) [0x7fdf58f24297]\n\n\n\nStack trace:\n  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb27f2a) [0x7fdf58f0bf2a]\n  [bt] (1) 
/opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb485c9) [0x7fdf58f2c5c9]\n  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x460c79) [0x7fdf58844c79]\n  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x46176c) [0x7fdf5884576c]\n  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4c54f7) [0x7fdf588a94f7]\n  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7fdf58545ef0]\n  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fe10040c9dd]\n  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fe10040c067]\n  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fe100424d39]\n\n').
Traceback (most recent call last):
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "/tmp/ipykernel_47920/3098887188.py", line 167, in objective_internal
    output = xgb.dask.train(client, params=param, dtrain=dtrain, num_boost_round=1000,
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/core.py", line 729, in inner_f
    return func(**kwargs)
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/dask.py", line 1079, in train
    return client.sync(
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/distributed/utils.py", line 349, in sync
    return sync(
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/distributed/utils.py", line 416, in sync
    raise exc.with_traceback(tb)
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/distributed/utils.py", line 389, in f
    result = yield future
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/dask.py", line 1015, in _train_async
    results = await map_worker_partitions(
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/dask.py", line 532, in map_worker_partitions
    results = await client.gather(futures)
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/distributed/client.py", line 2208, in _gather
    raise exception.with_traceback(traceback)
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/dask.py", line 986, in dispatched_train
    booster = worker_train(
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/core.py", line 729, in inner_f
    return func(**kwargs)
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/training.py", line 181, in train
    bst.update(dtrain, i, obj)
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/core.py", line 2049, in update
    _check_call(
  File "/opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/core.py", line 281, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [20:49:44] /workspace/src/tree/updater_gpu_hist.cu:781: Exception in gpu_hist: [20:49:44] /workspace/src/c_api/../data/../common/device_helpers.cuh:431: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory
- Free memory: 1114832896
- Requested memory: 1353570016

Stack trace:
  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x77f79a) [0x7fdf58b6379a]
  [bt] (1) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x783994) [0x7fdf58b67994]
  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x121f6c) [0x7fdf58505f6c]
  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x83f8f1) [0x7fdf58c238f1]
  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x83fef2) [0x7fdf58c23ef2]
  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x41589e) [0x7fdf587f989e]
  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb08e8c) [0x7fdf58eece8c]
  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb085c3) [0x7fdf58eec5c3]
  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb40297) [0x7fdf58f24297]

Stack trace:
  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb27f2a) [0x7fdf58f0bf2a]
  [bt] (1) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb485c9) [0x7fdf58f2c5c9]
  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x460c79) [0x7fdf58844c79]
  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x46176c) [0x7fdf5884576c]
  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4c54f7) [0x7fdf588a94f7]
  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7fdf58545ef0]
  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fe10040c9dd]
  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fe10040c067]
  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fe100424d39]

[W 2023-11-17 20:49:44,282] Trial 244 failed with value None.
2023-11-17 20:49:44,609 - distributed.utils_perf - WARNING - full garbage collections took 61% CPU time recently (threshold: 10%)
WARNING:distributed.utils_perf:full garbage collections took 61% CPU time recently (threshold: 10%)
2023-11-17 20:49:44,765 - distributed.worker - WARNING - Compute Failed
Key:       dispatched_train-f422435a-2c70-4428-ada3-c196c5d43f78
Function:  dispatched_train
args:      ({'objective': 'reg:absoluteerror', 'tree_method': 'hist', 'device': 'cuda', 'booster': 'gbtree', 'lambda': 3.8906809944664755, 'alpha': 0.10710602818277858, 'subsample': 0.7890954253446751, 'colsample_bytree': 0.7907761494054936, 'eta': 0.27622915791457714, 'gamma': 0.02484356072037374, 'max_depth': 4, 'min_child_weight': 100, 'grow_policy': 'depthwise'}, {'DMLC_NUM_WORKER': 8, 'DMLC_TRACKER_URI': '100.74.118.28', 'DMLC_TRACKER_PORT': 44379}, 140372775166688, ['train', 'validation'], [140372775166688, 140372775166592], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'missing': -999999999, 'enable_categorical': False, 'parts': [{'data':          dr_extra_pay_sum_avg3m_all  ...  DR_BEHAVIOR_SEG
2029571                  665.556667  ...              1.0
2029572                  286.333333  ...              3.0
2029573                  286.333333  ...              3.0
2029574                  286.333333  ...              3.0
2029575                  286.333333  ... 
kwargs:    {}
Exception: "XGBoostError('[20:49:44] /workspace/src/tree/updater_gpu_hist.cu:781: Exception in gpu_hist: [20:49:44] /workspace/rabit/include/rabit/internal/utils.h:86: Allreduce failed\\n\\nStack trace:\\n  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb27f2a) [0x7f91e33ecf2a]\\n  [bt] (1) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb485c9) [0x7f91e340d5c9]\\n  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x460c79) [0x7f91e2d25c79]\\n  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x46176c) [0x7f91e2d2676c]\\n  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4c54f7) [0x7f91e2d8a4f7]\\n  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7f91e2a26ef0]\\n  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f938a98f9dd]\\n  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f938a98f067]\\n  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f938a9a7d39]\\n\\n')"

2023-11-17 20:49:44,771 - distributed.worker - WARNING - Compute Failed
Key:       dispatched_train-3bbf3fb4-3df0-4bc2-93b6-bee1cbaa833f
Function:  dispatched_train
args:      ({'objective': 'reg:absoluteerror', 'tree_method': 'hist', 'device': 'cuda', 'booster': 'gbtree', 'lambda': 3.8906809944664755, 'alpha': 0.10710602818277858, 'subsample': 0.7890954253446751, 'colsample_bytree': 0.7907761494054936, 'eta': 0.27622915791457714, 'gamma': 0.02484356072037374, 'max_depth': 4, 'min_child_weight': 100, 'grow_policy': 'depthwise'}, {'DMLC_NUM_WORKER': 8, 'DMLC_TRACKER_URI': '100.74.118.28', 'DMLC_TRACKER_PORT': 44379}, 140372775166688, ['train', 'validation'], [140372775166688, 140372775166592], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'missing': -999999999, 'enable_categorical': False, 'parts': [{'data':         dr_extra_pay_sum_avg3m_all  ...  DR_BEHAVIOR_SEG
0                       -11.666667  ...              3.0
1                       -11.666667  ...              3.0
2                       -11.666667  ...              3.0
3                       -11.666667  ...              3.0
4                       -11.666667  ...       
kwargs:    {}
Exception: "XGBoostError('[20:49:44] /workspace/src/tree/updater_gpu_hist.cu:781: Exception in gpu_hist: [20:49:44] /workspace/rabit/include/rabit/internal/utils.h:86: Allreduce failed\\n\\nStack trace:\\n  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb27f2a) [0x7fba7991ff2a]\\n  [bt] (1) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb485c9) [0x7fba799405c9]\\n  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x460c79) [0x7fba79258c79]\\n  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x46176c) [0x7fba7925976c]\\n  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4c54f7) [0x7fba792bd4f7]\\n  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7fba78f59ef0]\\n  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fbc2313d9dd]\\n  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fbc2313d067]\\n  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fbc23155d39]\\n\\n')"

2023-11-17 20:49:44,775 - distributed.worker - WARNING - Compute Failed
Key:       dispatched_train-bcd0ce25-c69b-438d-8571-6ca3c6d2cbde
Function:  dispatched_train
args:      ({'objective': 'reg:absoluteerror', 'tree_method': 'hist', 'device': 'cuda', 'booster': 'gbtree', 'lambda': 3.8906809944664755, 'alpha': 0.10710602818277858, 'subsample': 0.7890954253446751, 'colsample_bytree': 0.7907761494054936, 'eta': 0.27622915791457714, 'gamma': 0.02484356072037374, 'max_depth': 4, 'min_child_weight': 100, 'grow_policy': 'depthwise'}, {'DMLC_NUM_WORKER': 8, 'DMLC_TRACKER_URI': '100.74.118.28', 'DMLC_TRACKER_PORT': 44379}, 140372775166688, ['train', 'validation'], [140372775166688, 140372775166592], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'missing': -999999999, 'enable_categorical': False, 'parts': [{'data':          dr_extra_pay_sum_avg3m_all  ...  DR_BEHAVIOR_SEG
676524                -1.000000e+09  ...              3.0
676525                -1.000000e+09  ...              3.0
676526                 7.990000e+02  ...              1.0
676527                 7.990000e+02  ...              1.0
676528                 7.990000e+02  ... 
kwargs:    {}
Exception: "XGBoostError('[20:49:44] /workspace/src/tree/updater_gpu_hist.cu:781: Exception in gpu_hist: [20:49:44] /workspace/rabit/include/rabit/internal/utils.h:86: Allreduce failed\\n\\nStack trace:\\n  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb27f2a) [0x7f17e5372f2a]\\n  [bt] (1) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb485c9) [0x7f17e53935c9]\\n  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x460c79) [0x7f17e4cabc79]\\n  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x46176c) [0x7f17
---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
Input In [36], in <module>
      1 # study = optuna.create_study(directions=["maximize"])
----> 2 study.optimize(objective, n_trials=50, timeout=None, gc_after_trial=True, callbacks=[print_best_trial_so_far])

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/optuna/study/study.py:451, in Study.optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar)
    348 def optimize(
    349     self,
    350     func: ObjectiveFuncType,
   (...)
    357     show_progress_bar: bool = False,
    358 ) -> None:
    359     """Optimize an objective function.
    360 
    361     Optimization is done by choosing a suitable set of hyperparameter values from a given
   (...)
    449             If nested invocation of this method occurs.
    450     """
--> 451     _optimize(
    452         study=self,
    453         func=func,
    454         n_trials=n_trials,
    455         timeout=timeout,
    456         n_jobs=n_jobs,
    457         catch=tuple(catch) if isinstance(catch, Iterable) else (catch,),
    458         callbacks=callbacks,
    459         gc_after_trial=gc_after_trial,
    460         show_progress_bar=show_progress_bar,
    461     )

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/optuna/study/_optimize.py:66, in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar)
     64 try:
     65     if n_jobs == 1:
---> 66         _optimize_sequential(
     67             study,
     68             func,
     69             n_trials,
     70             timeout,
     71             catch,
     72             callbacks,
     73             gc_after_trial,
     74             reseed_sampler_rng=False,
     75             time_start=None,
     76             progress_bar=progress_bar,
     77         )
     78     else:
     79         if n_jobs == -1:

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/optuna/study/_optimize.py:163, in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar)
    160         break
    162 try:
--> 163     frozen_trial = _run_trial(study, func, catch)
    164 finally:
    165     # The following line mitigates memory problems that can be occurred in some
    166     # environments (e.g., services that use computing containers such as GitHub Actions).
    167     # Please refer to the following PR for further details:
    168     # https://github.com/optuna/optuna/pull/325.
    169     if gc_after_trial:

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/optuna/study/_optimize.py:251, in _run_trial(study, func, catch)
    244         assert False, "Should not reach."
    246 if (
    247     frozen_trial.state == TrialState.FAIL
    248     and func_err is not None
    249     and not isinstance(func_err, catch)
    250 ):
--> 251     raise func_err
    252 return frozen_trial

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/optuna/study/_optimize.py:200, in _run_trial(study, func, catch)
    198 with get_heartbeat_thread(trial._trial_id, study._storage):
    199     try:
--> 200         value_or_values = func(trial)
    201     except exceptions.TrialPruned as e:
    202         # TODO(mamu): Handle multi-objective cases.
    203         state = TrialState.PRUNED

Input In [29], in objective_internal(trial, dtrain, dvalid)
    165 logger.info(f"params: {param}")
    166 # Train the model
--> 167 output = xgb.dask.train(client, params=param, dtrain=dtrain, num_boost_round=1000, 
    168                         evals=[(dtrain, "train"),(dvalid, "validation")], 
    169                         custom_metric=eval_metric_wrapper_xgboost_metrics,
    170                         callbacks=[
    171                             XGBLogging(epoch_log_interval=5), 
    172                             XGBCustomEarlyStoppingByMetricValueThreshold(stopping_on_data="validation", 
    173                                                                                 metric_name="eval_metric_nmae_absolute", stopping_metric_limit=2.0, stopping_ops=">"),
    174                             XGBCustomEarlyStoppingByMetricImprovement(   stopping_on_data="validation", 
    175                                                                              metric_name="eval_metric_nmae_absolute", stopping_rounds=10, stopping_ops=">")
    176                         ],
    177                         verbose_eval=True
    178                        )
    179 # The trained model
    180 bst = output['booster']

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/core.py:729, in require_keyword_args.<locals>.throw_if.<locals>.inner_f(*args, **kwargs)
    727 for k, arg in zip(sig.parameters, args):
    728     kwargs[k] = arg
--> 729 return func(**kwargs)

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/dask.py:1079, in train(client, params, dtrain, num_boost_round, evals, obj, feval, early_stopping_rounds, xgb_model, verbose_eval, callbacks, custom_metric)
   1077 client = _xgb_get_client(client)
   1078 args = locals()
-> 1079 return client.sync(
   1080     _train_async,
   1081     global_config=config.get_config(),
   1082     dconfig=_get_dask_config(),
   1083     **args,
   1084 )

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/distributed/utils.py:349, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    347     return future
    348 else:
--> 349     return sync(
    350         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    351     )

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/distributed/utils.py:416, in sync(loop, func, callback_timeout, *args, **kwargs)
    414 if error:
    415     typ, exc, tb = error
--> 416     raise exc.with_traceback(tb)
    417 else:
    418     return result

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/distributed/utils.py:389, in sync.<locals>.f()
    387         future = wait_for(future, callback_timeout)
    388     future = asyncio.ensure_future(future)
--> 389     result = yield future
    390 except Exception:
    391     error = sys.exc_info()

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/tornado/gen.py:762, in Runner.run(self)
    759 exc_info = None
    761 try:
--> 762     value = future.result()
    763 except Exception:
    764     exc_info = sys.exc_info()

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/dask.py:1015, in _train_async(client, global_config, dconfig, params, dtrain, num_boost_round, evals, obj, feval, early_stopping_rounds, verbose_eval, xgb_model, callbacks, custom_metric)
   1012     evals_name = []
   1013     evals_id = []
-> 1015 results = await map_worker_partitions(
   1016     client,
   1017     dispatched_train,
   1018     # extra function parameters
   1019     params,
   1020     _rabit_args,
   1021     id(dtrain),
   1022     evals_name,
   1023     evals_id,
   1024     *([dtrain] + evals_data),
   1025     # workers to be used for training
   1026     workers=workers,
   1027 )
   1028 return list(filter(lambda ret: ret is not None, results))[0]

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/dask.py:532, in map_worker_partitions(client, func, workers, *refs)
    528     fut = client.submit(
    529         func, *args, pure=False, workers=[addr], allow_other_workers=False
    530     )
    531     futures.append(fut)
--> 532 results = await client.gather(futures)
    533 return results

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/distributed/client.py:2208, in Client._gather(self, futures, errors, direct, local_worker)
   2206         exc = CancelledError(key)
   2207     else:
-> 2208         raise exception.with_traceback(traceback)
   2209     raise exc
   2210 if errors == "skip":

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/dask.py:986, in dispatched_train()
    983         eval_Xy = _dmatrix_from_list_of_parts(**ref, nthread=n_threads)
    984     evals.append((eval_Xy, evals_name[i]))
--> 986 booster = worker_train(
    987     params=local_param,
    988     dtrain=Xy,
    989     num_boost_round=num_boost_round,
    990     evals_result=local_history,
    991     evals=evals if len(evals) != 0 else None,
    992     obj=obj,
    993     feval=feval,
    994     custom_metric=custom_metric,
    995     early_stopping_rounds=early_stopping_rounds,
    996     verbose_eval=verbose_eval,
    997     xgb_model=xgb_model,
    998     callbacks=callbacks,
    999 )
   1000 # Don't return the boosters from empty workers. It's quite difficult to
   1001 # guarantee everything is in sync in the present of empty workers,
   1002 # especially with complex objectives like quantile.
   1003 return _filter_empty(booster, local_history, Xy.num_row() != 0)

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/core.py:729, in inner_f()
    727 for k, arg in zip(sig.parameters, args):
    728     kwargs[k] = arg
--> 729 return func(**kwargs)

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/training.py:181, in train()
    179 if cb_container.before_iteration(bst, i, dtrain, evals):
    180     break
--> 181 bst.update(dtrain, i, obj)
    182 if cb_container.after_iteration(bst, i, dtrain, evals):
    183     break

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/core.py:2049, in update()
   2046 self._assign_dmatrix_features(dtrain)
   2048 if fobj is None:
-> 2049     _check_call(
   2050         _LIB.XGBoosterUpdateOneIter(
   2051             self.handle, ctypes.c_int(iteration), dtrain.handle
   2052         )
   2053     )
   2054 else:
   2055     pred = self.predict(dtrain, output_margin=True, training=True)

File /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/core.py:281, in _check_call()
    270 """Check the return value of C API call
    271 
    272 This function will raise exception when error occurs.
   (...)
    278     return value from API calls
    279 """
    280 if ret != 0:
--> 281     raise XGBoostError(py_str(_LIB.XGBGetLastError()))

XGBoostError: [20:49:44] /workspace/src/tree/updater_gpu_hist.cu:781: Exception in gpu_hist: [20:49:44] /workspace/src/c_api/../data/../common/device_helpers.cuh:431: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory
- Free memory: 1114832896
- Requested memory: 1353570016

Stack trace:
  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x77f79a) [0x7fdf58b6379a]
  [bt] (1) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x783994) [0x7fdf58b67994]
  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x121f6c) [0x7fdf58505f6c]
  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x83f8f1) [0x7fdf58c238f1]
  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x83fef2) [0x7fdf58c23ef2]
  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x41589e) [0x7fdf587f989e]
  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb08e8c) [0x7fdf58eece8c]
  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb085c3) [0x7fdf58eec5c3]
  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb40297) [0x7fdf58f24297]

Stack trace:
  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb27f2a) [0x7fdf58f0bf2a]
  [bt] (1) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb485c9) [0x7fdf58f2c5c9]
  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x460c79) [0x7fdf58844c79]
  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x46176c) [0x7fdf5884576c]
  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4c54f7) [0x7fdf588a94f7]
  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7fdf58545ef0]
  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fe10040c9dd]
  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fe10040c067]
  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fe100424d39]

e4cac76c]\\n  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4c54f7) [0x7f17e4d104f7]\\n  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7f17e49acef0]\\n  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f198cae69dd]\\n  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f198cae6067]\\n  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f198cafed39]\\n\\n')"

2023-11-17 20:49:44,779 - distributed.worker - WARNING - Compute Failed
Key:       dispatched_train-8ff25287-3626-4781-a18d-61e624f1707e
Function:  dispatched_train
args:      ({'objective': 'reg:absoluteerror', 'tree_method': 'hist', 'device': 'cuda', 'booster': 'gbtree', 'lambda': 3.8906809944664755, 'alpha': 0.10710602818277858, 'subsample': 0.7890954253446751, 'colsample_bytree': 0.7907761494054936, 'eta': 0.27622915791457714, 'gamma': 0.02484356072037374, 'max_depth': 4, 'min_child_weight': 100, 'grow_policy': 'depthwise'}, {'DMLC_NUM_WORKER': 8, 'DMLC_TRACKER_URI': '100.74.118.28', 'DMLC_TRACKER_PORT': 44379}, 140372775166688, ['train', 'validation'], [140372775166688, 140372775166592], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'missing': -999999999, 'enable_categorical': False, 'parts': [{'data':          dr_extra_pay_sum_avg3m_all  ...  DR_BEHAVIOR_SEG
1353048                  272.666667  ...              3.0
1353049                  272.666667  ...              3.0
1353050                  272.666667  ...              3.0
1353051                  272.666667  ...              3.0
1353052                  272.666667  ... 
kwargs:    {}
Exception: "XGBoostError('[20:49:44] /workspace/src/tree/updater_gpu_hist.cu:781: Exception in gpu_hist: [20:49:44] /workspace/rabit/include/rabit/internal/utils.h:86: Allreduce failed\\n\\nStack trace:\\n  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb27f2a) [0x7f01434bbf2a]\\n  [bt] (1) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb485c9) [0x7f01434dc5c9]\\n  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x460c79) [0x7f0142df4c79]\\n  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x46176c) [0x7f0142df576c]\\n  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4c54f7) [0x7f0142e594f7]\\n  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7f0142af5ef0]\\n  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f02eaa6d9dd]\\n  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f02eaa6d067]\\n  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f02eaa85d39]\\n\\n')"

2023-11-17 20:49:44,789 - distributed.worker - WARNING - Compute Failed
Key:       dispatched_train-967638ba-e972-406d-b96f-f1fda81e9f93
Function:  dispatched_train
args:      ({'objective': 'reg:absoluteerror', 'tree_method': 'hist', 'device': 'cuda', 'booster': 'gbtree', 'lambda': 3.8906809944664755, 'alpha': 0.10710602818277858, 'subsample': 0.7890954253446751, 'colsample_bytree': 0.7907761494054936, 'eta': 0.27622915791457714, 'gamma': 0.02484356072037374, 'max_depth': 4, 'min_child_weight': 100, 'grow_policy': 'depthwise'}, {'DMLC_NUM_WORKER': 8, 'DMLC_TRACKER_URI': '100.74.118.28', 'DMLC_TRACKER_PORT': 44379}, 140372775166688, ['train', 'validation'], [140372775166688, 140372775166592], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'missing': -999999999, 'enable_categorical': False, 'parts': [{'data':          dr_extra_pay_sum_avg3m_all  ...  DR_BEHAVIOR_SEG
1014786                    0.000000  ...              3.0
1014787                    0.000000  ...              3.0
1014788                    0.000000  ...              3.0
1014789                    0.000000  ...              3.0
1014790                    0.000000  ... 
kwargs:    {}
Exception: "XGBoostError('[20:49:44] /workspace/src/tree/updater_gpu_hist.cu:781: Exception in gpu_hist: [20:49:44] /workspace/rabit/include/rabit/internal/utils.h:86: Allreduce failed\\n\\nStack trace:\\n  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb27f2a) [0x7f82d1ffcf2a]\\n  [bt] (1) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb485c9) [0x7f82d201d5c9]\\n  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x460c79) [0x7f82d1935c79]\\n  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x46176c) [0x7f82d193676c]\\n  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4c54f7) [0x7f82d199a4f7]\\n  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7f82d1636ef0]\\n  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f84795df9dd]\\n  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f84795df067]\\n  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f84795f7d39]\\n\\n')"

2023-11-17 20:49:44,846 - distributed.worker - WARNING - Compute Failed
Key:       dispatched_train-8fbd197d-5896-4129-9b70-4f4555acf6e8
Function:  dispatched_train
args:      ({'objective': 'reg:absoluteerror', 'tree_method': 'hist', 'device': 'cuda', 'booster': 'gbtree', 'lambda': 3.8906809944664755, 'alpha': 0.10710602818277858, 'subsample': 0.7890954253446751, 'colsample_bytree': 0.7907761494054936, 'eta': 0.27622915791457714, 'gamma': 0.02484356072037374, 'max_depth': 4, 'min_child_weight': 100, 'grow_policy': 'depthwise'}, {'DMLC_NUM_WORKER': 8, 'DMLC_TRACKER_URI': '100.74.118.28', 'DMLC_TRACKER_PORT': 44379}, 140372775166688, ['train', 'validation'], [140372775166688, 140372775166592], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'missing': -999999999, 'enable_categorical': False, 'parts': [{'data':          dr_extra_pay_sum_avg3m_all  ...  DR_BEHAVIOR_SEG
1691310                 2508.626667  ...              1.0
1691311                 2508.626667  ...              1.0
1691312                 2508.626667  ...              1.0
1691313                 2508.626667  ...              1.0
1691314                 2508.626667  ... 
kwargs:    {}
Exception: "XGBoostError('[20:49:44] /workspace/src/tree/updater_gpu_hist.cu:781: Exception in gpu_hist: [20:49:44] /workspace/rabit/include/rabit/internal/utils.h:86: Allreduce failed\\n\\nStack trace:\\n  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb27f2a) [0x7fa48ff49f2a]\\n  [bt] (1) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb485c9) [0x7fa48ff6a5c9]\\n  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x460c79) [0x7fa48f882c79]\\n  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x46176c) [0x7fa48f88376c]\\n  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4c54f7) [0x7fa48f8e74f7]\\n  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7fa48f583ef0]\\n  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fa637ab99dd]\\n  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fa637ab9067]\\n  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fa637ad1d39]\\n\\n')"

2023-11-17 20:49:44,886 - distributed.worker - WARNING - Compute Failed
Key:       dispatched_train-6509836c-d655-44a5-b537-afca7e1b7d91
Function:  dispatched_train
args:      ({'objective': 'reg:absoluteerror', 'tree_method': 'hist', 'device': 'cuda', 'booster': 'gbtree', 'lambda': 3.8906809944664755, 'alpha': 0.10710602818277858, 'subsample': 0.7890954253446751, 'colsample_bytree': 0.7907761494054936, 'eta': 0.27622915791457714, 'gamma': 0.02484356072037374, 'max_depth': 4, 'min_child_weight': 100, 'grow_policy': 'depthwise'}, {'DMLC_NUM_WORKER': 8, 'DMLC_TRACKER_URI': '100.74.118.28', 'DMLC_TRACKER_PORT': 44379}, 140372775166688, ['train', 'validation'], [140372775166688, 140372775166592], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'missing': -999999999, 'enable_categorical': False, 'parts': [{'data':         dr_extra_pay_sum_avg3m_all  ...  DR_BEHAVIOR_SEG
338262                4.616067e+02  ...              1.0
338263                4.632833e+02  ...              1.0
338264                4.632833e+02  ...              1.0
338265                4.632833e+02  ...              1.0
338266                4.632833e+02  ...       
kwargs:    {}
Exception: "XGBoostError('[20:49:44] /workspace/src/tree/updater_gpu_hist.cu:781: Exception in gpu_hist: [20:49:44] /workspace/rabit/include/rabit/internal/utils.h:86: Allreduce failed\\n\\nStack trace:\\n  [bt] (0) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb27f2a) [0x7fe82a6f8f2a]\\n  [bt] (1) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xb485c9) [0x7fe82a7195c9]\\n  [bt] (2) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x460c79) [0x7fe82a031c79]\\n  [bt] (3) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x46176c) [0x7fe82a03276c]\\n  [bt] (4) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4c54f7) [0x7fe82a0964f7]\\n  [bt] (5) /opt/omniai/software/Miniconda/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7fe829d32ef0]\\n  [bt] (6) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fe9d1d429dd]\\n  [bt] (7) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fe9d1d42067]\\n  [bt] (8) /opt/omniai/software/Miniconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fe9d1d5ad39]\\n\\n')"

trivialfis commented 10 months ago

Hi, are you using DaskQuantileDMatrix?

pangjac commented 10 months ago

Hi, no. I use the regular DaskDMatrix. dtrain and dvalid are constructed as follows:

import dask.array as da
import dask.dataframe as dd
# tra and itv are regular pandas dataframe
# `feature_list` is a list of strings of feature names
# `dependent` is the string of target variable name
tra = tra.fillna(-999999999)
itv = itv.fillna(-999999999)
X_train = tra[feature_list]
y_train = tra[dependent]
X_test = itv[feature_list]
y_test = itv[dependent]

# construct dask dataframe from regular pandas dataframe 
X_train_dd = dd.from_pandas(X_train, npartitions=8) #set npartitions=8 because GPU 8
y_train_dd = dd.from_pandas(y_train, npartitions=8)
X_test_dd = dd.from_pandas(X_test, npartitions=8)
y_test_dd = dd.from_pandas(y_test, npartitions=8)

# construct DaskDMatrix
import xgboost as xgb
dtrain = xgb.dask.DaskDMatrix(client, data=X_train_dd, label=y_train_dd, missing=-999999999)
dvalid = xgb.dask.DaskDMatrix(client, data=X_test_dd, label=y_test_dd, missing=-999999999)

trivialfis commented 10 months ago

Memory usage issues can happen when there are too many experiments running concurrently. Let's see if we can mitigate it somehow.

tra = tra.fillna(-999999999)

I'm not sure why this is necessary. XGBoost can handle missing values (NaN) natively. You might have just created an intermediate copy of the data.

X_train_dd = dd.from_pandas(X_train, npartitions=8) #set npartitions=8 because GPU 8

Feel free to use more partitions for finer granularity in scheduling. I don't think Dask can perform at its best when the number of partitions is strictly the same as the number of GPUs. Every operation has to be done on an entire partition, which can be memory-hungry.

X_train_dd = dd.from_pandas(X_train, npartitions=8) #set npartitions=8 because GPU 8

Also, it's usually considered best practice not to use from_pandas. Instead, prefer using Dask from the beginning, which avoids concentrating the data in one place and the large data transfer that follows.

dtrain = xgb.dask.DaskDMatrix(client, data=X_train_dd, label=y_train_dd, missing=-999999999)

Consider using DaskQuantileDMatrix, which saves a significant amount of memory when you are using the hist tree method.