I received an error when doing distributed training on GPU through Ray with the following XGBoost parameter configuration (pulled from my comet.ml logs):
```
COMET INFO: xgboost.early_stopping_num_rounds : 600
COMET INFO: xgboost.eval_metric : auc
COMET INFO: xgboost.feature_importance_method : SHAP_MEAN
COMET INFO: xgboost.learn_rate : 0.009999999776482582
COMET INFO: xgboost.max_count : 6500
COMET INFO: xgboost.max_depth : 4
COMET INFO: xgboost.optimization_objective : binary:logistic
COMET INFO: xgboost.scale_pos_weight : 124.0
COMET INFO: xgboost.subsample_ratio : 0.800000011920929
COMET INFO: xgboost.tree_method : gpu_hist
COMET INFO: xgboost.type : classification
```
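For readability, here is roughly how I believe those Comet keys map onto plain XGBoost parameters. The key names are Comet's renamed views, so mappings like `learn_rate` → `eta` and `max_count` → `num_boost_round` are my assumption, not verbatim code:

```python
# Rough reconstruction of the XGBoost parameters implied by the Comet log above.
# The key-name mapping is assumed, not copied from my actual code.
xgb_params = {
    "objective": "binary:logistic",   # logged as optimization_objective
    "eval_metric": "auc",
    "eta": 0.01,                      # logged as learn_rate
    "max_depth": 4,
    "subsample": 0.8,                 # logged as subsample_ratio
    "scale_pos_weight": 124.0,
    "tree_method": "gpu_hist",        # GPU training, which is where the NCCL path is hit
}
num_boost_round = 6500                # assumed to correspond to max_count
early_stopping_rounds = 600           # logged as early_stopping_num_rounds
```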
The Ray cluster's compute configuration is roughly the following:
```
head_num_gpus: 0
head_sku: None
host_docker: None
init_command_line: None
is_prod: True
max_workers: 11
memory_size_mb: 28000
min_workers: 10
no_cache: True
num_cpus: 2
num_gpus: 1
```
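In xgboost_ray terms, this cluster shape corresponds roughly to the `RayParams` below. The actor count is an assumption based on `min_workers`/`max_workers`; each worker node exposes 2 CPUs and 1 GPU:

```python
from xgboost_ray import RayParams

# Approximate xgboost_ray resource settings matching the cluster above.
# num_actors=10 is an assumption based on min_workers=10 / max_workers=11.
ray_params = RayParams(
    num_actors=10,      # one training actor per worker node
    cpus_per_actor=2,   # num_cpus per worker
    gpus_per_actor=1,   # num_gpus per worker
)
```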
The following is the relevant error log from Ray that produced this issue:
```
(run_ray_remote pid=842, ip=...) Traceback (most recent call last):
(run_ray_remote pid=842, ip=...)   File "/usr/lib/python3.6/site-packages/xgboost_ray/main.py", line 1097, in _train
(run_ray_remote pid=842, ip=...)     ray.get(ready)
(run_ray_remote pid=842, ip=...)   File "/usr/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(run_ray_remote pid=842, ip=...)     return func(*args, **kwargs)
(run_ray_remote pid=842, ip=...)   File "/usr/lib/python3.6/site-packages/ray/worker.py", line 1934, in get
(run_ray_remote pid=842, ip=...)     raise value.as_instanceof_cause()
(run_ray_remote pid=842, ip=...) ray.exceptions.RayTaskError(RayXGBoostTrainingError): ray::_RemoteRayXGBoostActor.train() (pid=842, ip=.., repr=<xgboost_ray.main._RemoteRayXGBoostActor object at 0x7f39dce89080>)
(run_ray_remote pid=842, ip=...)   File "/usr/lib/python3.6/site-packages/xgboost/training.py", line 196, in train
(run_ray_remote pid=842, ip=...)     early_stopping_rounds=early_stopping_rounds)
(run_ray_remote pid=842, ip=...)   File "/usr/lib/python3.6/site-packages/xgboost/training.py", line 81, in _train_internal
(run_ray_remote pid=842, ip=...)     bst.update(dtrain, i, obj)
(run_ray_remote pid=842, ip=...)   File "/usr/lib/python3.6/site-packages/xgboost/core.py", line 1682, in update
(run_ray_remote pid=842, ip=...)     dtrain.handle))
(run_ray_remote pid=842, ip=...)   File "/usr/lib/python3.6/site-packages/xgboost/core.py", line 218, in _check_call
(run_ray_remote pid=842, ip=...)     raise XGBoostError(py_str(_LIB.XGBGetLastError()))
(run_ray_remote pid=842, ip=...) xgboost.core.XGBoostError: [04:30:24] ../src/tree/updater_gpu_hist.cu:770: Exception in gpu_hist: [04:30:24] ../src/common/device_helpers.cuh:132: NCCL failure :unhandled system error ../src/common/device_helpers.cu(67)
(run_ray_remote pid=842, ip=...) Stack trace:
(run_ray_remote pid=842, ip=...)   [bt] (0) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x31ddcd) [0x7f37948dfdcd]
(run_ray_remote pid=842, ip=...)   [bt] (1) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x320f29) [0x7f37948e2f29]
(run_ray_remote pid=842, ip=...)   [bt] (2) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x31f12a) [0x7f37948e112a]
(run_ray_remote pid=842, ip=...)   [bt] (3) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x4d7f42) [0x7f3794a99f42]
(run_ray_remote pid=842, ip=...)   [bt] (4) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x4e1d26) [0x7f3794aa3d26]
(run_ray_remote pid=842, ip=...)   [bt] (5) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x17d683) [0x7f379473f683]
(run_ray_remote pid=842, ip=...)   [bt] (6) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x17e207) [0x7f3794740207]
(run_ray_remote pid=842, ip=...)   [bt] (7) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x1b49fa) [0x7f37947769fa]
(run_ray_remote pid=842, ip=...)   [bt] (8) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x68) [0x7f379465b4f8]
(run_ray_remote pid=842, ip=...)
(run_ray_remote pid=842, ip=...)
(run_ray_remote pid=842, ip=...)
(run_ray_remote pid=842, ip=...) Stack trace:
(run_ray_remote pid=842, ip=...)   [bt] (0) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x4c2fe9) [0x7f3794a84fe9]
(run_ray_remote pid=842, ip=...)   [bt] (1) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x4e207f) [0x7f3794aa407f]
(run_ray_remote pid=842, ip=...)   [bt] (2) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x17d683) [0x7f379473f683]
(run_ray_remote pid=842, ip=...)   [bt] (3) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x17e207) [0x7f3794740207]
(run_ray_remote pid=842, ip=...)   [bt] (4) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x1b49fa) [0x7f37947769fa]
(run_ray_remote pid=842, ip=...)   [bt] (5) /usr/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x68) [0x7f379465b4f8]
(run_ray_remote pid=842, ip=...)   [bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f39e8b958ee]
(run_ray_remote pid=842, ip=...)   [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f39e8b952bf]
(run_ray_remote pid=842, ip=...)   [bt] (8) /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x466) [0x7f39e8bb24c6]
(run_ray_remote pid=842, ip=...)
(run_ray_remote pid=842, ip=...)
(run_ray_remote pid=842, ip=...)
(run_ray_remote pid=842, ip=...) The above exception was the direct cause of the following exception:
(run_ray_remote pid=842, ip=...)
(run_ray_remote pid=842, ip=...) ray::_RemoteRayXGBoostActor.train() (pid=842, ip=.., repr=<xgboost_ray.main._RemoteRayXGBoostActor object at 0x7f39dce89080>)
(run_ray_remote pid=842, ip=...)   File "/usr/lib/python3.6/site-packages/xgboost_ray/main.py", line 654, in train
(run_ray_remote pid=842, ip=...)     raise RayXGBoostTrainingError("Training failed.") from raise_from
(run_ray_remote pid=842, ip=...) xgboost_ray.main.RayXGBoostTrainingError: Training failed.
(run_ray_remote pid=842, ip=...)
(run_ray_remote pid=842, ip=...) The above exception was the direct cause of the following exception:
(run_ray_remote pid=842, ip=...)
(run_ray_remote pid=842, ip=...) Traceback (most recent call last):
(run_ray_remote pid=842, ip=...)   File "/usr/lib/python3.6/site-packages/xgboost_ray/main.py", line 1417, in train
(run_ray_remote pid=842, ip=...)     **kwargs)
(run_ray_remote pid=842, ip=...)   File "/usr/lib/python3.6/site-packages/xgboost_ray/main.py", line 1120, in _train
(run_ray_remote pid=842, ip=...)     raise RayActorError from exc
(run_ray_remote pid=842, ip=...) ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
(run_ray_remote pid=842, ip=...)
(run_ray_remote pid=842, ip=...) The above exception was the direct cause of the following exception:
```
XGBoost Version=1.5.2
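For context, here is a minimal sketch of how the training is invoked through xgboost_ray to reach the `bst.update` call in the traceback. Data loading is simplified: the synthetic `X`/`y` below are placeholders rather than my real dataset, and `xgb_params`/`ray_params` refer to the sketches above.

```python
import numpy as np
from xgboost_ray import RayDMatrix, train

# Synthetic placeholder data so the sketch is self-contained;
# the real job loads a large, heavily imbalanced binary-classification dataset.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

dtrain = RayDMatrix(X, y)

bst = train(
    xgb_params,                                  # params dict sketched above
    dtrain,
    num_boost_round=num_boost_round,             # 6500
    early_stopping_rounds=early_stopping_rounds, # 600
    evals=[(dtrain, "train")],
    ray_params=ray_params,                       # RayParams sketched above
)
```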
Could you please advise on what might be causing this NCCL failure and how to resolve it?
Closing as stale.