OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

Failed to update weights to vLLM #313

Closed thirteenflt closed 4 weeks ago

thirteenflt commented 4 weeks ago

Does anyone have any clue about this error?

(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148] Traceback (most recent call last):
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     return func(*args, **kwargs)
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/vllm/worker/worker.py", line 286, in start_worker_execution_loop
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     while self._execute_model_non_driver():
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/vllm/worker/worker.py", line 295, in _execute_model_non_driver
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     data = broadcast_tensor_dict(src=0)
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/vllm/distributed/communication_op.py", line 284, in broadcast_tensor_dict
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     torch.distributed.broadcast_object_list(recv_metadata_list,
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     return func(*args, **kwargs)
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     broadcast(object_sizes_tensor, src=src, group=group)
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     return func(*args, **kwargs)
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     work.wait()
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
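
The timeout comes from the Gloo transport used by `broadcast_object_list`: the non-driver vLLM worker sits in `start_worker_execution_loop` waiting on `broadcast_tensor_dict(src=0)`, and if the driver rank never issues the matching broadcast (for example because it hung or failed during the weight update), the receiver hits Gloo's default 1800000 ms (30 minute) timeout. A minimal, self-contained sketch of that failure mode (illustrative only; the address, port, and 10 second timeout are assumptions, not OpenRLHF or vLLM code):

```python
# Minimal sketch of the failure mode in the traceback above: a Gloo receiver
# blocks in broadcast_object_list while the source rank never broadcasts,
# so the recv eventually times out. The address, port, and 10 s timeout
# below are illustrative assumptions, not OpenRLHF or vLLM code.
import datetime
import os
import time

import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29555"
    dist.init_process_group(
        backend="gloo",
        rank=rank,
        world_size=world_size,
        # Gloo's default is 30 minutes -- the 1800000 ms seen in the log.
        timeout=datetime.timedelta(seconds=10),
    )
    if rank == 0:
        # The "driver" is stuck elsewhere and never issues the matching broadcast.
        time.sleep(30)
    else:
        payload = [None]
        # Expected to fail with a "Timed out waiting ... for recv operation to
        # complete" RuntimeError, mirroring the error above (which used the default).
        dist.broadcast_object_list(payload, src=0)
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```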
hijkzzz commented 4 weeks ago

What is your vLLM version?

thirteenflt commented 4 weeks ago

vLLM version is 0.4.3, with nvidia-nccl-cu12==2.20.5.
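
For reference, a quick way to confirm which versions are actually installed in the environment (a generic snippet; the package list is just the set relevant to this traceback):

```python
# Print the installed versions of the packages involved in this issue.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("vllm", "nvidia-nccl-cu12", "torch", "ray"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```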

hijkzzz commented 4 weeks ago

vLLM version is 0.4.3, with nvidia-nccl-cu12==2.20.5.

Do you use our Docker container (https://github.com/OpenLLMAI/OpenRLHF/tree/main/dockerfile) and NCCL between multiple nodes (such as over InfiniBand)? I recommend trying vLLM 0.4.2 first, because we haven't tested 0.4.3 thoroughly enough.
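
If NCCL across multiple nodes is involved, a bare torch.distributed all_reduce is a quick way to check inter-node connectivity independently of OpenRLHF and vLLM. This is a generic sanity check launched with torchrun on each node; the launch flags below assume a two-node, one-GPU-per-node setup:

```python
# nccl_check.py -- minimal multi-node NCCL sanity check (generic, not OpenRLHF code).
# Example launch on each of two nodes:
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0|1> \
#            --master_addr=<node0-ip> --master_port=29500 nccl_check.py
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

# Each rank contributes its rank id; after all_reduce (sum), every rank
# should print world_size * (world_size - 1) / 2 if NCCL works across nodes.
t = torch.tensor([float(rank)], device="cuda")
dist.all_reduce(t)
print(f"rank {rank}/{world_size}: all_reduce result = {t.item()}")

dist.destroy_process_group()
```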