luo-li-ba-suo opened this issue 6 days ago
Should we merge the LoRA weights into the base model before broadcasting, or just broadcast the LoRA weights themselves? The latter would reduce communication overhead.
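As a rough size comparison (using the embed_tokens tensor from the log below purely as a size reference, and an assumed LoRA rank of 16, which is not taken from the issue):

```python
# Rough size comparison: one merged weight vs. its LoRA factors.
# Shape taken from the log below; rank 16 is an assumed example value.
vocab, hidden, rank = 128256, 8192, 16
bytes_per_elem = 2  # bfloat16
full = vocab * hidden * bytes_per_elem                  # ~1.96 GiB per broadcast
lora = (vocab * rank + rank * hidden) * bytes_per_elem  # ~4.2 MiB per broadcast
print(f"merged: {full / 2**30:.2f} GiB, LoRA factors: {lora / 2**20:.2f} MiB")
```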
Just broadcasting the LoRA weights should work better! But I don't have any spare compute to debug this at the moment. Such a minor change shouldn't go wrong, I guess 🙏
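For reference, a minimal sketch of what "just broadcast the LoRA factors" could look like on the vLLM worker side, mirroring update_weight in vllm_worker_wrap.py. The method name, lora_rank, scaling, and the get_parameter/load_weights plumbing are assumptions for illustration, not OpenRLHF's actual API:

```python
import torch
import torch.distributed as dist

def update_weight_lora(self, name, dtype, base_shape, lora_rank, scaling):
    """Hypothetical variant of update_weight: receive the LoRA factors for
    `name` instead of a pre-merged weight, then merge locally."""
    out_features, in_features = base_shape
    # Empty buffers that rank 0 (the trainer) broadcasts the factors into.
    lora_b = torch.empty((out_features, lora_rank), dtype=dtype, device="cuda")
    lora_a = torch.empty((lora_rank, in_features), dtype=dtype, device="cuda")
    dist.broadcast(lora_b, 0, group=self._model_update_group)
    dist.broadcast(lora_a, 0, group=self._model_update_group)
    # Merge into the frozen base weight the vLLM worker already holds.
    base = self.model_runner.model.get_parameter(name)  # assumed accessor
    merged = base.data + scaling * (lora_b @ lora_a)
    self.model_runner.model.load_weights(weights=[(name, merged)])
```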
After updating the broadcast code, the following error sometimes occurs; I'm not sure whether it's caused by my change.
(LLMRayActor pid=1112, ip=172.26.5.7) update weight: model.embed_tokens.weight, dtype: torch.bfloat16, shape: torch.Size([128256, 8192])
(LLMRayActor pid=1112, ip=172.26.5.7)
(LLMRayActor pid=1112, ip=172.26.5.7) job-hvjmrsm8sqf4r2h5q2tl-worker-0:1112:13532 [0] ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error Connection timed out
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] Error executing method update_weight. This might cause deadlock in distributed execution.
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] Traceback (most recent call last):
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 137, in execute_method
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] return executor(*args, **kwargs)
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] File "/tmp/ray/session_2024-06-30_05-59-44_592613_569/runtime_resources/working_dir_files/_ray_pkg_4323eb844e717d61/openrlhf/trainer/ray/vllm_worker_wrap.py", line 43, in update_weight
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] torch.distributed.broadcast(weight, 0, group=self._model_update_group)
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] return func(*args, **kwargs)
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2140, in broadcast
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] work = group.broadcast([tensor], opts)
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] Last error:
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] Call to ibv_modify_qp failed with error Connection timed out
Traceback (most recent call last):
File "/tmp/ray/session_2024-06-30_05-59-44_592613_569/runtime_resources/working_dir_files/_ray_pkg_4323eb844e717d61/examples/train_ppo_ray.py", line 297, in
With this change, running 70B PPO on two nodes becomes much more manageable. I've tried it and it works.