luo-li-ba-suo opened this issue 6 days ago
Should we merge the LoRA weights into the base model before broadcasting, or just broadcast the LoRA weights themselves? The latter would reduce communication overhead.
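As a rough size comparison (using the embed_tokens tensor from the log below purely as a size reference, and an assumed LoRA rank of 16, which is not taken from the issue):

```python
# Rough size comparison: one merged weight vs. its LoRA factors.
# Shape taken from the log below; rank 16 is an assumed example value.
vocab, hidden, rank = 128256, 8192, 16
bytes_per_elem = 2  # bfloat16
full = vocab * hidden * bytes_per_elem                  # ~1.96 GiB per broadcast
lora = (vocab * rank + rank * hidden) * bytes_per_elem  # ~4.2 MiB per broadcast
print(f"merged: {full / 2**30:.2f} GiB, LoRA factors: {lora / 2**20:.2f} MiB")
```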
Just broadcasting the LoRA weights should work better! But I don't have any spare compute to debug this at the moment. Such a minor change shouldn't go wrong, I guess 🙏
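For reference, a minimal sketch of what "just broadcast the LoRA factors" could look like on the vLLM worker side, mirroring update_weight in vllm_worker_wrap.py. The method name, lora_rank, scaling, and the get_parameter/load_weights plumbing are assumptions for illustration, not OpenRLHF's actual API:

```python
import torch
import torch.distributed as dist

def update_weight_lora(self, name, dtype, base_shape, lora_rank, scaling):
    """Hypothetical variant of update_weight: receive the LoRA factors for
    `name` instead of a pre-merged weight, then merge locally."""
    out_features, in_features = base_shape
    # Empty buffers that rank 0 (the trainer) broadcasts the factors into.
    lora_b = torch.empty((out_features, lora_rank), dtype=dtype, device="cuda")
    lora_a = torch.empty((lora_rank, in_features), dtype=dtype, device="cuda")
    dist.broadcast(lora_b, 0, group=self._model_update_group)
    dist.broadcast(lora_a, 0, group=self._model_update_group)
    # Merge into the frozen base weight the vLLM worker already holds.
    base = self.model_runner.model.get_parameter(name)  # assumed accessor
    merged = base.data + scaling * (lora_b @ lora_a)
    self.model_runner.model.load_weights(weights=[(name, merged)])
```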
After updating the broadcast code, the following error sometimes occurs; I'm not sure whether it's caused by my change.
(LLMRayActor pid=1112, ip=172.26.5.7) update weight: model.embed_tokens.weight, dtype: torch.bfloat16, shape: torch.Size([128256, 8192])
(LLMRayActor pid=1112, ip=172.26.5.7)
(LLMRayActor pid=1112, ip=172.26.5.7) job-hvjmrsm8sqf4r2h5q2tl-worker-0:1112:13532 [0] ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error Connection timed out
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] Error executing method update_weight. This might cause deadlock in distributed execution.
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] Traceback (most recent call last):
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 137, in execute_method
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] return executor(*args, **kwargs)
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] File "/tmp/ray/session_2024-06-30_05-59-44_592613_569/runtime_resources/working_dir_files/_ray_pkg_4323eb844e717d61/openrlhf/trainer/ray/vllm_worker_wrap.py", line 43, in update_weight
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] torch.distributed.broadcast(weight, 0, group=self._model_update_group)
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] return func(*args, **kwargs)
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2140, in broadcast
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] work = group.broadcast([tensor], opts)
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] Last error:
(LLMRayActor pid=1112, ip=172.26.5.7) ERROR 06-30 06:10:12 worker_base.py:145] Call to ibv_modify_qp failed with error Connection timed out
Traceback (most recent call last):
File "/tmp/ray/session_2024-06-30_05-59-44_592613_569/runtime_resources/working_dir_files/_ray_pkg_4323eb844e717d61/examples/train_ppo_ray.py", line 297, in
With this change, running 70B PPO on two nodes becomes much more manageable. I've tried it and it works.