karthik-nexusflow opened this issue 1 month ago
QLoRA only supports ZeRO2. For ZeRO3, please use LoRA.
Tried running with ZeRO2; it runs fine until the first epoch, but fails during the vLLM weight update:
KeyError: 'base_model.model.model.norm.weight'
(ActorModelRayActor pid=1342974) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::LLMRayActor.update_weight() (pid=1343444, ip=0.0.0.0, actor_id=587ce79997d309074618728202000000, repr=<openrlhf.trainer.ray.vllm_engine.LLMRayActor object at 0x7fc43c3e9bd0>)
(ActorModelRayActor pid=1342974) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ActorModelRayActor pid=1342974) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ActorModelRayActor pid=1342974) File "/tmp/ray/session_2024-05-19_00-15-15_748412_1317547/runtime_resources/working_dir_files/_ray_pkg_113e512fb2c5c1c0/openrlhf/trainer/ray/vllm_engine.py", line 90, in update_weight
(ActorModelRayActor pid=1342974) self.llm.llm_engine._run_workers("update_weight", name, dtype, shape, empty_cache)
(ActorModelRayActor pid=1342974) File "/root/miniconda3/envs/open/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 750, in _run_workers
(ActorModelRayActor pid=1342974) self._run_workers_in_batch(workers, method, *args, **kwargs))
(ActorModelRayActor pid=1342974) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ActorModelRayActor pid=1342974) File "/root/miniconda3/envs/open/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 724, in _run_workers_in_batch
(ActorModelRayActor pid=1342974) output = executor(*args, **kwargs)
(ActorModelRayActor pid=1342974) ^^^^^^^^^^^^^^^^^^^^^^^^^
(ActorModelRayActor pid=1342974) File "/tmp/ray/session_2024-05-19_00-15-15_748412_1317547/runtime_resources/working_dir_files/_ray_pkg_113e512fb2c5c1c0/openrlhf/trainer/ray/vllm_engine.py", line 53, in update_weight
(ActorModelRayActor pid=1342974) self.model_runner.model.load_weights(model_name_or_path={name: weight})
(ActorModelRayActor pid=1342974) File "/root/miniconda3/envs/open/lib/python3.11/site-packages/vllm/model_executor/models/mistral.py", line 329, in load_weights
(ActorModelRayActor pid=1342974) param = params_dict[name]
(ActorModelRayActor pid=1342974) ~~~^^^^^^
(ActorModelRayActor pid=1342974) KeyError: 'base_model.model.lm_head.weight'
We did not implement vLLM support for LoRA.
We probably need a remote function that inserts the LoRA adapters when sent a command. We would execute it the first time we need to update the LoRA weights, then update the weights after that. Is this a viable approach? Without vLLM, generation would be pretty slow; how are you guys handling that?
The easiest way is to merge the LoRA weights into the base model before syncing the weights.
Hi team, I'm getting the following error while enabling 4-bit quantization and LoRA.