arcee-ai / mergekit

Tools for merging pretrained large language models.
GNU Lesser General Public License v3.0

Qwen2.5 LoRA Extraction not working in vLLM & Aphrodite Engine #459

Open Nero10578 opened 5 days ago

Nero10578 commented 5 days ago

Usually you can use LoRA extraction in mergekit and then run the extracted LoRAs in vLLM or Aphrodite Engine just fine. This has worked for Llama and Mistral models so far, but it doesn't seem to work for Qwen2.5 models.

If I use a LoRA created by actually LoRA-training with Axolotl, vLLM and Aphrodite Engine run Qwen LoRAs just fine.

The extraction itself also seems to complete without issues; the resulting adapter just can't be loaded.
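For reference, this is roughly how the extracted adapter gets loaded (a minimal sketch using vLLM's Python API; the model name and adapter path are taken from the traceback below, not the exact command I ran):

    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    # Base model plus the mergekit-extracted adapter (path as in the traceback)
    llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct", enable_lora=True)

    outputs = llm.generate(
        ["def fibonacci(n):"],
        SamplingParams(max_tokens=64),
        lora_request=LoRARequest(
            "qwen-coder-lora", 1, "/home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora"
        ),
    )
    print(outputs[0].outputs[0].text)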

Error traceback from Aphrodite Engine when trying to run the Qwen2.5-7B LoRA:

ValueError: While loading /home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora, expected target modules in ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'] but received ['lm_head', 'lm_head', 'model.embed_tokens', 'model.embed_tokens']. Please verify that the loaded LoRA module is correct

Full traceback:

Future exception was never retrieved
future: <Future finished exception=RuntimeError('Loading lora /home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora failed')>
Traceback (most recent call last):
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/lora/worker_manager.py", line 92, in _load_adapter
    lora = self._lora_model_cls.from_local_checkpoint(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/lora/models.py", line 221, in from_local_checkpoint
    raise ValueError(
ValueError: While loading /home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora, expected target modules in ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'] but received ['lm_head', 'lm_head', 'model.embed_tokens', 'model.embed_tokens']. Please verify that the loaded LoRA module is correct

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/endpoints/openai/rpc/server.py", line 119, in generate
    async for request_output in results_generator:
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 917, in generate
    async for output in await self.add_request(
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 110, in generator
    raise result
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 51, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 784, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 727, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 283, in step_async
    output = await self.model_executor.execute_model_async(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/executor/gpu_executor.py", line 163, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/miniconda3/envs/aphrodite/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/task_handler/worker_base.py", line 301, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/miniconda3/envs/aphrodite/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 1494, in execute_model
    self.set_active_loras(model_input.lora_requests,
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 1140, in set_active_loras
    self.lora_manager.set_active_adapters(lora_requests, lora_mapping)
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/lora/worker_manager.py", line 135, in set_active_adapters
    set_active_adapters_worker(requests, mapping, self._apply_adapters,
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/adapter_commons/utils.py", line 52, in set_active_adapters_worker
    apply_adapters_func(requests)
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/lora/worker_manager.py", line 194, in _apply_adapters
    self.add_adapter(lora)
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/lora/worker_manager.py", line 203, in add_adapter
    lora = self._load_adapter(lora_request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/lora/worker_manager.py", line 105, in _load_adapter
    raise RuntimeError(f"Loading lora {lora_path} failed") from e
RuntimeError: Loading lora /home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora failed
jukofyork commented 2 days ago

ValueError: While loading /home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora, expected target modules in ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'] but received ['lm_head', 'lm_head', 'model.embed_tokens', 'model.embed_tokens']. Please verify that the loaded LoRA module is correct

It doesn't like the input and output embeddings in the LoRA adapter.

They are valid to have in a LoRA, but it's a bit odd that it lists them both twice?!
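A quick way to confirm exactly which modules ended up in the extracted adapter is to list the tensor names directly (a sketch; this assumes the weights live in the usual adapter_model.safetensors file inside the adapter directory):

    from safetensors import safe_open

    # Print every LoRA tensor in the extracted adapter to see which
    # modules it actually targets (and whether any appear twice)
    adapter = "/home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora/adapter_model.safetensors"
    with safe_open(adapter, framework="pt") as f:
        for key in sorted(f.keys()):
            print(key)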

Can you try commenting out these two module_details.append lines and replacing them with a pass, like so:

        if module == pretrained_model.get_input_embeddings():
            # if isinstance(module, torch.nn.Embedding):
            pass #module_details.append(("embedding", name, module.weight.size()))   
        elif module == pretrained_model.get_output_embeddings():
            # if isinstance(module, torch.nn.Embedding):
            pass #module_details.append(("output", name, module.weight.size()))

and see if the LoRA it creates works OK?

Also, can you tell me what the peak VRAM use is with these commented out? That would help with your other problem of high VRAM use: if it's just these modules causing it, I can easily add a command-line option to skip the input/output embeddings, but if it still uses a lot of VRAM it must be something in the SVD function that upcasts some tensors to float32.
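If skipping them turns out to fix both problems, the option would just gate those same two branches behind a flag, roughly like this (a hypothetical --skip-embeddings sketch, not what the script does today):

    # Hypothetical: only record the embedding/output modules when the user
    # has not asked to skip them via a --skip-embeddings style option
    if not skip_embeddings and module == pretrained_model.get_input_embeddings():
        module_details.append(("embedding", name, module.weight.size()))
    elif not skip_embeddings and module == pretrained_model.get_output_embeddings():
        module_details.append(("output", name, module.weight.size()))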


The double listing in the exception makes me think it could also be something to do with having tied input/output embeddings, but I think only the very tiny Qwen models use those.

You can tell if you look in the config.json file:

"tie_word_embeddings": false

or by looking in the model.safetensors.index.json file to see whether both of these are listed:

"lm_head.weight": "model-00037-of-00037.safetensors"
"model.embed_tokens.weight": "model-00001-of-00037.safetensors",
Nero10578 commented 1 day ago

Will try this and get back to you. Thanks!