InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] AWQ Model Fails Loading Adapter #1915

Open vladrad opened 5 days ago

vladrad commented 5 days ago


Describe the bug

When running the repo example, I chose the YurtsAI/Meta-Llama-3-8B-Instruct-AWQ model and the traderpedroso/llama3-8b-lora adapter.

I know the adapter was trained on the 4-bit base model. I'm not sure whether this works with AWQ.

    self.engine = Engine(model_path=model_path,
  File "/home/merlin/code/kreacher/venv/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 153, in __init__
    _paging_adapters(adapters,
  File "/home/merlin/code/kreacher/venv/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 68, in _paging_adapters
    model_agent.paging_adapters(weight_maps)
  File "/home/merlin/code/kreacher/venv/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 715, in paging_adapters
    weight_map.cache_adapter(lora_linears, cpu_caches)
  File "/home/merlin/code/kreacher/venv/lib/python3.10/site-packages/lmdeploy/pytorch/adapter/adapter.py", line 226, in cache_adapter
    assert len(lora_linears) == len(caches), (
AssertionError: len(lora_linears) == len(caches)

If I comment out the assert len(lora_linears) == len(caches) check, the adapter merges... but I'm not sure whether it's supposed to work like that or not.
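
For context, here is a minimal sketch of what that assertion guards, assuming the engine pairs each LoRA-targeted linear layer with one CPU cache slot for the adapter weights (illustration only, not lmdeploy's actual code; cache_adapter_sketch is a made-up name):

# Illustrative sketch only -- not lmdeploy's actual implementation.
def cache_adapter_sketch(lora_linears: dict, caches: list) -> None:
    """Pair each LoRA-targeted linear layer with its CPU cache slot."""
    # One cache slot is expected per targeted linear; a length mismatch
    # means the adapter's weight map and the linears collected from the
    # model disagree about which layers carry LoRA weights.
    assert len(lora_linears) == len(caches), (
        f'got {len(lora_linears)} lora linears but {len(caches)} caches')
    for (name, _linear), cache in zip(lora_linears.items(), caches):
        ...  # copy the adapter's A/B weights for `name` into `cache`

If the AWQ checkpoint's quantized linears are not collected as LoRA targets, the two lengths diverge, so removing the assert only hides the mismatch rather than fixing it.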

Reproduction

My script:

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

backend_config = PytorchEngineConfig(session_len=2048,
                                     adapters=dict(lora_name_1='traderpedroso/llama3-8b-lora'))
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('YurtsAI/Meta-Llama-3-8B-Instruct-AWQ',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': '您猜怎么着'
}]]
response = pipe(prompts, gen_config=gen_config, adapter_name='lora_name_1')
print(response)

Environment

Running the latest version of LMDeploy.

Error traceback

No response

lvhan028 commented 5 days ago

4-bit inference in the PyTorch engine is still under development: https://github.com/InternLM/lmdeploy/pull/1913

We are implementing support for 4-bit quantized models (the AWQ quantization method) in the PyTorch engine (#1913). Stay tuned.

vladrad commented 4 days ago

Wow, you all are fast.

vladrad commented 4 days ago

Let me know if I can help out. I'd be happy to test; I'm also capable of coding, but this area is not my expertise :D. So would this mean any LoRA adapter should be able to mount on top of an AWQ-quantized model, or do I need to fine-tune on an AWQ model? It seems like the LoRA adapter would just be mounted on top.

You all are amazing.

grimoire commented 1 day ago

https://github.com/InternLM/lmdeploy/pull/1913

PyTorchEngine uses AwqLoraLinear. Adapters can be applied to an AWQ model without fine-tuning: the base linear is forwarded with w4a16 support while the adapters run in fp16.
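
For illustration, a minimal sketch of that idea, assuming the base projection goes through a w4a16 kernel while the LoRA branch stays in fp16 (the class name and the plain fp16 matmul standing in for the quantized GEMM are placeholders, not lmdeploy's actual AwqLoraLinear):

import torch
import torch.nn as nn

class AwqLoraLinearSketch(nn.Module):
    """Quantized base projection plus an fp16 LoRA residual (sketch)."""

    def __init__(self, in_features: int, out_features: int, rank: int = 16):
        super().__init__()
        # Stand-in for the AWQ layer's packed 4-bit weights + scales/zeros.
        self.base_weight = nn.Parameter(
            torch.randn(out_features, in_features, dtype=torch.float16),
            requires_grad=False)
        # Adapter weights stay in fp16; no re-training on the AWQ model.
        self.lora_a = nn.Linear(in_features, rank, bias=False, dtype=torch.float16)
        self.lora_b = nn.Linear(rank, out_features, bias=False, dtype=torch.float16)
        self.scaling = 1.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base path: the real engine runs a fused w4a16 GEMM here;
        # a plain fp16 matmul stands in for it in this sketch.
        base = x @ self.base_weight.t()
        # Adapter path: ordinary LoRA in fp16, added onto the base output.
        return base + self.lora_b(self.lora_a(x)) * self.scaling

y = AwqLoraLinearSketch(4096, 4096)(torch.randn(2, 4096, dtype=torch.float16))

This is why the adapter does not need to be fine-tuned against the AWQ weights: the LoRA branch is a separate fp16 path added onto the base output and never touches the quantized weights.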