InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] lmdeploy pytorch backend slora cannot use TP=2 #1512

Closed. jjjjohnson closed this issue 4 months ago.

jjjjohnson commented 4 months ago

Describe the bug

It looks like, in the PyTorch backend, S-LoRA and TP cannot be enabled at the same time...


Reproduction

lmdeploy serve api_server \
    /path/to/qwen14bchat \
    --model-name qwen-14b \
    --server-port 23333 \
    --backend pytorch \
    --cache-max-entry-count 0.95 \
    --max-batch-size 256 --tp 2 \
    --adapters adap1=path2adap1 

Environment

lmdeploy 0.3.0

Error traceback

No response

grimoire commented 4 months ago

We have not supported S-LoRA TP on all models, since different models have different qkv linear projections. If you can provide the adapter, we might be able to add the support you need.
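
For context, a minimal sketch of why the fused layout is the obstacle (shapes and names are illustrative, not lmdeploy's code): with separated q/k/v projections, each linear, and its lora_B, can be column-sharded independently, while a naive shard of a fused qkv weight crosses projection boundaries.

import torch

hidden, tp_size = 8, 2
# Qwen-style fused c_attn: [q; k; v] packed along the output dim
c_attn_weight = torch.randn(3 * hidden, hidden)

# naive column-parallel shard: rank 0 gets all of q plus half of k,
# so the q/k/v boundaries (and any per-projection LoRA B) are broken
naive_shard = c_attn_weight.chunk(tp_size, dim=0)[0]
print(naive_shard.shape)  # torch.Size([12, 8])

# separated projections shard cleanly, one projection at a time
q, k, v = c_attn_weight.split(hidden, dim=0)
clean_shard = torch.cat([t.chunk(tp_size, dim=0)[0] for t in (q, k, v)], dim=0)
print(clean_shard.shape)  # torch.Size([12, 8])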

jjjjohnson commented 4 months ago

Hi @grimoire, can you show which model types currently support S-LoRA TP? I will try to solve the issue and submit a PR.

jjjjohnson commented 4 months ago

Qwen14b chat: [screenshot]

Qwen14b chat with lora: [screenshot]

And in qwen.py, mod.register_parameter(name, dist_param) is not able to handle the nested c_attn: [screenshot]
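
If the failure is PyTorch's rule that register_parameter rejects dotted names, a small reproduction (module names are stand-ins, not Qwen's actual structure) looks like this; the owning submodule has to be resolved first:

import torch
import torch.nn as nn

root = nn.Module()
root.attn = nn.Module()
root.attn.c_attn = nn.Linear(8, 24)  # stand-in for Qwen's fused projection
dist_param = nn.Parameter(torch.zeros(24, 8))

# register_parameter only accepts flat names; a nested name raises KeyError
try:
    root.register_parameter("attn.c_attn.weight", dist_param)
except KeyError as err:
    print(err)  # parameter name can't contain "."

# resolve the owner of the nested parameter first, then register on it
owner = root.get_submodule("attn.c_attn")
owner.register_parameter("weight", dist_param)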

grimoire commented 4 months ago

All models with separated q, k, v projections support TP. chatglm2 has a similar pattern to Qwen:

https://github.com/InternLM/lmdeploy/blob/89fc8504d537f8b02cb368dcd808438f55f13b55/lmdeploy/pytorch/models/chatglm2.py#L124-L137
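
Roughly, the pattern referenced there amounts to splitting the packed weight into its q/k/v sections before sharding, so each rank keeps whole projections. A hedged sketch, assuming equal q/k/v sizes and not lmdeploy's actual API:

import torch

def shard_packed_qkv(weight, hidden, tp_size, rank):
    # split per projection first so the q/k/v boundaries stay intact,
    # then re-pack the per-rank shards in the fused layout
    q, k, v = weight.split(hidden, dim=0)
    return torch.cat([t.chunk(tp_size, dim=0)[rank] for t in (q, k, v)], dim=0)

w = torch.randn(3 * 8, 8)
print(shard_packed_qkv(w, hidden=8, tp_size=2, rank=0).shape)  # torch.Size([12, 8])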

jjjjohnson commented 4 months ago

Thanks @grimoire. Qwen has a combined q,k,v projection called c_attn. It looks like Baichuan is also very similar to Qwen.