InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] support s-lora in turbomind backend #2458

Open torinchen opened 2 weeks ago

torinchen commented 2 weeks ago

Motivation

In downstream tasks, LoRA is one of the most common ways to fine-tune an LLM. Inference speed degrades badly going from [turbomind backend + merged LoRA] to [pytorch backend + merged LoRA] to [pytorch backend + S-LoRA] (roughly 1x to 0.6x to 0.4x). Is there any chance of getting [turbomind backend + S-LoRA] to shorten the chain and boost the speed?
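
For reference, the "merged LoRA" path in the comparison above can be reproduced with PEFT before handing the model to TurboMind. This is a minimal sketch of that workflow, not lmdeploy's own tooling; all paths below are placeholders:

```python
# Sketch of the "merge LoRA" route: fold the adapter into the base weights
# with PEFT, save the merged checkpoint, then serve it with TurboMind.
# All paths are placeholders, not values taken from this issue.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base-model", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "path/to/lora-adapter").merge_and_unload()

merged.save_pretrained("path/to/merged-model")
AutoTokenizer.from_pretrained("path/to/base-model").save_pretrained("path/to/merged-model")
# The merged checkpoint can then be served with the TurboMind backend, e.g.:
#   lmdeploy serve api_server path/to/merged-model
```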

Related resources

No response

Additional context

No response

lzhangzz commented 2 weeks ago

How many adapters do you need? Turbomind will only support lora without the "s-" in the future.

torinchen commented 2 weeks ago

> How many adapters do you need? Turbomind will only support lora without the "s-" in the future.

ok~, typically more than 2 adapters in deployment; S-LoRA can save GPU memory, I guess.

zzf2grx commented 13 hours ago

> How many adapters do you need? Turbomind will only support lora without the "s-" in the future.

> ok~, typically more than 2 adapters in deployment; S-LoRA can save GPU memory, I guess.

I agree! In deployment we sometimes need more than 2 adapters for different jobs, so it would be meaningful if TurboMind supported S-LoRA.
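
For context, serving several adapters at once is what the PyTorch backend's S-LoRA support already allows. A minimal sketch of that path, assuming the `adapters` option of `PytorchEngineConfig`; the adapter names and paths are placeholders, and exact keyword arguments may differ across lmdeploy versions:

```python
# Sketch of multi-adapter (S-LoRA) serving on the PyTorch backend today.
# Adapter names and paths are placeholders.
from lmdeploy import pipeline, PytorchEngineConfig

backend_config = PytorchEngineConfig(
    adapters={"task_a": "path/to/lora_a", "task_b": "path/to/lora_b"})
pipe = pipeline("path/to/base-model", backend_config=backend_config)

# Route each request to the adapter for its job.
print(pipe(["Summarize this ticket."], adapter_name="task_a"))
print(pipe(["Translate this ticket to French."], adapter_name="task_b"))
```

The request in this issue is effectively for the TurboMind backend to accept a similar multi-adapter configuration.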