Open · torinchen · opened 2 weeks ago
How many adapters do you need? Turbomind will only support lora without the "s-" in the future.
ok~, typically more than 2 adapters in deployment; s-lora can save GPU memory, I guess.
I agree! In deployment we sometimes need more than 2 adapters for different jobs, so it would be meaningful if turbomind supported s-lora.
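For concreteness, a multi-adapter deployment like the one described above can already be expressed with the pytorch backend. This is a minimal sketch based on the pytorch backend's documented `adapters` option; the model path and adapter names are placeholders, and exact option names may vary by lmdeploy version:

```python
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

# Register several unmerged LoRA adapters with the pytorch backend;
# the base weights are shared across adapters, so each extra adapter
# costs comparatively little GPU memory (the s-lora idea).
backend_config = PytorchEngineConfig(
    adapters=dict(task_a='path/to/lora-adapter-a',
                  task_b='path/to/lora-adapter-b'))
pipe = pipeline('path/to/base-model', backend_config=backend_config)

# Route each request to the adapter for its job.
response = pipe(['example prompt'],
                gen_config=GenerationConfig(max_new_tokens=128),
                adapter_name='task_a')
print(response)
```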
Motivation
In downstream tasks, LoRA is one of the most common ways to finetune an LLM. Inference speed degrades badly from [turbomind backend + merged lora] to [pytorch backend + merged lora] to [pytorch backend + s-lora] (roughly 1x to 0.6x to 0.4x). Is there any chance of getting [turbomind backend + s-lora] to shorten this chain and boost the speed?
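For comparison, the [turbomind backend + merged lora] path requires folding each adapter into its own full copy of the base weights offline. A sketch using peft's standard merge API (paths are placeholders); because every merged model is a full-size copy, serving several adapters this way multiplies GPU memory, which is the cost s-lora avoids:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Merge one LoRA adapter into the base weights and save the result;
# the merged model can then be served with the turbomind backend.
base = AutoModelForCausalLM.from_pretrained('path/to/base-model')
model = PeftModel.from_pretrained(base, 'path/to/lora-adapter')
merged = model.merge_and_unload()
merged.save_pretrained('path/to/merged-model')
```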
Related resources
No response
Additional context
No response