Closed: bks5881 closed this issue 5 months ago
The TurboMind engine doesn't support S-LoRA.
How many adapters do you need for 1 server instance?
I would say, to begin with, at least 1?
Reopening it for further discussion.
If only 1 adapter is needed, you can just merge it into the original model, and this will give you the fastest speed.
In the near future (maybe in June), TurboMind is going to support the simpler case where storing all adapters in VRAM is acceptable.
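For the single-adapter case, the merge can be done offline with PEFT before serving; a minimal sketch, reusing the model and adapter path from the reproduction command below (the output directory name is a placeholder):

```python
# Minimal sketch: fold a LoRA adapter into its base model with PEFT,
# then serve the merged checkpoint with the TurboMind backend.
# The output directory "llama3-70b-merged" is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "v2ray/Llama-3-70B-Instruct", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "/home/user/project/trained_lora")

merged = model.merge_and_unload()            # fold the LoRA weights into the base
merged.save_pretrained("llama3-70b-merged")

tokenizer = AutoTokenizer.from_pretrained("v2ray/Llama-3-70B-Instruct")
tokenizer.save_pretrained("llama3-70b-merged")
```

The merged directory can then be passed to `lmdeploy serve api_server` without `--adapters`.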
Sometimes we need an adapter for certain specialized tasks while the base model (Llama/InternLM/Qwen, etc.) still handles the common work. Merging a LoRA adapter sometimes makes the model better only at the new tasks. So this feature would be very helpful.
Well, ideally I want to avoid merging weights; I'd like to have 5-5 LoRA adapters without merging them.
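For reference, multiple adapters can already be served with the PyTorch engine. A rough sketch follows; the `adapters` dict of `PytorchEngineConfig` and the `adapter_name` field of `GenerationConfig` are based on lmdeploy's LoRA documentation as I recall it (check the docs for exact names), and the second adapter path is hypothetical:

```python
# Rough sketch of multi-adapter inference with the PyTorch engine.
# adapters= and adapter_name= are assumptions from lmdeploy's LoRA docs;
# "another_lora" is a hypothetical second adapter.
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

backend_config = PytorchEngineConfig(
    adapters=dict(
        lora_a="/home/user/project/trained_lora",   # path from the reproduction below
        lora_b="/home/user/project/another_lora",   # hypothetical second adapter
    )
)
pipe = pipeline("v2ray/Llama-3-70B-Instruct", backend_config=backend_config)

# Route a request to a specific adapter; omit adapter_name to use the base model.
out = pipe(["Summarize this ticket."],
           gen_config=GenerationConfig(adapter_name="lora_a"))
print(out[0].text)
```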
Checklist
Describe the bug
I would like to launch an OpenAI-compatible endpoint with LoRA adapters, but I want to use the TurboMind engine rather than the PyTorch engine, as inference on the latter is very slow.
Reproduction
lmdeploy serve api_server v2ray/Llama-3-70B-Instruct --tp 4 --server-port 40047 --server-name 0.0.0.0 --adapters /home/user/project/trained_lora
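Once the server is up, a request can select the adapter through the model name of the OpenAI-compatible API. A hedged sketch, assuming the adapter is exposed under the name it was registered with (for example via `--adapters trained_lora=/home/user/project/trained_lora`); the name `trained_lora` here is an assumption, not taken from the command above:

```python
# Hedged sketch: query the api_server started above through its
# OpenAI-compatible endpoint. The model name "trained_lora" is an
# assumed adapter name; use the base model name for base-only inference.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:40047/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="trained_lora",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```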
Environment
Error traceback