fearnworks opened 11 months ago
Hey, quick question. Since vLLM doesn't support LoRA, how are you planning to have different experts loaded at the same time? I'm asking because I've been trying to figure out the same thing but didn't get anywhere.
We are currently exploring adding a custom mistral_moe model to vLLM to handle loading the LoRA weights and any changes we need to apply to the forward passes.
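For context, a minimal sketch of the general idea (per-expert LoRA deltas applied on top of frozen base projections in a Mixtral-style expert MLP). This is plain PyTorch with hypothetical class and parameter names, not vLLM internals or this project's actual mistral_moe implementation:

```python
# Illustrative sketch only: names like LoRALinear / LoRAExpertMLP are made up here.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank LoRA delta: y = xW^T + s * xA^T B^T."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        self.lora_a = nn.Parameter(torch.zeros(rank, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.lora_a, std=0.02)
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling


class LoRAExpertMLP(nn.Module):
    """One expert's MLP with LoRA applied to its up/down projections."""

    def __init__(self, hidden: int, intermediate: int, rank: int = 16):
        super().__init__()
        self.up = LoRALinear(nn.Linear(hidden, intermediate, bias=False), rank)
        self.down = LoRALinear(nn.Linear(intermediate, hidden, bias=False), rank)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))
```

In a custom model, each expert would carry its own LoRA A/B matrices while sharing the frozen base weights, which is what would need to be wired into the forward pass.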
Ah I see. Makes sense. Would that work with tensor parallel too?
Pivoting this to a general FastAPI server.
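For anyone following along, a minimal sketch of what a general FastAPI generation endpoint could look like; the route name, request fields, and the `run_model` stub are placeholders, not this project's actual server:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256


class GenerateResponse(BaseModel):
    text: str


def run_model(prompt: str, max_tokens: int) -> str:
    # Stub standing in for the real inference engine call.
    return prompt[:max_tokens]


@app.post("/generate", response_model=GenerateResponse)
async def generate(req: GenerateRequest) -> GenerateResponse:
    # Swap the stub for the actual model/engine call.
    return GenerateResponse(text=run_model(req.prompt, req.max_tokens))
```

Run with e.g. `uvicorn server:app --reload` and POST a JSON body like `{"prompt": "hello"}` to `/generate`.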