SkunkworksAI / hydra-moe


Draft: API Server #18

Open fearnworks opened 11 months ago

fearnworks commented 11 months ago
nivibilla commented 11 months ago

Hey, quick question: since vLLM doesn't support LoRA, how are you planning to have different experts loaded at the same time? I'm asking because I've been trying to figure out the same thing but haven't gotten anywhere.

fearnworks commented 11 months ago

We are currently exploring adding a custom mistral_moe model to vLLM to handle loading the LoRA weights, along with any changes we need to apply to the forward passes.
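
For illustration, here's a minimal sketch of what a per-expert LoRA forward pass can look like. This is not hydra-moe's actual code: the `MultiLoRALinear` class, the rank/alpha values, and the `active_expert` switch are all illustrative placeholders for whatever routing logic sits above.

```python
# Minimal sketch (not the hydra-moe implementation): wrap a base linear
# layer so its forward pass adds a low-rank LoRA delta, with one (A, B)
# pair per "expert" that can be switched at runtime.
import torch
import torch.nn as nn


class MultiLoRALinear(nn.Module):
    """Base linear layer plus per-expert LoRA deltas (illustrative names)."""

    def __init__(self, base: nn.Linear, num_experts: int,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.scaling = alpha / rank
        # One low-rank (A, B) pair per expert; only these are adapter weights.
        self.lora_a = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
             for _ in range(num_experts)]
        )
        self.lora_b = nn.ParameterList(
            [nn.Parameter(torch.zeros(base.out_features, rank))
             for _ in range(num_experts)]
        )
        self.active_expert = 0  # set by whatever routing logic sits above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.lora_a[self.active_expert]
        b = self.lora_b[self.active_expert]
        # y = x W^T + ((x A^T) B^T) * scaling -- the standard LoRA formulation
        return self.base(x) + (x @ a.T) @ b.T * self.scaling


layer = MultiLoRALinear(nn.Linear(64, 64), num_experts=4)
layer.active_expert = 2
out = layer(torch.randn(1, 64))
```

The appeal of this shape is that the frozen base weights are loaded once and only the small adapter matrices differ per expert, so switching experts doesn't require reloading the full model.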

nivibilla commented 11 months ago

Ah I see. Makes sense. Would that work with tensor parallel too?

fearnworks commented 11 months ago

Pivoting this to a general FastAPI server.
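
As a rough sketch of what a generic FastAPI inference server could look like (the route name, request fields, and `generate()` stub below are placeholders, not the actual API from this draft):

```python
# Sketch of a generic FastAPI generation endpoint. Endpoint names,
# request fields, and the generate() stub are illustrative assumptions.
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class GenerateRequest(BaseModel):
    prompt: str
    expert: Optional[str] = None  # optional adapter/expert to route to
    max_tokens: int = 256


class GenerateResponse(BaseModel):
    text: str


def generate(prompt: str, expert: Optional[str], max_tokens: int) -> str:
    # Placeholder for the real model call (e.g. an engine with the
    # selected LoRA expert applied).
    return f"[{expert or 'base'}] completion for: {prompt[:40]}"


@app.post("/generate", response_model=GenerateResponse)
def generate_endpoint(req: GenerateRequest) -> GenerateResponse:
    return GenerateResponse(
        text=generate(req.prompt, req.expert, req.max_tokens)
    )
```

Assuming the file is named `server.py`, something like this would be launched with `uvicorn server:app`.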