Open niranjanakella opened 5 months ago

Hello @EricLBuehler, opening this issue as part of T5 Seq2Seq model architecture support in mistral.rs. (As discussed)

Relates to: #156
Hi @niranjanakella!
Thank you for opening this issue. Just to clarify, would this be a quantized or non-quantized implementation?
@EricLBuehler A non-quantized f16/f32 implementation takes precedence for now, but if possible I would also like to have a quantized implementation.
Also, I would like to know whether LoRA adapters can be loaded at runtime without merging them into the model. It would be a huge game changer for most applications, given that many developers train multiple adapters. It would be great to be able to attach multiple adapters at runtime.
> A non-quantized f16/f32 implementation takes precedence for now, but if possible I would also like to have a quantized implementation.
Sounds great, I'll get started on an implementation.
> Also, I would like to know whether LoRA adapters can be loaded at runtime without merging them into the model. It would be a huge game changer for most applications, given that many developers train multiple adapters. It would be great to be able to attach multiple adapters at runtime.
We actually have this feature already! There are two ways to do this:
1) Activate adapters at runtime by preloading some and then sending requests to activate them.
2) Use per-request adapter specification for granular control.
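For reference, here is a rough sketch of what the two workflows can look like over the server's HTTP API. It assumes a mistral.rs server running on `localhost:1234` with adapters preloaded at startup, and it uses `reqwest` (with the `blocking` and `json` features) plus `serde_json`. The `/activate_adapters` route, the `adapter_names` payload, and the per-request `adapters` field are illustrative placeholders, so check the docs for the exact request shapes.

```rust
// Rough sketch of the two adapter workflows against a locally running
// mistral.rs server exposing an OpenAI-compatible HTTP API.
// Route and field names below are illustrative placeholders, not the exact API.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let base = "http://localhost:1234";

    // 1) Activate a preloaded adapter so it applies to subsequent requests.
    //    (Hypothetical route and payload, for illustration only.)
    client
        .post(format!("{base}/activate_adapters"))
        .json(&json!({ "adapter_names": ["math_adapter"] }))
        .send()?
        .error_for_status()?;

    // 2) Specify adapters per request for granular control.
    //    (The `adapters` field here is an assumed request extension.)
    let body = client
        .post(format!("{base}/v1/chat/completions"))
        .json(&json!({
            "model": "mistral",
            "messages": [{ "role": "user", "content": "What is 12 * 7?" }],
            "adapters": ["math_adapter"]
        }))
        .send()?
        .text()?;

    println!("{body}");
    Ok(())
}
```

The first call switches the active adapters for all subsequent requests, while the second scopes the adapter choice to a single request.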
Hi @niranjanakella! Sorry for the delay; I have been busy with the Idefics 2 implementation (#309). I should have a prototype ready tonight, though!
@EricLBuehler No problem, sounds good. I am looking forward to trying it out soon.
See: #432.
Hi, is there any news on this? Is the PR in a usable state? I have the exact same use case, albeit with a quantized model.