OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

Support peft's LoRA for HF transformer models. #1186

[Open] Palmik opened this issue 1 year ago

Palmik commented 1 year ago

Context: With HF models, one can use peft to do parameter-efficient fine-tuning, the most popular (and, as far as I know, most performant) method being LoRA.
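For concreteness, attaching a LoRA adapter to an HF model with peft looks roughly like this (the model name and hyperparameters are illustrative):

```python
# Minimal sketch of a LoRA setup with peft; any HF causal LM works similarly.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```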

Idea: It would be great to have a single instance (in GPU memory) of a base HF transformer model (running with CT2) that you can run with multiple sets of LoRA weights.

Would be curious to hear if you think this could be done in CT2 in a generic way that's applicable to all HF transformer models (just like HF's peft).
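A sketch of why this is feasible: a LoRA adapter only stores two small matrices (A, B) per adapted layer, so one copy of the base weight W can serve many adapters. Shapes and names below are illustrative:

```python
import numpy as np

d, r = 1024, 8                      # hidden size, LoRA rank
W = np.random.randn(d, d)           # shared base weight, loaded once

adapters = {
    "adapter_1": (np.random.randn(r, d), np.zeros((d, r))),  # (A, B)
    "adapter_2": (np.random.randn(r, d), np.zeros((d, r))),  # B starts at zero
}

def lora_forward(x, name, alpha=16):
    A, B = adapters[name]
    # Base projection plus the low-rank update: W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

x = np.random.randn(d)
y1 = lora_forward(x, "adapter_1")   # same W, different adapter per request
y2 = lora_forward(x, "adapter_2")
```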

trannhatquy commented 1 year ago

We have created a script to convert models trained with QLoRA to CTranslate2 to speed up inference: https://github.com/Actable-AI/llm-utils/blob/main/qlora2ct2/convert_qlora2_ct2.py
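The general merge-then-convert flow looks roughly like this (paths are placeholders; see the linked script for the actual QLoRA-specific handling):

```python
import ctranslate2
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("path/to/base-model")
model = PeftModel.from_pretrained(base, "path/to/qlora-adapter")
merged = model.merge_and_unload()  # fold the low-rank update B A into W
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained("path/to/base-model").save_pretrained("merged-model")

# Convert the merged checkpoint to the CTranslate2 format.
ctranslate2.converters.TransformersConverter("merged-model").convert("ct2-model")
```

Note the trade-off: once merged, the adapter is baked in, so switching adapters requires a separate converted model per LoRA.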

SebastianBodza commented 1 year ago

Any plan to support LoRAs directly? Would be great to switch between LoRAs :)
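For reference, this is what adapter switching looks like on the HF side with peft today (adapter names and paths are illustrative); the request is for an equivalent in CT2:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/base-model")
model = PeftModel.from_pretrained(base, "path/to/chat-lora", adapter_name="chat")
model.load_adapter("path/to/code-lora", adapter_name="code")

model.set_adapter("code")   # route requests through the coding adapter
# ... generate ...
model.set_adapter("chat")   # switch without reloading the base weights
```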

krzysiekpodk commented 1 year ago

Big fan of CT2 here as well. Swapping LoRAs would enable the following use case: a coding model (e.g. a top WizardCoder variant) is loaded; in the chat interface we check the intent of the message, and if it's not related to code generation itself, we load a LoRA and run the prompt, as sketched below. Using fine-tuned coding models for other purposes completely breaks their coding abilities, and this approach would let us build a really good internal, universal LLM for developers.
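A hypothetical sketch of that routing logic; `classify_intent`, `use_adapter`, and `generate` are placeholders, not existing CT2 or peft APIs:

```python
def handle_message(engine, prompt):
    if classify_intent(prompt) == "code_generation":
        return engine.generate(prompt)       # base coding model, unmodified
    with engine.use_adapter("general"):      # swap in a general-purpose LoRA
        return engine.generate(prompt)
```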

Jeevi10 commented 5 months ago

Any plan to support LoRA weights directly, without needing to merge them?