OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

QLoRA support? #1251

Open · limcheekin opened this issue 1 year ago

limcheekin commented 1 year ago

Hi there,

Not sure whether this is relevant here.

Is CTranslate2 going to support QLoRA? Please see the following paper for more information: https://arxiv.org/abs/2305.14314

Thanks.

tabacof commented 1 year ago

I had a good experience fine-tuning with QLoRA, but the inference speed makes it unusable in production. If CTranslate2 supported QLoRA, it'd solve a big problem!

limcheekin commented 1 year ago

Really?! I'd appreciate it if you could share more info on the inference speed of models fine-tuned with QLoRA compared to LoRA.

Thanks.

aamir-s18 commented 1 year ago

Out of interest, can't you just merge the adapter weights back into the base model and then use it with CTranslate2?

PEFT supports this, but I haven't tried it yet.
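
If it works, the flow would look something like this (an untested sketch; the model name and adapter path are placeholders, and the base weights are reloaded in fp16 because merging into a 4-bit quantized model isn't supported):

```python
# Untested sketch: merge a QLoRA adapter back into its base model with PEFT.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",  # placeholder: whatever base model was fine-tuned
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base, "path/to/qlora-adapter")  # placeholder
merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights

# Save the merged model (plus tokenizer) so it can be converted like any
# regular Transformers checkpoint.
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained("tiiuae/falcon-7b").save_pretrained("merged-model")
```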

tabacof commented 1 year ago

@limcheekin I didn't compare against traditional LoRA; I just ran QLoRA and managed to get a model that mostly solved my task within a few hours on a regular GPU. However, the tokens/second and cost/token were both worse than with GPT-3.5 Turbo.

limcheekin commented 1 year ago

Hugging Face published a timely blog post on QLoRA fine-tuning for Falcon: https://huggingface.co/blog/falcon

vince62s commented 1 year ago

I prefer this post: https://forum.opennmt.net/t/opennmt-py-v3-2-released-plenty-of-new-features/5366 ;-)

trannhatquy commented 1 year ago

We have created a script to convert models trained with QLoRA to CTranslate2 to speed up inference: https://github.com/Actable-AI/llm-utils/blob/main/qlora2ct2/convert_qlora2_ct2.py
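
For anyone following along, once the adapter is merged into the base model, the standard CTranslate2 conversion and generation flow looks roughly like this (a sketch; paths, device, and sampling parameters are placeholders, and the linked script handles the QLoRA-specific details):

```python
# Sketch of the stock CTranslate2 path once the adapter is merged.
import ctranslate2
import transformers

# Convert the merged Transformers checkpoint, quantizing weights to int8.
ctranslate2.converters.TransformersConverter("merged-model").convert(
    "ct2-model", quantization="int8"
)

generator = ctranslate2.Generator("ct2-model", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("merged-model")

prompt = "Hello, my name is"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([tokens], max_length=64, sampling_topk=10)
print(tokenizer.decode(results[0].sequences_ids[0]))
```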