limcheekin opened 1 year ago

Hi there,

Is CTranslate2 going to support QLoRA? Please see the following paper for more information: https://arxiv.org/abs/2305.14314

Thanks.
I had a good experience fine-tuning with QLoRA, but the inference speed makes it unusable in production. If CTranslate2 supported QLoRA, it'd solve a big problem!
Really?! I'd appreciate it if you could share more info on the inference speed of models fine-tuned with QLoRA compared to LoRA.

Thanks.
Out of interest, can't you just merge the adapter weights back into the base model and then use it with CTranslate2?

PEFT supports this, but I haven't tried it yet.
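For reference, the merge step typically looks something like the sketch below. The paths ("base-model", "qlora-adapter", "merged-model") are placeholders, and one caveat with QLoRA specifically: the base model is loaded in 4-bit during training, so you typically reload it in full/half precision before merging rather than merging into the quantized weights.

```python
# Sketch: fold a (Q)LoRA adapter back into its base model with PEFT,
# then save the merged model so it can be converted to CTranslate2.
# All paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Reload the base model in fp16 (not 4-bit) so the adapter can be merged.
base = AutoModelForCausalLM.from_pretrained(
    "base-model", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "qlora-adapter")

# merge_and_unload() adds the low-rank updates into the base weights
# and returns a plain Transformers model with no PEFT wrappers.
merged = model.merge_and_unload()
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained("base-model").save_pretrained("merged-model")
```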
@limcheekin I didn't compare against traditional LoRA; I just ran QLoRA and managed to get a model that mostly solved a task in a few hours on a regular GPU. However, the tokens/second and cost/token were both worse than with GPT-3.5 Turbo.
Hugging Face published a timely blog post for QLoRA on Falcon: https://huggingface.co/blog/falcon
I prefer this post: https://forum.opennmt.net/t/opennmt-py-v3-2-released-plenty-of-new-features/5366 ;-)

Not sure how it is relevant here.

We have created a script to convert models trained with QLoRA to CTranslate2 to speed up inference: https://github.com/Actable-AI/llm-utils/blob/main/qlora2ct2/convert_qlora2_ct2.py
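I haven't reproduced the linked script's exact interface, but once the adapter is merged into a full-precision model (as sketched above), the core flow can also be done with CTranslate2's built-in Transformers converter. In the sketch below, the paths and the int8 quantization choice are assumptions, not the script's actual defaults:

```python
# Sketch: convert the merged Transformers model to CTranslate2,
# then run generation with the converted model.
# "merged-model" and "ct2-model" are placeholder paths.
import ctranslate2
from transformers import AutoTokenizer

# Convert the merged model; int8 quantization is one of several options.
converter = ctranslate2.converters.TransformersConverter("merged-model")
converter.convert("ct2-model", quantization="int8")

# Load the converted model and generate from a prompt.
generator = ctranslate2.Generator("ct2-model", device="cuda")
tokenizer = AutoTokenizer.from_pretrained("merged-model")

prompt = "Hello, how are you?"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([tokens], max_length=64)
print(tokenizer.decode(results[0].sequences_ids[0]))
```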