We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights, (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.
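To make the memory claim concrete, the following is a rough back-of-the-envelope sketch of how 4-bit storage and quantizing the quantization constants affect the footprint of the frozen base weights. The abstract does not give block sizes or constant precisions, so the values used here (blocks of 64 weights, 32-bit constants, 8-bit second-level constants in groups of 256) are illustrative assumptions, and the helper names `bits_per_param` and `model_gigabytes` are hypothetical. Only the base weights are counted; adapters, activations, and optimizer state are left out.

```python
def bits_per_param(block_size, const_bits, second_block_size=None, second_const_bits=32):
    """Average storage per weight, in bits, for 4-bit blockwise quantization.

    All parameter values passed in below are illustrative assumptions,
    not figures taken from the abstract.
    """
    weight_bits = 4.0
    # One quantization constant of `const_bits` is stored per block of weights.
    overhead = const_bits / block_size
    if second_block_size is not None:
        # Double quantization: the first-level constants are themselves
        # quantized, so each group of `second_block_size` constants shares
        # one extra constant of `second_const_bits`.
        overhead += second_const_bits / (block_size * second_block_size)
    return weight_bits + overhead


def model_gigabytes(n_params, bits):
    """Storage for `n_params` weights at `bits` bits each, in GB (10^9 bytes)."""
    return n_params * bits / 8 / 1e9


N_PARAMS = 65e9  # the 65B parameter model mentioned in the abstract

fp16_gb = model_gigabytes(N_PARAMS, 16)
plain_bits = bits_per_param(block_size=64, const_bits=32)
double_bits = bits_per_param(block_size=64, const_bits=8,
                             second_block_size=256, second_const_bits=32)

print(f"16-bit weights:              {fp16_gb:.1f} GB")
print(f"4-bit, 32-bit constants:     {model_gigabytes(N_PARAMS, plain_bits):.1f} GB "
      f"({plain_bits:.3f} bits/param)")
print(f"4-bit, double quantization:  {model_gigabytes(N_PARAMS, double_bits):.1f} GB "
      f"({double_bits:.3f} bits/param)")
```

Under these assumed settings, the 16-bit weights alone exceed a 48GB GPU, while the 4-bit variants leave headroom, and quantizing the constants recovers a few more gigabytes at this scale.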
URL
Affiliations
Abstract
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)