DoRA: Weight-Decomposed Low-Rank Adaptation
[ICML2024 (Oral)]
The official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation [ICML2024 (Oral, acceptance rate: 1.5%)].
Shih-Yang Liu*, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Min-Hung Chen
(*Work done during the internship at NVIDIA Research)
[Paper] [Website] [NV Blog] [BibTeX]
DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to minimize the number of trainable parameters efficiently. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding.
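The decomposition described above can be illustrated with a small, dependency-free sketch. It assumes the formulation W' = m * (W0 + BA) / ||W0 + BA||_c, where ||.||_c is the per-column L2 norm, m is the trainable magnitude vector, and B, A are the LoRA factors that update the direction. The helper names (`column_norms`, `matmul`, `dora_weight`) are hypothetical and are not part of this repository or of PEFT; real training code would use PyTorch tensors instead of nested lists.

```python
import math

def column_norms(W):
    """Return the L2 norm of each column of a matrix stored as a list of rows."""
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(len(W[0]))]

def matmul(X, Y):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def dora_weight(W0, B, A, m):
    """Merge magnitude m with the normalized direction of (W0 + B A)."""
    delta = matmul(B, A)  # low-rank directional update
    V = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W0, delta)]
    norms = column_norms(V)
    # Rescale every column to unit length, then apply its learned magnitude.
    return [[m[j] * v / norms[j] for j, v in enumerate(row)] for row in V]

W0 = [[3.0, 0.0], [4.0, 2.0]]  # frozen pre-trained weight (toy 2 x 2 example)
A = [[0.5, -0.5]]              # LoRA factor A, rank 1
B = [[0.0], [0.0]]             # LoRA factor B, zero at initialization
m = column_norms(W0)           # magnitude initialized from the column norms of W0

W = dora_weight(W0, B, A, m)   # equals W0 while B is zero

B_trained = [[1.0], [0.0]]     # pretend B has moved during training
W_trained = dora_weight(W0, B_trained, A, m)
print(column_norms(W_trained)) # column norms stay equal to m
```

Because B starts at zero, the merged weight equals the pre-trained weight at initialization, and after training the merged matrix can be stored as a single weight, which is why no extra inference cost is incurred.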
For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing.
DoRA is now supported by the Hugging Face PEFT package. You can install the PEFT package using:
pip install git+https://github.com/huggingface/peft.git -q
After PEFT is installed, you can simply set the use_dora argument of LoraConfig() to True to apply DoRA.
For example:
from peft import LoraConfig

# Initialize DoRA configuration
config = LoraConfig(
    use_dora=True, ...
)
Please refer to the official documentation for more details.
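For readers who want a fuller starting point, the following is a minimal sketch of a DoRA configuration with the remaining arguments spelled out. The values for `r`, `lora_alpha`, `lora_dropout`, and `target_modules` are illustrative assumptions, not settings recommended by the authors; the module names in particular depend on the model being adapted.

```python
from peft import LoraConfig

# Illustrative values only: adjust rank, alpha, dropout, and target modules
# to match your model and task.
config = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed dropout ratio
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names
    use_dora=True,                        # switch the adapter from LoRA to DoRA
)
```

The resulting `config` can then be passed to `peft.get_peft_model` together with a loaded base model, in the same way as a regular LoRA configuration.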
You can also try DoRA for fine-tuning diffusion models; see huggingface/diffusers. Another good tutorial is this Colab notebook from Linoy Tsaban.
In general, DoRA fine-tuning on diffusion models is still experimental and is likely to require different hyperparameter values than LoRA to perform best.
Specifically, people have noticed two differences to take into account in your training:
- LoRA seems to converge faster than DoRA (so a set of parameters that may lead to overfitting when training a LoRA may work well for a DoRA).
- DoRA quality is superior to LoRA, especially at lower ranks: the difference in quality between DoRA and LoRA at rank 8 appears to be more significant than at ranks of 32 or 64, for example.
Example from Linoy Tsaban (images generated by DoRA are on the left and LoRA on the right):
Example from merve:
[!NOTE] 💡 Fine-tuning with DoRA using an existing LoRA configuration already achieves better results than LoRA most of the time, but reaching optimal performance still requires adjusting the hyperparameters.
We suggest starting with a slightly lower learning rate than that of LoRA, and users may also experiment with varying LoRA dropout ratios.
Users may also start with half the rank of the LoRA configuration, which often already results in comparable or even superior accuracy compared to LoRA.
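The suggestions above can be turned into a small helper that derives a first DoRA trial from a tuned LoRA setup. This is only a sketch: `AdapterHParams` and `dora_starting_point` are hypothetical names, and the default `lr_scale=0.8` is an assumed placeholder for "slightly lower", not a value given by the authors.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AdapterHParams:
    """Hypothetical container for adapter hyperparameters (not a PEFT class)."""
    rank: int
    learning_rate: float
    dropout: float

def dora_starting_point(lora: AdapterHParams, lr_scale: float = 0.8) -> AdapterHParams:
    """Derive a first DoRA trial from a LoRA setup.

    Halves the rank and scales the learning rate down; lr_scale is an
    assumed stand-in for "slightly lower" and should be tuned.
    """
    if not 0.0 < lr_scale <= 1.0:
        raise ValueError("lr_scale must be in (0, 1]")
    return replace(
        lora,
        rank=max(1, lora.rank // 2),
        learning_rate=lora.learning_rate * lr_scale,
    )

lora = AdapterHParams(rank=32, learning_rate=2e-4, dropout=0.05)
dora = dora_starting_point(lora)
print(dora)
```

The dropout ratio is carried over unchanged here; it is a natural next knob to vary once the rank and learning rate have been explored.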
This repo contains four directories:
./commonsense_reasoning contains the code to finetune LLaMA-7B/13B using DoRA on the commonsense reasoning tasks. This directory is modified based on LLM-Adapter.
./instruction_tuning_dvora contains the code to finetune LLaMA-7B and LLaMA2-7B using DoRA and DVoRA (DoRA+VeRA) with the cleaned Alpaca instruction tuning dataset. This directory is modified based on VeRA.
./image_video_text_understanding contains the code to finetune VL-BART using DoRA for the image/video-text understanding tasks. This directory is modified based on VL-Adapter.
./visual_instruction_tuning contains the code to finetune LLaVA-1.5-7B on the visual instruction tuning tasks with DoRA. This directory is modified based on LLaVA.
Model | r | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Average |
---|---|---|---|---|---|---|---|---|---|---|
LLaMA-7B-LoRA | 32 | 67.5 | 80.8 | 78.2 | 83.4 | 80.4 | 78.0 | 62.6 | 79.1 | 76.3 |
LLaMA-7B-DoRA(ours) | 16 | 70.0 | 82.6 | 79.7 | 83.2 | 80.6 | 80.6 | 65.4 | 77.6 | 77.5 |
LLaMA-7B-DoRA(ours) | 32 | 69.7 | 83.4 | 78.6 | 87.2 | 81.0 | 81.9 | 66.2 | 79.2 | 78.4 |
LLaMA2-7B-LoRA | 32 | 69.8 | 79.9 | 79.5 | 83.6 | 82.6 | 79.8 | 64.7 | 81.0 | 77.6 |
LLaMA2-7B-DoRA(ours) | 16 | 72.0 | 83.1 | 79.9 | 89.1 | 83.0 | 84.5 | 71.0 | 81.2 | 80.5 |
LLaMA2-7B-DoRA(ours) | 32 | 71.8 | 83.7 | 76.0 | 89.1 | 82.6 | 83.7 | 68.2 | 82.4 | 79.7 |
LLaMA3-8B-LoRA | 32 | 70.8 | 85.2 | 79.9 | 91.7 | 84.3 | 84.2 | 71.2 | 79.0 | 80.8 |
LLaMA3-8B-DoRA(ours) | 16 | 74.5 | 88.8 | 80.3 | 95.5 | 84.7 | 90.1 | 79.1 | 87.2 | 85.0 |
LLaMA3-8B-DoRA(ours) | 32 | 74.6 | 89.3 | 79.9 | 95.5 | 85.6 | 90.5 | 80.4 | 85.8 | 85.2 |
Shih-Yang Liu: shihyangl@nvidia.com or sliuau@connect.ust.hk
If you find DoRA useful, please consider giving a star and citation:
@article{liu2024dora,
title={DoRA: Weight-Decomposed Low-Rank Adaptation},
author={Liu, Shih-Yang and Wang, Chien-Yi and Yin, Hongxu and Molchanov, Pavlo and Wang, Yu-Chiang Frank and Cheng, Kwang-Ting and Chen, Min-Hung},
journal={arXiv preprint arXiv:2402.09353},
year={2024}
}
Copyright © 2024, NVIDIA Corporation. All rights reserved.
This work is made available under the NVIDIA Source Code License-NC. Click here to view a copy of this license.