
LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS #42

5g4s opened 1 year ago

5g4s commented 1 year ago

https://arxiv.org/abs/2106.09685

5g4s commented 1 year ago

As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive.

We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
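As a rough sense of the savings (my own back-of-the-envelope arithmetic, assuming GPT-3's hidden size $d_{model} = 12288$): a single $d_{model} \times d_{model}$ attention weight matrix holds $12288^2 \approx 151$M parameters, while a rank-$r = 4$ LoRA pair $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$ adds only $2 \cdot 12288 \cdot 4 \approx 98$K trainable parameters for that matrix, roughly a 1500× reduction.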

5g4s commented 1 year ago

Problem

The major downside of fine-tuning is that the new model contains as many parameters as in the original model.

Existing techniques often introduce inference latency (Houlsby et al., 2019; Rebuffi et al., 2017) by extending model depth, or reduce the model’s usable sequence length (Li & Liang, 2021; Lester et al., 2021; Hambardzumyan et al., 2020; Liu et al., 2021) (Section 3). More importantly, these methods often fail to match the fine-tuning baselines, posing a trade-off between efficiency and model quality.

5g4s commented 1 year ago

We see a noticeable increase in latency when using adapters, even with a very small bottleneck dimension.

5g4s commented 1 year ago

Approach

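Sketching the reparameterization from Section 4.1 of the paper: for a pre-trained weight $W_{0} \in \mathbb{R}^{d \times k}$, LoRA keeps $W_{0}$ frozen and constrains its update $\Delta W$ to a low-rank product, so the forward pass becomes

$$h = W_{0}x + \Delta W x = W_{0}x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k).$$

$A$ is initialized with a random Gaussian and $B$ with zeros, so $\Delta W = BA$ is zero at the start of training; $\alpha / r$ is a constant scaling on the update. Only $A$ and $B$ receive gradients, and after training $BA$ can be merged into $W_{0}$, so LoRA introduces no additional inference latency.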

5g4s commented 1 year ago

We limit our study to adapting only the attention weights for downstream tasks and freeze the MLP modules (so they are not trained), both for simplicity and parameter-efficiency.
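A minimal PyTorch sketch of this setup (not the authors' released code; the `LoRALinear` wrapper and the `q_proj`/`v_proj` module names are my own assumptions about how the attention projections are exposed):

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # freeze W_0 (and bias)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # Gaussian init for A
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init for B -> delta W = 0 at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W_0 x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


def add_lora_to_attention(model: nn.Module, r: int = 4, alpha: int = 8) -> nn.Module:
    """Freeze every weight, then wrap only the query/value projections with LoRA."""
    for p in model.parameters():
        p.requires_grad_(False)
    # Module names q_proj / v_proj are an assumption about the model's attention layout.
    targets = [
        name for name, m in model.named_modules()
        if isinstance(m, nn.Linear) and name.split(".")[-1] in ("q_proj", "v_proj")
    ]
    for name in targets:
        parent = model.get_submodule(name.rsplit(".", 1)[0]) if "." in name else model
        child_name = name.split(".")[-1]
        setattr(parent, child_name, LoRALinear(getattr(parent, child_name), r=r, alpha=alpha))
    return model
```

With this setup only `lora_A` and `lora_B` require gradients, so optimizer state scales with the adapters rather than with the full base model, and the product `B @ A` can be merged back into the frozen weight for deployment.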

5g4s commented 1 year ago

Table 6 shows that, surprisingly, LoRA already performs competitively with a very small r (more so for { $W_{q}$, $W_{v}$ } than just $W_{q}$).
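If I read Section 4.1 correctly, the trainable-parameter count is $|\Theta| = 2 \times \hat{L}_{LoRA} \times d_{model} \times r$, where $\hat{L}_{LoRA}$ is the number of adapted weight matrices. As a check on how small these adapters are (my own arithmetic): adapting both $W_{q}$ and $W_{v}$ in all 96 layers of GPT-3 175B gives $\hat{L}_{LoRA} = 192$, so $r = 4$ amounts to $2 \times 192 \times 12288 \times 4 \approx 18.9$M trainable parameters, a vanishing fraction of the 175B total.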