Outsider565 / LoRA-GA


LoRA-GA Tuning Issue: Task Loss Not Working as Expected #5

Closed fclearner closed 1 month ago

fclearner commented 2 months ago

Hi, thanks for the excellent work on LoRA-GA. I am experiencing an issue while using LoRA-GA for model training: the task loss is not decreasing as expected. I would like to seek any advice or tuning tips that might help improve this situation.

Are there recommended parameter settings or tuning strategies?

Current Parameter Settings:

init_batch_size: 2
init_iters: 4
init_config:
  mode: "gradient"  # option: "simple", "svd", "gradient"
  lora_A: "unit"  # option: "gaussian", "kaiming", "fan_out_kaiming", "xavier", "zeros", "unit", "orthogonal"
  lora_A_std: 0.01  # only needed when lora_A is "gaussian"
  lora_B: "unit"  # option: "gaussian", "kaiming", "fan_out_kaiming", "xavier", "zeros", "unit", "orthogonal"
  lora_B_std: 0.01  # only needed when lora_B is "gaussian"
  scale: "stable"  # option: "default", "stable", "unit", "normalized", "gd", "weightS"
  stable_gamma: 64  # only needed when scale is "stable"
  direction: "ArB2r"  # option: "ArBr", "A2rBr", "ArB2r"(only needed when mode is "gradient")
  dtype: "fp32"  # option: "bf16", "fp32"
  norm_clip: false  # norm clipping
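
For reference, my rough understanding of what mode: "gradient" with direction: "ArB2r" and scale: "stable" does at initialization, written as a minimal sketch assuming it follows the LoRA-GA paper's recipe (the function name and the exact stable_gamma scaling below are illustrative, not the repo's actual code):

import torch

def gradient_init_sketch(grad: torch.Tensor, rank: int, stable_gamma: float):
    """Illustrative LoRA-GA-style init from a sampled gradient of a frozen
    linear weight of shape (out_features, in_features)."""
    U, S, Vh = torch.linalg.svd(grad.float(), full_matrices=False)
    # direction "ArB2r": A takes the top-r right singular vectors,
    # B takes left singular vectors r..2r, so B @ A is (near) zero at init
    # while both factors align with the dominant gradient directions.
    lora_A = Vh[:rank, :]               # (rank, in_features)
    lora_B = U[:, rank:2 * rank]        # (out_features, rank)
    # "stable" scale (assumed form): both factors shrink as stable_gamma grows,
    # so a smaller stable_gamma gives a larger effective initial step.
    lora_A = lora_A / stable_gamma ** 0.5
    lora_B = lora_B / stable_gamma ** 0.5
    return lora_A, lora_B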

These are my loss scalars:

[screenshot: loss curve]

It would be very helpful if you could offer some suggestions. Thanks!

fclearner commented 2 months ago

By the way, I'm not using PEFT; this is my code: https://github.com/wenet-e2e/wenet/pull/2606

fclearner commented 2 months ago

It seems that reducing stable_gamma helps:

[screenshot: updated loss curve]
Outsider565 commented 2 months ago

Thank you for using LoRA-GA in your project! To figure out what causes the slow convergence, I have a few questions about your speech model, as I'm not very familiar with it:

  1. Layer Application: In the speech model, is LoRA applied to every layer? In LLMs, most layers are linear, allowing LoRA to effectively approximate the full model, leading to convergence similar to full fine-tuning. However, in other domains like Stable Diffusion, LoRA is only applied to the cross-attention layers (not the Conv Layer), which creates a performance gap. How does this apply to your speech model?
  2. LoRA Hyperparameters: Could you clarify the values for LoRA alpha, LoRA rank, and the dimensions of the linear layers? I understand that a smaller dimension in the linear layer should correspond to a smaller stable gamma. For context, in LLMs, the dimensions can be quite large (e.g., 4096). What about your model?
  3. Training Dataset Size: What is the size of your training dataset, and what level of variance does it have? If your dataset has high variance, consider increasing the sampled batch size (init_batch_size * init_iters). You can estimate the variance by analyzing the statistics of the sampled gradients, e.g. with the script below.
    
import torch
import pandas as pd
from tqdm import tqdm


def evaluate_sign_similarity(mat1: torch.Tensor, mat2: torch.Tensor):
    """mat1 and mat2 are two matrices of the same size."""
    assert mat1.size() == mat2.size()
    sign_similarity = (torch.sign(mat1) == torch.sign(mat2)).sum().item() / mat1.numel()
    return sign_similarity


def evaluate_magnitude_similarity(mat1: torch.Tensor, mat2: torch.Tensor, threshold=1):
    """mat1 and mat2 are two matrices of the same size."""
    assert mat1.size() == mat2.size()
    log10_diff = torch.abs(torch.log10(torch.abs(mat1) + 1e-8) - torch.log10(torch.abs(mat2) + 1e-8))
    return (log10_diff < threshold).sum().item() / mat1.numel()


def evaluate_similarity(grad1: dict, grad2: dict, multiplier=0) -> pd.DataFrame:
    """grad1 is assumed to contain `multiplier` times the data of grad2, so the
    residual of grad1 (the part not shared with grad2) is
    (grad1 * multiplier - grad2) / (multiplier - 1)."""
    results = []
    for key in tqdm(grad1.keys()):
        if multiplier != 0:
            grad1_residual = (grad1[key] * multiplier - grad2[key]) / (multiplier - 1)
        else:
            # with multiplier == 0, compare grad1 and grad2 directly
            grad1_residual = grad1[key]
        sign_similarity = evaluate_sign_similarity(grad1_residual, grad2[key])
        magnitude_similarity = evaluate_magnitude_similarity(grad1_residual, grad2[key])
        results.append([key, sign_similarity, magnitude_similarity])
    results = pd.DataFrame(results, columns=["layer", "sign_similarity", "magnitude_similarity"])
    return results
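
And a hedged sketch of how the gradient dictionaries fed to evaluate_similarity could be collected (sample_grads is a hypothetical helper, not part of the LoRA-GA repo; it assumes a deterministic dataloader so the larger sample contains the smaller one, and a model whose forward returns an object with a .loss field):

def sample_grads(model, dataloader, num_batches):
    """Average full-parameter gradients over the first num_batches batches."""
    model.zero_grad()
    for i, batch in enumerate(dataloader):
        if i >= num_batches:
            break
        loss = model(**batch).loss  # adapt to your model's forward/loss API
        loss.backward()
    grads = {name: p.grad.detach().clone() / num_batches
             for name, p in model.named_parameters() if p.grad is not None}
    model.zero_grad()
    return grads

# grad_big averages twice as many batches as grad_small and (with a
# deterministic loader) contains them, hence multiplier=2:
# grad_small = sample_grads(model, loader, 8)
# grad_big = sample_grads(model, loader, 16)
# df = evaluate_similarity(grad_big, grad_small, multiplier=2)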

I used the script above to evaluate the gradient variance of LLMs; you can compare the following results with your model:

Sampled batch size  | 8 | 16 | 32 | 64 | 128 | 256
-- | -- | -- | -- | -- | -- | --
Sign similarity | 0.743 | 0.790 | 0.838 | 0.875 | 0.903 | 0.925
Magnitude similarity | 0.878 | 0.908 | 0.933 | 0.950 | 0.962 | 0.971
fclearner commented 2 months ago

Thank you for your detailed response and for the insightful questions regarding my speech model configuration. Here are the specifics:

  1. Layer Application: In my case, LoRA-GA is applied to the attention layers, specifically the cross-attention and self-attention components (q, k, v, o). This is similar to how it's applied in LLMs, where most layers are linear. However, unlike in LLMs, I haven't applied it to every layer, focusing instead on the attention mechanisms.
  2. LoRA Hyperparameters: LoRA alpha: 8, LoRA rank: 8, linear layer dimension: 512. This setup is smaller than typical LLM configurations, which can have dimensions as large as 4096. I experimented with a smaller stable gamma, like 2, which seems to work well.
  3. Training Dataset Size: The training dataset size is determined by init_batch_size * init_iters, which I've increased to 2 * 8. This change has positively impacted the model's performance, as evidenced by the loss curve:

[screenshot: loss curve]

Regarding the evaluation of variance, I appreciate the script you've shared for assessing sign and magnitude similarity. I will consider applying this approach to better understand the variance in my model's gradients and to potentially adjust the sampled batch size accordingly.

In conclusion, I agree with your suggestion that increasing the number of LoRA layers might be beneficial. This could help in capturing more complex patterns, thereby enhancing model performance. I'll continue to experiment with these configurations and share any further insights.

Thank you again for your support and guidance!

Outsider565 commented 2 months ago
  1. Maybe you should try applying LoRA to all linear layers. LoRA-GA with LoRA applied only to the attention layers should perform similarly to full fine-tuning the model while freezing all non-attention layers, which naturally falls behind full fine-tuning (see the sketch after this list).
  2. Regarding the training dataset size, I mean the size of the whole dataset. If the dataset is small (similar to the few-shot regime, like 32 or 64 examples), LoRA might not be as effective as full fine-tuning.
  3. For the sampled batch size, I'm glad to see that you got a performance improvement after increasing it. If your GPU has enough memory, I recommend increasing init_batch_size rather than init_iters, which should accelerate the initialization process. In my experience, a sampled batch size of 32 to 128 gives the best performance.
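
For item 1, a minimal generic sketch of wrapping every nn.Linear with a LoRA adapter (illustrative only: LoRALinear and add_lora_to_all_linear are hypothetical names, not the wenet PR's or the LoRA-GA repo's API, and the init shown is plain Kaiming rather than the gradient-based init):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper around a frozen nn.Linear (illustrative only)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.scaling = alpha / rank
        self.lora_A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.kaiming_uniform_(self.lora_A)  # replace with LoRA-GA's gradient init

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

def add_lora_to_all_linear(model: nn.Module, rank: int = 8, alpha: int = 8):
    """Replace every nn.Linear submodule with a LoRA-wrapped version."""
    for module in list(model.modules()):
        for child_name, child in list(module.named_children()):
            if isinstance(child, nn.Linear):
                setattr(module, child_name, LoRALinear(child, rank, alpha))
    return model
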
fclearner commented 1 month ago

Thanks for the help!