By the way, I'm not using peft; here is my code: https://github.com/wenet-e2e/wenet/pull/2606
It seems that reducing stable_gamma helps:
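For context, here is a rough sketch of how stable_gamma typically enters LoRA-GA's gradient-SVD initialization. This is only an illustration based on my understanding of the paper's stable-output scaling, not the actual code in the wenet PR, and the exact scaling convention may differ:

```python
import torch

def lora_ga_init_sketch(avg_grad: torch.Tensor, rank: int, stable_gamma: float):
    """Illustrative LoRA-GA-style init for one linear layer.

    avg_grad: averaged full-parameter gradient, shape (d_out, d_in).
    Returns (lora_B, lora_A) with shapes (d_out, rank) and (rank, d_in).
    """
    d_out, d_in = avg_grad.shape
    # The top singular directions of the averaged gradient seed the adapters.
    U, S, Vh = torch.linalg.svd(avg_grad.float(), full_matrices=False)
    lora_B = U[:, rank:2 * rank]   # (d_out, rank)
    lora_A = Vh[:rank, :]          # (rank, d_in)
    # "Stable" scaling divides both factors by sqrt(stable_gamma) (times a
    # dimension-dependent factor), so a smaller stable_gamma yields a larger
    # effective initialization of the adapter update.
    scale = d_out ** 0.25 / stable_gamma ** 0.5
    return lora_B * scale, lora_A * scale
```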
Thank you for using LoRA-GA in your program! To figure out what causes the slow convergence problem, I have a few questions regarding your speech model, as I'm not very familiar with it (see the list after the script and table below):
```python
import torch
import pandas as pd
from tqdm import tqdm

def evaluate_sign_similarity(mat1: torch.Tensor, mat2: torch.Tensor):
    """Fraction of entries in two same-sized matrices whose signs agree."""
    assert mat1.size() == mat2.size()
    return (torch.sign(mat1) == torch.sign(mat2)).sum().item() / mat1.numel()

def evaluate_magnitude_similarity(mat1: torch.Tensor, mat2: torch.Tensor, threshold=1):
    """Fraction of entries whose magnitudes agree within `threshold` orders of magnitude."""
    assert mat1.size() == mat2.size()
    log10_diff = torch.abs(torch.log10(torch.abs(mat1) + 1e-8) - torch.log10(torch.abs(mat2) + 1e-8))
    return (log10_diff < threshold).sum().item() / mat1.numel()

def evaluate_similarity(grad1: dict, grad2: dict, multiplier=0) -> pd.DataFrame:
    results = []
    for key in tqdm(grad1.keys()):
        if multiplier != 0:
            # If grad1 averages `multiplier` times as many samples as grad2 (a superset),
            # this recovers the gradient estimated from the remaining, disjoint samples.
            grad1_residual = (grad1[key] * multiplier - grad2[key]) / (multiplier - 1)
        else:
            grad1_residual = grad1[key]
        sign_similarity = evaluate_sign_similarity(grad1_residual, grad2[key])
        magnitude_similarity = evaluate_magnitude_similarity(grad1_residual, grad2[key])
        results.append([key, sign_similarity, magnitude_similarity])
    return pd.DataFrame(results, columns=["layer", "sign_similarity", "magnitude_similarity"])
```
I used the script above to evaluate the gradient-estimate variance of LLMs; you can compare the following results with your model:
Sampled batch size | 8 | 16 | 32 | 64 | 128 | 256
-- | -- | -- | -- | -- | -- | --
Sign similarity | 0.743 | 0.790 | 0.838 | 0.875 | 0.903 | 0.925
Magnitude similarity | 0.878 | 0.908 | 0.933 | 0.950 | 0.962 | 0.971
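For example, you could produce the two gradient dictionaries and run the comparison roughly as follows. Here `collect_avg_grads` and the `model`/`loader`/`loss_fn` objects are hypothetical placeholders for your own training code, not part of the script above:

```python
import torch

def collect_avg_grads(model, loader, loss_fn, num_batches):
    """Average full-parameter gradients over the first `num_batches` batches."""
    grads = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    data_iter = iter(loader)
    for _ in range(num_batches):
        model.zero_grad()
        feats, labels = next(data_iter)
        loss_fn(model(feats), labels).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                grads[n] += p.grad.detach() / num_batches
    return grads

# With shuffling disabled, the 8-batch estimate uses a subset of the samples in
# the 16-batch estimate, so multiplier=2 lets evaluate_similarity compare two
# (nearly) independent gradient estimates per layer.
grad_16 = collect_avg_grads(model, loader, loss_fn, num_batches=16)
grad_8 = collect_avg_grads(model, loader, loss_fn, num_batches=8)
df = evaluate_similarity(grad_16, grad_8, multiplier=2)
print(df.sort_values("sign_similarity"))
```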
- Layer Application: which layers do you apply LoRA-GA to, and are they mostly linear layers as in LLMs?
- Hyperparameters: what are your LoRA rank, LoRA alpha, and the dimensions of the linear layers?
- Training Dataset Size: how much data is sampled to estimate the gradients for initialization (init_batch_size × init_iters)?
Thank you for your detailed response and for the insightful questions regarding my speech model configuration. Here are the specifics:

1. Layer Application: In my case, LoRA-GA is applied to the attention layers, specifically the cross-attention and self-attention components (q, k, v, o). This is similar to how it's applied in LLMs, where most layers are linear. However, unlike in LLMs, I haven't applied it to every layer, focusing instead on the attention mechanisms.
2. Hyperparameters: LoRA alpha: 8, LoRA rank: 8, dimensions of the linear layers: 512. This setup is smaller than typical LLM configurations, whose dimensions can be as large as 4096. I experimented with a smaller stable gamma, such as 2, which seems to work well.
3. Training Dataset Size: The amount of data sampled for initialization is determined by init_batch_size × init_iters, which I've increased from 1 × 4 to 2 × 8. This change has positively impacted the model's performance, as evidenced by the loss curve:
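To make the setup above concrete, it could be summarized roughly as follows. The option names here are illustrative placeholders and do not match the actual arguments in the wenet PR:

```python
# Illustrative summary of the configuration described above (placeholder names).
lora_ga_setup = dict(
    # 1. Applied only to the self-/cross-attention projections (q, k, v, o).
    target_modules=["linear_q", "linear_k", "linear_v", "linear_out"],
    # 2. Small model (d_model = 512), so modest rank/alpha and a smaller stable gamma.
    lora_rank=8,
    lora_alpha=8,
    stable_gamma=2,
    # 3. Gradient-estimation budget for the init: init_batch_size * init_iters = 2 * 8.
    init_batch_size=2,
    init_iters=8,
)
```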
Regarding the evaluation of variance, I appreciate the script you've shared for assessing sign and magnitude similarity. I will consider applying this approach to better understand the variance in my model's gradients and to potentially adjust the sampled batch size accordingly.
In conclusion, I agree with your suggestion that increasing the number of LoRA layers might be beneficial. This could help in capturing more complex patterns, thereby enhancing model performance. I'll continue to experiment with these configurations and share any further insights.
Thank you again for your support and guidance!
thanks for the help
Hi, thanks for the excellent work on LoRA-GA! I am experiencing an issue while using LoRA-GA for model training: the task loss is not decreasing as expected. I would appreciate any advice or tuning tips that might help improve this situation.
Are there recommended parameter settings or tuning strategies?
Current Parameter Settings:
These are my loss scalars:
It would be very helpful if you could offer some suggestions. Thanks!