Outsider565 / LoRA-GA


performance on vision models like vit or stable diffusion #1

Open zhch-sun opened 1 month ago

zhch-sun commented 1 month ago

Thanks for your awesome work!
I was wondering if you have any results on vision models like ViT or Stable Diffusion?

Outsider565 commented 1 month ago

Thank you for your interest in our work!

We have conducted preliminary experiments with Stable Diffusion 1.5 on COCO and style transfer datasets. While we haven't explored ViT yet, our findings with Stable Diffusion indicate that LoRA-GA converges significantly faster. However, the FID metric shows only a marginal improvement over standard LoRA. This could be attributed to the fact that both methods converge well after training for 50 epochs, with LoRA-GA demonstrating substantial improvement in the initial few epochs.

We will be updating our results in the next version on arXiv, so stay tuned for more detailed insights.
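
For reference, an FID comparison like the one above can be computed with an off-the-shelf implementation such as `torchmetrics`. The sketch below is only illustrative (placeholder image tensors, not our evaluation pipeline):

```python
# Illustrative sketch only: comparing FID of samples from two fine-tuned models
# against a shared set of reference images. Tensors here are random placeholders.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_fid(real_images: torch.Tensor, fake_images: torch.Tensor) -> float:
    """real_images / fake_images: uint8 tensors of shape (N, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_images, real=True)
    fid.update(fake_images, real=False)
    return fid.compute().item()

# Placeholder batches standing in for reference images and model samples.
real = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
samples_lora = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
samples_lora_ga = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

print("FID (LoRA):   ", compute_fid(real, samples_lora))
print("FID (LoRA-GA):", compute_fid(real, samples_lora_ga))
```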

zhch-sun commented 1 month ago

I don't know if you have experimented with this, but when fine-tuning an SD LoRA on only a few images, would LoRA-GA perform better in that setting?

Outsider565 commented 1 month ago

> We have conducted preliminary experiments with Stable Diffusion 1.5 on COCO and style transfer datasets

The style transfer dataset we tried has 32-64 images per class. LoRA-GA performs better in the first 10-20 epochs (w.r.t. training loss and generated image quality), but after that, both methods seem similar, especially after 100 epochs.

| | LLM | SD style transfer |
| --- | --- | --- |
| Training Epochs | Mostly 1 | 50 or more |
| Training Data | 100k samples or 100M+ tokens | 32-64 images |

Here's my thought: the SD style transfer tasks are much easier than the LLM ones. Tuning on a few images for many epochs gives LoRA many chances to gradually optimize (in the first few epochs, standard LoRA may effectively be "searching for a good initialization"), while the LLM tasks give only one pass. As a result, even with a suboptimal initialization, standard LoRA can still converge to a good optimum after training for dozens of epochs.
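
For concreteness, here is a minimal sketch of the few-image SD LoRA setup being discussed, using the standard diffusers + peft stack with the default LoRA initialization. LoRA-GA would differ only in how the adapter weights are initialized; the rank, learning rate, and target modules below are illustrative assumptions, not our paper's configuration.

```python
# Minimal sketch (illustrative hyperparameters): attach LoRA adapters to the
# SD 1.5 UNet and train only the adapter parameters on a small style dataset.
import torch
from diffusers import StableDiffusionPipeline
from peft import LoraConfig

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Low-rank adapters on the UNet's attention projections (standard LoRA init;
# LoRA-GA would change only the initialization of these adapter weights).
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
pipe.unet.add_adapter(lora_cfg)

# Only the adapter parameters are trainable; a 32-64 image style dataset would
# then be trained for 50-100 epochs with the usual noise-prediction objective.
lora_params = [p for p in pipe.unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(lora_params, lr=1e-4)
```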

zhch-sun commented 1 month ago

Since your algorithm converges faster, could using it alleviate the catastrophic forgetting problem during SFT, i.e., retain more knowledge from the pre-trained model (for example, by using early stopping)?

Outsider565 commented 1 month ago

Maybe you can check out this paper. I would expect LoRA-GA to forget less than full fine-tuning, but I have no idea whether it forgets less or more than standard LoRA. Feel free to try it in your setting! If you get some good results, I'm happy to discuss them, so feel free to reach out.
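
If you want to try the early-stopping idea, an untested, generic sketch might look like the following. The train/eval callables, the monitored metric, and the patience value are all placeholders; the point is just that a faster-converging method takes fewer update steps away from the pre-trained weights before stopping.

```python
# Generic early-stopping sketch (placeholders, not code from our experiments):
# stop SFT once the validation metric stops improving, then restore the best
# checkpoint, so later epochs cannot add further drift/forgetting.

def train_with_early_stopping(model, train_one_epoch, evaluate, max_epochs=10, patience=2):
    best_metric = float("inf")  # assumes a lower-is-better metric, e.g. validation loss
    best_state = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        metric = evaluate(model)

        if metric < best_metric:
            best_metric = metric
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop early: extra epochs mainly risk forgetting

    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```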