Yoogeonhui opened this issue 1 year ago
Wow, this looks very similar to the idea @brian6091 had. Might want to have a look here!
Some random initial thoughts on the idea:
So why not just optimize P, Q directly on the Stiefel manifold, which is the idea I had in mind? This retraction-mimicking approach (keeping P, Q only approximately orthogonal via a regularizer) might not be the optimal way of doing things, and many alternative approaches could readily be tried here. It looks really awesome.
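To make the contrast concrete, here's a minimal PyTorch sketch of the two routes, assuming P is d×r and Q is r×k as above. The helper names `orth_penalty` and `qr_retract` are mine, not from the paper: the first is the soft-penalty style the paper seems to rely on, the second is one common way to retract back onto the Stiefel manifold.

```python
import torch

def orth_penalty(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Soft orthogonality penalty: pushes the columns of P (d x r) and the
    rows of Q (r x k) toward orthonormality without any manifold projection.
    Added to the task loss with some weight gamma."""
    r = P.shape[1]
    eye = torch.eye(r, device=P.device, dtype=P.dtype)
    return ((P.T @ P - eye) ** 2).sum() + ((Q @ Q.T - eye) ** 2).sum()

def qr_retract(P: torch.Tensor) -> torch.Tensor:
    """The manifold alternative: after a gradient step, retract P back onto
    the Stiefel manifold via a QR decomposition (sign-fixed for determinism)."""
    Qf, Rf = torch.linalg.qr(P)
    return Qf * torch.sign(torch.diagonal(Rf)).unsqueeze(0)

# usage: loss = task_loss + gamma * orth_penalty(P, Q)
#        or, per step: P.data = qr_retract(P.data)
```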
How they allocate the diagonal entries to meet the parameter budget, and perform a discrete optimization there, sounds amazing. It also seems to have an effect on generalization, considering how well they perform compared to full fine-tuning. The results are clearly very impressive. A toy sketch of that allocation step is below.
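Here is my own toy simplification of that budget allocation (the function name `allocate_budget` and the plain |value| importance score are mine; the paper uses a more elaborate sensitivity-based criterion): score every diagonal entry across all layers, keep the top-k globally, and zero the rest, so each layer ends up with its own effective rank.

```python
import torch

def allocate_budget(diagonals: dict[str, torch.Tensor], budget: int) -> dict[str, torch.Tensor]:
    """Keep the `budget` most important diagonal entries across all layers,
    zero out the rest. Each value in `diagonals` is a layer's 1-D diagonal vector."""
    sizes = [d.numel() for d in diagonals.values()]
    scores = torch.cat([d.abs().flatten() for d in diagonals.values()])
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep[scores.topk(min(budget, scores.numel())).indices] = True
    masks = torch.split(keep, sizes)
    return {name: d * mask.view_as(d)
            for (name, d), mask in zip(diagonals.items(), masks)}

# effective rank per layer after pruning:
# {name: int(d.count_nonzero()) for name, d in allocate_budget(diagonals, budget).items()}
```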
This is incredibly fascinating. One can imagine that this essentially saves a LOT of budget, as one "has" to use rank 12 for the above case to get the above model. The deeper the layer, the "larger" the rank it needs, which definitely makes sense from a very classical feature-representation perspective.
This is also demonstrated by Figure 1.
I actually find this very good, since we already needed to work on the scaling part as well. Saving into the LoRA format would be the common part, which means the save format needed to be fixed anyway. We can pull this off extremely nicely by just inheriting from the LoRA class, reparametrizing its form, and modifying the save function (rough sketch below). Might as well make it compatible with diffusers' format all at once. Getting decent performance with this modification sounds almost certain.
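Roughly what I mean, assuming a generic LoRA-injected linear layer (the class and attribute names below are placeholders, not the actual repo or diffusers API): subclass it, reparametrize the update as P · diag(λ) · Q, and have the save path fold the diagonal back into plain up/down matrices so existing LoRA loaders keep working.

```python
import torch
import torch.nn as nn

class LoraLinear(nn.Module):
    """Placeholder for the existing injected-linear class (rank-r update up @ down)."""
    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.lora_down = nn.Linear(in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, out_features, bias=False)
        self.scale = 1.0

    def forward(self, x):
        return self.linear(x) + self.scale * self.lora_up(self.lora_down(x))

class SvdLoraLinear(LoraLinear):
    """Reparametrize the update as P @ diag(lmbda) @ Q so the diagonal can be
    pruned for budget allocation, while still exporting the usual up/down format."""
    def __init__(self, in_features, out_features, rank=4):
        super().__init__(in_features, out_features, rank)
        self.lmbda = nn.Parameter(torch.zeros(rank))  # singular-value-like diagonal

    def forward(self, x):
        h = self.lora_down(x) * self.lmbda            # Q x, scaled by the diagonal
        return self.linear(x) + self.scale * self.lora_up(h)

    def to_lora_state_dict(self):
        """Fold the diagonal into lora_up so a standard LoRA loader can read it."""
        return {
            "lora_down.weight": self.lora_down.weight.detach().clone(),
            "lora_up.weight": (self.lora_up.weight * self.lmbda).detach().clone(),
        }
```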
Thanks for the paper! @cloneofsimo this should mix nicely with training/varying rank by block. I'm having a closer read of the paper.
https://openreview.net/forum?id=lq62uWRJjiY
Recent progress applying MARVEL to LMs showed better performance than LoRA. I haven't read the paper thoroughly yet, but this approach also seems applicable to the diffusion process.