Yoogeonhui opened this issue 1 year ago
Wow, this looks very similar to the idea @brian6091 had. Might want to have a look here!
Some random initial thoughts on the idea:
So why not just optimize P, Q directly on the Stiefel manifold, which is the idea I had in mind? This retraction-mimicking approach (keeping P, Q only approximately orthogonal via a regularizer) might not be the optimal way of doing things, and many alternative approaches could readily be tried here. It looks really awesome.
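To make the contrast concrete, here's a minimal PyTorch sketch of the two routes, assuming P is d×r and Q is r×k as above. The helper names `orth_penalty` and `qr_retract` are mine, not from the paper: the first is the soft-penalty style the paper seems to rely on, the second is one common way to retract back onto the Stiefel manifold.

```python
import torch

def orth_penalty(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Soft orthogonality penalty: pushes the columns of P (d x r) and the
    rows of Q (r x k) toward orthonormality without any manifold projection.
    Added to the task loss with some weight gamma."""
    r = P.shape[1]
    eye = torch.eye(r, device=P.device, dtype=P.dtype)
    return ((P.T @ P - eye) ** 2).sum() + ((Q @ Q.T - eye) ** 2).sum()

def qr_retract(P: torch.Tensor) -> torch.Tensor:
    """The manifold alternative: after a gradient step, retract P back onto
    the Stiefel manifold via a QR decomposition (sign-fixed for determinism)."""
    Qf, Rf = torch.linalg.qr(P)
    return Qf * torch.sign(torch.diagonal(Rf)).unsqueeze(0)

# usage: loss = task_loss + gamma * orth_penalty(P, Q)
#        or, per step: P.data = qr_retract(P.data)
```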
How they allocate the diagonal entries to meet the parameter budget, and perform a discrete optimization there, sounds amazing. It also seems to have an effect on generalization, considering how well they perform compared to full fine-tuning. The results are clearly very impressive. A toy sketch of that allocation step is below.
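Here is my own toy simplification of that budget allocation (the function name `allocate_budget` and the plain |value| importance score are mine; the paper uses a more elaborate sensitivity-based criterion): score every diagonal entry across all layers, keep the top-k globally, and zero the rest, so each layer ends up with its own effective rank.

```python
import torch

def allocate_budget(diagonals: dict[str, torch.Tensor], budget: int) -> dict[str, torch.Tensor]:
    """Keep the `budget` most important diagonal entries across all layers,
    zero out the rest. Each value in `diagonals` is a layer's 1-D diagonal vector."""
    sizes = [d.numel() for d in diagonals.values()]
    scores = torch.cat([d.abs().flatten() for d in diagonals.values()])
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep[scores.topk(min(budget, scores.numel())).indices] = True
    masks = torch.split(keep, sizes)
    return {name: d * mask.view_as(d)
            for (name, d), mask in zip(diagonals.items(), masks)}

# effective rank per layer after pruning:
# {name: int(d.count_nonzero()) for name, d in allocate_budget(diagonals, budget).items()}
```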
This is incredibly fascinating. One can imagine that this essentially saves a LOT of budget, as one "has" to use rank 12 for the above case to get the above model. The deeper the layer, the "larger" the rank it needs, which definitely makes sense from a very classical feature-representation perspective.
This is also demonstrated by Figure 1.
I actually find this very good, since we already needed to work on the scaling part as well. Saving into the LoRA format would be the common part, which means the save format needed to be fixed anyway. We can pull this off extremely nicely by just inheriting from the LoRA class, reparametrizing its form, and modifying the save function (rough sketch below). Might as well make it compatible with diffusers' format all at once. Getting decent performance with this modification sounds almost certain.
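Roughly what I mean, assuming a generic LoRA-injected linear layer (the class and attribute names below are placeholders, not the actual repo or diffusers API): subclass it, reparametrize the update as P · diag(λ) · Q, and have the save path fold the diagonal back into plain up/down matrices so existing LoRA loaders keep working.

```python
import torch
import torch.nn as nn

class LoraLinear(nn.Module):
    """Placeholder for the existing injected-linear class (rank-r update up @ down)."""
    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.lora_down = nn.Linear(in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, out_features, bias=False)
        self.scale = 1.0

    def forward(self, x):
        return self.linear(x) + self.scale * self.lora_up(self.lora_down(x))

class SvdLoraLinear(LoraLinear):
    """Reparametrize the update as P @ diag(lmbda) @ Q so the diagonal can be
    pruned for budget allocation, while still exporting the usual up/down format."""
    def __init__(self, in_features, out_features, rank=4):
        super().__init__(in_features, out_features, rank)
        self.lmbda = nn.Parameter(torch.zeros(rank))  # singular-value-like diagonal

    def forward(self, x):
        h = self.lora_down(x) * self.lmbda            # Q x, scaled by the diagonal
        return self.linear(x) + self.scale * self.lora_up(h)

    def to_lora_state_dict(self):
        """Fold the diagonal into lora_up so a standard LoRA loader can read it."""
        return {
            "lora_down.weight": self.lora_down.weight.detach().clone(),
            "lora_up.weight": (self.lora_up.weight * self.lmbda).detach().clone(),
        }
```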
Thanks for the paper! @cloneofsimo this should mix nicely with training/varying rank by block. I'm having a closer read of the paper.
https://openreview.net/forum?id=lq62uWRJjiY
Recent progress applying MARVEL to LMs showed better performance than LoRA. I haven't read the paper thoroughly yet, but this approach also seems applicable to the diffusion process.