Gsunshine / Enjoy-Hamburger

[ICLR 2021 top 3%] Is Attention Better Than Matrix Decomposition?

Difference between the code and Eq(13) in the paper about the gradient calculation #9

Closed dywu98 closed 1 year ago

dywu98 commented 1 year ago

https://github.com/Gsunshine/Enjoy-Hamburger/blob/d9b51f6f197486df68c6e059e396520680157c08/seg_mm/mmseg/models/decode_heads/ham_head.py#L45

According to the paper, shouldn't the gradient of the MDs be the one-step gradient? However, the NMF code does not apply `torch.no_grad()` to the `local_inference` of `NMF` and `_MatrixDecomposition2DBase`. Could you please provide some explanation of this difference?

Gsunshine commented 1 year ago

Hi @Magic-Ha ,

Thank you for your interest in Hamburger! This is an excellent question!

The gradient steps you can use depend on the conditioning of the Jacobian matrix. For NMF in LightHam, according to my tests, the 1-step gradient produces performance similar to the exact gradient. So you can definitely choose the 1-step gradient for NMF to save backprop memory and time (both O(1) memory and time complexity in the backward pass). Just remove the `#` there. For the optimization algorithm in VQ, however, the conditioning will usually prevent you from exploiting more gradient information. So if you are going to choose the VQ Ham for Hamburger, it is generally better to use the 1-step gradient directly.
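For concreteness, here is a minimal sketch of the 1-step-gradient pattern described above. The names `update` and `steps` are illustrative placeholders, not the repo's actual API:

```python
import torch

def md_one_step_grad(x, bases, update, steps=6):
    """Run all but the last MD iteration without building a graph, then
    differentiate only through the final update (the "1-step gradient").

    `update(x, bases)` stands in for one multiplicative-update step of NMF
    (or one VQ step); it is an assumption for illustration.
    """
    with torch.no_grad():
        for _ in range(steps - 1):
            bases = update(x, bases)
    # Only this final step contributes to the backward pass, so backprop
    # cost is O(1) in the number of iterations.
    bases = update(x, bases)
    return bases
```

By contrast, leaving the `torch.no_grad()` decorator commented out, as in the released NMF code, backpropagates through every iteration, which is the difference the question points to.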

As for further analysis, you may check my paper on training more general implicit layers/deep equilibrium models here. To improve the gradient's conditioning while incorporating more gradient information, you may consider a "damping" effect that still forms a descent direction with respect to the exact gradient (with better conditioning) while keeping your fixed point/optimization solution unchanged.
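A rough, hedged sketch of that damping idea (assuming a generic fixed-point map `f`; `lam`, `k`, and `solver_steps` are illustrative, not the paper's exact algorithm):

```python
import torch

def damped_unroll(f, z0, lam=0.5, k=3, solver_steps=30):
    """Solve z* = f(z*) without a graph, then unroll a few damped steps that
    keep the fixed point unchanged but give a better-conditioned, inexact
    gradient in the backward pass."""
    # Forward: approximate the fixed point without tracking gradients.
    with torch.no_grad():
        z = z0
        for _ in range(solver_steps):
            z = f(z)

    # Damped updates: z -> (1 - lam) * z + lam * f(z) shares the same fixed
    # point as z -> f(z), so the solution is preserved while the gradient
    # through these k steps is damped.
    for _ in range(k):
        z = (1.0 - lam) * z + lam * f(z)
    return z
```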

Once more, the conditioning of the Jacobian determines how exact the gradient needs to be. So it is not strange that inexact gradients like the 1-step gradient can produce results as good as the exact gradient, or even surpass it sometimes.

Embrace inexact gradients and enjoy Hamburger! Thank you again for your interest!

Zhengyang

dywu98 commented 1 year ago

Thanks for your explanation. You've been very helpful!

Gsunshine commented 1 year ago

Please feel free to ask if you have additional questions. I'm very happy to chat and explain. :)

dywu98 commented 1 year ago

Sorry about the typo lol. Maybe I clicked the wrong prompt. I did mean to type "explanation" :) Already corrected it.

Gsunshine commented 1 year ago

That would be an interesting topic for an NLP conference, lol. "Investigating and Evaluating Prompt Bias in Pretrained Language Models for Typing Software." EMNLP 2023