Closed · dywu98 closed this issue 1 year ago
Hi @Magic-Ha ,
Thank you for your interest in Hamburger! This is an excellent question!
The number of gradient steps you can use depends on the conditioning of the Jacobian matrix. For NMF in LightHam, according to my tests, the 1-step gradient produces performance similar to the exact gradient, so you can definitely choose the 1-step gradient for NMF to save backprop memory and time (both O(1) memory and time complexity in the backward pass). (Remove the `#` there.) For the optimization algorithm in VQ, however, the conditioning will usually prevent you from exploiting more gradient information. So if you choose the VQ Ham for Hamburger, it is generally better to use the 1-step gradient directly.
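For concreteness, here is a minimal sketch of how a 1-step gradient is typically realized in PyTorch (this is not the repository's exact code; `nmf_one_step_grad` and `mu_update` are hypothetical names): run all but the last solver iteration under `torch.no_grad()` so that autograd only differentiates through the final update.

```python
import torch

def mu_update(x, bases, coef):
    """A simplified NMF multiplicative update, for illustration only."""
    coef = coef * (bases.t() @ x) / (bases.t() @ bases @ coef + 1e-6)
    bases = bases * (x @ coef.t()) / (bases @ coef @ coef.t() + 1e-6)
    return bases, coef

def nmf_one_step_grad(x, bases, coef, update, n_iter=6):
    """1-step gradient for an iterative solver.

    All but the last update run under torch.no_grad(), so the backward
    pass only traverses the final step: O(1) memory and time in the
    backward pass, no matter how large n_iter is.
    """
    with torch.no_grad():
        for _ in range(n_iter - 1):
            bases, coef = update(x, bases, coef)
    # Only this last step is recorded by autograd.
    bases, coef = update(x, bases, coef)
    return bases, coef
```

Gradients still flow to `x` (and to any upstream features) through that single tracked step, which is the inexact gradient discussed above.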
As for further analysis, you may check my paper on training more general implicit layers/deep equilibrium models here. To improve the gradient's conditioning and incorporate more gradient information, you can consider a "damping" effect, which forms a descent direction relative to the exact gradient (with better conditioning) while keeping your fixed point/optimization solution unchanged.
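As a rough illustration of the damping idea (function names, `tau`, and `k` here are made up for the sketch, not taken from the paper): replace the update map `f` with the damped map `g(z) = tau * f(z) + (1 - tau) * z`, which has the same fixed point as `f` but the better-conditioned Jacobian `tau * J_f + (1 - tau) * I`, and unroll a few damped steps from the detached solution to form an inexact gradient.

```python
import torch

def damped_phantom_grad(f, z, tau=0.5, k=3):
    """Inexact gradient via a damped update map (a sketch).

    z is an approximate fixed point of f, found by any solver with
    gradients disabled. We detach it, then unroll k steps of the damped
    map g(z) = tau * f(z) + (1 - tau) * z with autograd enabled; the
    backward pass differentiates only through these k damped steps.
    """
    z = z.detach()
    for _ in range(k):
        z = tau * f(z) + (1 - tau) * z
    return z
```

Because `g` and `f` share the same fixed point, the forward output is unchanged, while the backward pass sees the damped, better-conditioned Jacobian.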
Once again, the conditioning of the Jacobian determines how exact the gradient needs to be. So it is not strange that inexact gradients like the 1-step gradient can produce results as good as the exact gradient, or even surpass it sometimes.
Embrace inexact gradients and enjoy Hamburger! Thank you for your interest again!
Zhengyang
Thanks for your explanation. You've been very helpful!
Please feel free to ask if you have additional questions. I'm very happy to chat and explain. :)
Sorry about the typo, lol. Maybe I clicked the wrong prompt. I did mean to type "explanation" :) . Already corrected it.
That would be an interesting topic for an NLP conference, lol: investigating and evaluating prompt bias in pretrained language models for typewriting software. EMNLP 2023.
https://github.com/Gsunshine/Enjoy-Hamburger/blob/d9b51f6f197486df68c6e059e396520680157c08/seg_mm/mmseg/models/decode_heads/ham_head.py#L45
According to the paper, shouldn't the gradient of the MDs be the one-step gradient? However, the NMF code does not apply `torch.no_grad()` to the `local_inference` of `NMF` and `_MatrixDecomposition2DBase`. Could you please provide some explanation of this difference?