Hi, dear authors,
I have recently been reading the ProxQuant paper. It's interesting that it gives an explanation of the STE, and it inspired me to think about an alternative to the commonly used STE. In my understanding, one contribution is moving from lazy projected SGD to non-lazy projection, which is described in Algorithm 1 (and in Eq. 4).
Instead of theta(t+1) = theta(t) - lr * grad(L, Quant(theta(t))), Eq. 4 proposes
theta(t+1) = Quant(theta(t) - lr * grad(L, theta(t))), where theta(t) is the full-precision weight.
I'm a little confused because this still seems the same as the lazy-projection one: theta(t+1) is already quantized, so the next update (for t+2) still has the form
theta(t+2) = Quant(theta(t+1) - lr * grad(L, theta(t+1))), where theta(t+1) is the after-quantization value rather than a full-precision value.
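To make my confusion concrete, here is a minimal numeric sketch of the two updates exactly as I wrote them above. The loss, quantizer, and all names are my own toy choices (a 1-D quadratic loss and a sign quantizer), not taken from the paper:

```python
import numpy as np

# Toy setup (hypothetical, for illustration only):
# loss L(w) = 0.5 * (w - 0.3)**2, binary quantizer Quant(w) = sign(w).

def quant(w):
    return np.sign(w)

def grad(w):
    return w - 0.3  # dL/dw for the toy quadratic loss

lr, T = 0.1, 5
theta0 = 0.8

# Lazy projection (STE / BinaryConnect style): the full-precision theta is
# carried across steps; quantization is applied only inside the gradient.
w_lazy = theta0
for _ in range(T):
    w_lazy = w_lazy - lr * grad(quant(w_lazy))

# Non-lazy update literally as written in the question:
# theta(t+1) = Quant(theta(t) - lr * grad(L, theta(t))).
# Read this way, the iterate is re-quantized at every step, so it collapses
# onto quantized values -- which is exactly the point of my confusion.
w_nonlazy = theta0
for _ in range(T):
    w_nonlazy = quant(w_nonlazy - lr * grad(w_nonlazy))

print("lazy (full-precision iterate):", w_lazy)
print("non-lazy as written (quantized iterate):", w_nonlazy)
```

In this toy run the lazy version keeps drifting as a full-precision value, while the non-lazy version as I have written it gets stuck at the quantized point after the first step.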
My understanding might be wrong. Could you please give me a hint on how to understand Eq. 4?