Hi, dear authors,
I have recently been reading the ProxQuant paper. It's interesting that it gives an explanation of the STE, and it inspired me to think about an alternative to the commonly used STE. In my understanding, one contribution is moving from lazy projected SGD to non-lazy projection, which is described in Algorithm 1 (and in Eq. 4).
Instead of theta(t+1) = theta(t) - lr * grad(L, Quant(theta(t))), Eq. 4 proposes
theta(t+1) = Quant(theta(t) - lr * grad(L, theta(t))), where theta(t) is the full-precision weight.
I'm a little confused because this still seems the same as the lazy-projection one: theta(t+1) is already quantized, so the next update (for t+2) still has the form
theta(t+2) = Quant(theta(t+1) - lr * grad(L, theta(t+1))), where theta(t+1) is the after-quantization value rather than a full-precision value.
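To make my confusion concrete, here is a minimal numeric sketch of the two updates exactly as I wrote them above. The loss, quantizer, and all names are my own toy choices (a 1-D quadratic loss and a sign quantizer), not taken from the paper:

```python
import numpy as np

# Toy setup (hypothetical, for illustration only):
# loss L(w) = 0.5 * (w - 0.3)**2, binary quantizer Quant(w) = sign(w).

def quant(w):
    return np.sign(w)

def grad(w):
    return w - 0.3  # dL/dw for the toy quadratic loss

lr, T = 0.1, 5
theta0 = 0.8

# Lazy projection (STE / BinaryConnect style): the full-precision theta is
# carried across steps; quantization is applied only inside the gradient.
w_lazy = theta0
for _ in range(T):
    w_lazy = w_lazy - lr * grad(quant(w_lazy))

# Non-lazy update literally as written in the question:
# theta(t+1) = Quant(theta(t) - lr * grad(L, theta(t))).
# Read this way, the iterate is re-quantized at every step, so it collapses
# onto quantized values -- which is exactly the point of my confusion.
w_nonlazy = theta0
for _ in range(T):
    w_nonlazy = quant(w_nonlazy - lr * grad(w_nonlazy))

print("lazy (full-precision iterate):", w_lazy)
print("non-lazy as written (quantized iterate):", w_nonlazy)
```

In this toy run the lazy version keeps drifting as a full-precision value, while the non-lazy version as I have written it gets stuck at the quantized point after the first step.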
My understanding might be wrong. Could you please give me a hint on how to understand Eq. 4?