UCSB-NLP-Chang / CoPaint

Implementation of paper 'Towards Coherent Image Inpainting Using Denoising Diffusion Implicit Models'

Effect of clamping pred_x0 on gradients of loss with respect to xt #6

Closed jamesheald closed 1 year ago

jamesheald commented 1 year ago

In O_DDIMSampler of ddim.py, you clamp pred_x0 between -1 and 1 (line 470 in the function process_xstart), but you calculate gradients of a loss function with respect to xt (line 613) and this loss depends on xt via pred_x0. The clamping operation presumably destroys all gradient information? I would expect this to undermine the performance of the method. Am I missing something?

ghzhang233 commented 1 year ago

Hi James, the gradient of pred_x0.clamp(-1, 1) with respect to pred_x0 is 1 where pred_x0 $\in [-1, 1]$ and 0 elsewhere, elementwise, so clamping does not destroy all gradient information unless every dimension of pred_x0 is clipped ($>1$ or $<-1$). It might be interesting to see what happens without the clamp operation, but as I recall it did not go well. Let me know if there is anything I failed to make clear.
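A minimal sketch of this elementwise rule in plain Python (standing in for what autograd does when differentiating through `clamp`):

```python
def clamp_grad(xs, lo=-1.0, hi=1.0):
    """Elementwise (sub)gradient of clamp(x, lo, hi) with respect to x:
    1 where x lies inside [lo, hi], 0 where the clamp is active."""
    return [1.0 if lo <= x <= hi else 0.0 for x in xs]

# Only the clipped dimensions lose their gradient:
print(clamp_grad([-1.5, 0.2, 0.9, 1.3]))  # [0.0, 1.0, 1.0, 0.0]
```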

jamesheald commented 1 year ago

Thanks for your quick response. I guess I don't have a good feel for how often clamp is actually being triggered. Obviously this isn't an issue if it fires rarely, but it will presumably become increasingly problematic the more often the clamp is active.
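One way to get a feel for this would be to log, at each sampling step, the fraction of elements the clamp actually modifies (a hypothetical diagnostic, not something in the CoPaint code):

```python
def clipped_fraction(pred_x0, lo=-1.0, hi=1.0):
    """Fraction of elements of pred_x0 that clamping to [lo, hi] would
    modify, i.e. the fraction whose gradient the clamp zeroes out."""
    clipped = sum(1 for v in pred_x0 if v < lo or v > hi)
    return clipped / len(pred_x0)

# e.g. logged once per timestep inside the sampler loop:
print(clipped_fraction([-1.5, 0.0, 2.0, 0.5]))  # 0.5
```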

It's interesting to note that in the equation for pred_x0:

pred_x0 = sqrt_recip_alphas_cumprod * x_t - sqrt_recipm1_alphas_cumprod * eps

x_t appears twice: once directly as x_t (the first term) and once indirectly via epsilon_theta(x_t) (the second term). In principle, requiring x0 to lie in $[-1, 1]$ is equivalent to bounding epsilon (with bounds that depend on x_t), so rather than clipping pred_x0, one could instead clip epsilon_theta(x_t); this has the advantage that even when epsilon_theta(x_t) is clipped, gradients can still flow through the x_t that appears directly in the equation for pred_x0. I was thinking of doing this in my own implementation. Do you agree that this makes sense and is potentially preferable to clipping pred_x0?
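To make the proposal concrete, here is a pure-Python sketch of the two gradient computations. `C1` and `C2` are illustrative scalar values standing in for sqrt_recip_alphas_cumprod and sqrt_recipm1_alphas_cumprod at one timestep (not taken from any real schedule), and the eps bounds are treated as constants (detached from x_t), which is what lets the direct x_t term keep its gradient:

```python
C1 = 1.5   # stands in for sqrt_recip_alphas_cumprod (illustrative value)
C2 = 1.2   # stands in for sqrt_recipm1_alphas_cumprod (illustrative value)

def dpred_x0_dxt_clamp_x0(x_t, eps, deps_dxt):
    """d(pred_x0)/d(x_t) when pred_x0 = C1*x_t - C2*eps is clamped to [-1, 1]."""
    pred_x0 = C1 * x_t - C2 * eps
    if -1.0 <= pred_x0 <= 1.0:
        return C1 - C2 * deps_dxt   # chain rule through both terms
    return 0.0                      # clamp active: gradient vanishes entirely

def dpred_x0_dxt_clamp_eps(x_t, eps, deps_dxt):
    """d(pred_x0)/d(x_t) when eps is clipped instead, to the bounds implied
    by pred_x0 in [-1, 1] (bounds treated as constants w.r.t. x_t)."""
    lo = (C1 * x_t - 1.0) / C2      # from pred_x0 <= 1
    hi = (C1 * x_t + 1.0) / C2      # from pred_x0 >= -1
    if lo <= eps <= hi:
        return C1 - C2 * deps_dxt   # same chain rule as above
    return C1                       # eps clipped: only the direct x_t term survives
```

Under this scheme d(pred_x0)/d(x_t) never vanishes entirely: even when eps is clipped, the direct term still contributes C1, whereas clamping pred_x0 zeroes the whole gradient whenever the clamp is active.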

Thanks

ghzhang233 commented 1 year ago

Hi James,

I appreciate your interest in our work and your suggestion. I believe it has potential, but I am not entirely certain. It feels more like an empirical question to me, and both options seem plausible.

In some sense, even the constraint pred_x0 $\in [-1, 1]$ could be unnecessary. However, as I recall it is quite useful for improving performance, which is why most open-source diffusion model implementations adopt it. This level of detail is usually not discussed in papers, so choosing between the options, in my opinion, depends largely on empirical results.

Thanks

jamesheald commented 1 year ago

Yes, in my experience clamping pred_x0, though not required theoretically, is absolutely critical in practice. Frustratingly, these things are glossed over, or not mentioned at all, in papers, as you say. I'll try both and see if I notice a difference. Thanks again for your quick responses.

ghzhang233 commented 1 year ago

Hi James, upon reconsidering, I think I actually prefer your version personally, but I'm unsure how large an impact it would have on the empirical results.