buoyancy99 / diffusion-forcing

code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"

Difference between different `objective` choices #6

Closed gunnxx closed 1 month ago

gunnxx commented 1 month ago

Can you elaborate more on the differences between [pred_noise, pred_v, pred_x0]?

Specifically, I am trying to parse the code, especially for the video task, and I noticed that it uses the pred_v objective. pred_v calls DiffusionTransitionModel.predict_start_from_v, which looks like the same function as DiffusionTransitionModel.q_sample: both appear to be just the reparameterization trick (the former uses subtraction, the latter addition). So I am a bit lost.

buoyancy99 commented 1 month ago

Some intuition: pred_x0 cares more about low-frequency information, like an l2 loss, and is ideal for non-image domains. pred_noise cares more about high-frequency information, which could be important for images.

Using standard diffusion notation, x_t denotes the t-step noised version of x. Intuitively, the scale of ||x - x_hat|| is often smaller for the less noisy, high-frequency denoise(x_1) than for denoise(x_T), so the pred_x0 objective puts more emphasis on denoising x_T than x_1, since T >> 1. In contrast, pred_noise (epsilon prediction) manually injects a prior that increases the weight of the loss on x_1: the epsilon targets for x_1 and x_T both follow a roughly standard normal distribution and are at a similar scale, despite T >> 1.
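
To make the weighting argument concrete, here is a minimal sketch. It is not taken from this repo: it assumes a variance-preserving schedule, x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, and the common definition v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0. The function name `loss_ratios` and the 0.1 error scale are hypothetical. For one fixed imperfect estimate of x0, it measures how large the same error looks in epsilon space and in v space, relative to x0 space.

```python
import math
import random


def loss_ratios(alpha_bar, seed=0, dim=1000):
    """Return (eps_mse / x0_mse, v_mse / x0_mse) for one fixed x0 estimate.

    Assumes a variance-preserving schedule; alpha_bar is the cumulative
    signal coefficient (close to 1 means little noise).
    """
    rng = random.Random(seed)
    a = math.sqrt(alpha_bar)        # signal scale
    s = math.sqrt(1.0 - alpha_bar)  # noise scale

    x0 = [rng.gauss(0, 1) for _ in range(dim)]
    eps = [rng.gauss(0, 1) for _ in range(dim)]
    x_t = [a * x + s * e for x, e in zip(x0, eps)]

    # Hypothetical imperfect model output, expressed as an x0 estimate.
    x0_hat = [x + 0.1 * rng.gauss(0, 1) for x in x0]

    # The eps and v estimates implied by that same x0 estimate.
    eps_hat = [(xt - a * xh) / s for xt, xh in zip(x_t, x0_hat)]
    v = [a * e - s * x for e, x in zip(eps, x0)]
    v_hat = [a * eh - s * xh for eh, xh in zip(eps_hat, x0_hat)]

    def mse(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)

    l_x0 = mse(x0, x0_hat)
    return mse(eps, eps_hat) / l_x0, mse(v, v_hat) / l_x0


if __name__ == "__main__":
    for alpha_bar in (0.99, 0.5, 0.01):  # low noise -> high noise
        r_eps, r_v = loss_ratios(alpha_bar)
        snr = alpha_bar / (1.0 - alpha_bar)
        print(f"alpha_bar={alpha_bar}: SNR={snr:.4f}, "
              f"eps/x0={r_eps:.4f}, v/x0={r_v:.4f}")
```

Under these assumptions the epsilon loss equals the x0 loss scaled by SNR = alpha_bar / (1 - alpha_bar), and the v loss equals it scaled by SNR + 1. So the epsilon loss up-weights the low-noise steps relative to the x0 loss, while the v loss behaves like the epsilon loss at low noise and like the x0 loss at high noise.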

The pred_v objective is something between these two, explained here: https://lilianweng.github.io/posts/2024-04-12-diffusion-video/
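
On the original question about predict_start_from_v versus q_sample, the sketch below may help. It is written from the common v-parameterization and is not copied from this repo, so the function bodies are assumptions and only loosely mirror the names in DiffusionTransitionModel; `v_target`, `predict_noise_from_v`, and `_scales` are hypothetical helpers. It works on scalars for simplicity. The two functions have the same algebraic shape, but q_sample maps (x0, eps) forward to x_t, whereas predict_start_from_v maps (x_t, v) back to x0.

```python
import math


def _scales(alpha_bar):
    """Signal and noise scales of a variance-preserving schedule."""
    return math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)


def q_sample(x0, eps, alpha_bar):
    """Forward noising: x_t = a * x0 + s * eps."""
    a, s = _scales(alpha_bar)
    return a * x0 + s * eps


def v_target(x0, eps, alpha_bar):
    """Training target under pred_v: v = a * eps - s * x0."""
    a, s = _scales(alpha_bar)
    return a * eps - s * x0


def predict_start_from_v(x_t, v, alpha_bar):
    """Recover x0 from the noisy sample and v: x0 = a * x_t - s * v."""
    a, s = _scales(alpha_bar)
    return a * x_t - s * v


def predict_noise_from_v(x_t, v, alpha_bar):
    """Recover eps from the noisy sample and v: eps = s * x_t + a * v."""
    a, s = _scales(alpha_bar)
    return s * x_t + a * v


if __name__ == "__main__":
    x0, eps, alpha_bar = 0.5, -1.2, 0.7
    x_t = q_sample(x0, eps, alpha_bar)
    v = v_target(x0, eps, alpha_bar)
    # Both values should be recovered up to floating-point error.
    print(predict_start_from_v(x_t, v, alpha_bar), x0)
    print(predict_noise_from_v(x_t, v, alpha_bar), eps)
```

Substituting the definitions shows why the round trip works: a * x_t - s * v = (a^2 + s^2) * x0 = x0, because a^2 + s^2 = 1 under this schedule.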