buoyancy99 / diffusion-forcing

code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"

About Causal Uncertainty Modeling #10

Closed Perkins729 closed 1 month ago

Perkins729 commented 1 month ago

I would also like to know whether using independent noise levels during training could disrupt the modeling of causal uncertainty. What happens if, when the noise levels are sampled, a past token gets a higher noise level than a future token?

buoyancy99 commented 1 month ago

In fact, the case you mention is exactly why we want independent noise! Notice that at training time we use independent noise, but at sampling time we can choose whatever noise schedule we like, be it full-sequence, pyramid, or autoregressive. If you want past tokens to have higher noise, you can prompt the model to do so!
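
To make the training side concrete, here is a minimal sketch of a training step in which every token draws its own noise level independently, so the "noisier past, cleaner future" case arises naturally. The beta schedule, tensor shapes, and the `model(x_noisy, k)` signature are illustrative assumptions, not the repo's actual code:

```python
import torch
import torch.nn.functional as F

# Standard linear beta schedule (illustrative values).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, x):
    # x: clean token sequence, shape (batch, seq_len, dim)
    b, n, _ = x.shape
    # Key idea: each token samples its own noise level independently,
    # so past tokens are sometimes noisier than future ones in training.
    k = torch.randint(0, T, (b, n), device=x.device)       # (b, n) per-token levels
    a = alphas_cumprod.to(x.device)[k].unsqueeze(-1)       # (b, n, 1)
    noise = torch.randn_like(x)
    x_noisy = a.sqrt() * x + (1.0 - a).sqrt() * noise      # per-token forward diffusion
    pred_noise = model(x_noisy, k)                         # model is conditioned on k
    return F.mse_loss(pred_noise, noise)
```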

Perkins729 commented 1 month ago

> In fact, the case you mention is exactly why we want independent noise! Notice that at training time we use independent noise, but at sampling time we can choose whatever noise schedule we like, be it full-sequence, pyramid, or autoregressive. If you want past tokens to have higher noise, you can prompt the model to do so!

Have you considered using a pyramid noise schedule during training?

buoyancy99 commented 1 month ago

This is definitely worth more exploration!

There are some prior works that did something similar, such as AR-Diffusion and Rolling Diffusion, so it's definitely possible. However, what we found is that people often want different sampling schemes / uncertainty scales at test time. For example, many video generation companies want to support keyframing, not just first-frame conditioning. Or you may want to use our other sampling schemes for different purposes, so independent noise seems like the most robust choice across applications.
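
For intuition, one way to picture this flexibility is as a (steps × tokens) matrix of per-token noise levels that the sampler walks through row by row. The sketch below builds full-sequence, pyramid, and autoregressive variants under that framing; the function name and exact ramps are my own assumptions, not the repo's implementation:

```python
import torch

def make_schedule(kind, num_steps, seq_len, T=1000):
    """Return a (num_steps + 1, seq_len) matrix of per-token noise levels."""
    levels = torch.zeros(num_steps + 1, seq_len, dtype=torch.long)
    for s in range(num_steps + 1):
        for i in range(seq_len):
            if kind == "full_sequence":
                frac = s / num_steps                      # all tokens denoise together
            elif kind == "pyramid":
                # token i starts one step after token i-1, keeping the
                # past cleaner than the future throughout sampling
                frac = min(1.0, max(0.0, (s - i) / max(1, num_steps - seq_len + 1)))
            elif kind == "autoregressive":
                # token i is fully denoised before token i+1 starts
                per_tok = num_steps / seq_len
                frac = min(1.0, max(0.0, (s - i * per_tok) / per_tok))
            else:
                raise ValueError(kind)
            levels[s, i] = round((1.0 - frac) * (T - 1))  # remaining noise at step s
    return levels
```

Under this view, training with independent noise means the model has effectively seen arbitrary rows of such matrices, which is why any of these schedules (or a keyframe-conditioned one) can be plugged in at test time without retraining.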

buoyancy99 commented 1 month ago

Update: I reimplemented AR-Diffusion myself, and the preliminary results show it's not very good, although the videos look visually okay. Rolling Diffusion with a non-causal architecture also seems worse in the metrics, which kinda aligns with their only non-toy dataset result, on Kinetics-600.