Closed Perkins729 closed 1 month ago
In fact, the case you mentioned is exactly why we want independent noise! Notice that at training time we use independent noise, but at sampling time we can choose whatever noise schedule we like, be it full-sequence, pyramid, or autoregressive. If you want past tokens to have higher noise, you can prompt the model to do so!
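To make the training/sampling distinction concrete, here is a minimal NumPy sketch. It is not code from the repo: the function names, the integer noise levels `0..max_level`, and the exact shape of each schedule are assumptions chosen for illustration. Training draws one noise level per token independently; each sampling schedule is a matrix whose rows are denoising steps and whose columns are tokens.

```python
import numpy as np

def training_noise_levels(batch_size, num_tokens, max_level, rng):
    # Training: every token gets its own noise level, drawn independently
    # and uniformly from {0, ..., max_level} (hypothetical discretization).
    return rng.integers(0, max_level + 1, size=(batch_size, num_tokens))

def full_sequence_schedule(num_tokens, max_level):
    # Sampling: all tokens share the same level and are denoised together.
    steps = np.arange(max_level + 1)[:, None]
    return np.broadcast_to(max_level - steps, (max_level + 1, num_tokens)).copy()

def pyramid_schedule(num_tokens, max_level):
    # Sampling: token t starts denoising one step after token t - 1,
    # so earlier tokens are always at most as noisy as later ones.
    steps = np.arange(max_level + num_tokens)[:, None]
    tokens = np.arange(num_tokens)[None, :]
    return np.clip(max_level + tokens - steps, 0, max_level)

def autoregressive_schedule(num_tokens, max_level):
    # Sampling: token t is fully denoised before token t + 1 starts.
    steps = np.arange(num_tokens * max_level + 1)[:, None]
    tokens = np.arange(num_tokens)[None, :]
    return np.clip(max_level * (tokens + 1) - steps, 0, max_level)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(training_noise_levels(2, 4, 3, rng))
    print(pyramid_schedule(3, 2))
```

Every row of every schedule above is a per-token noise-level vector that independent-noise training could have produced, which is why one trained model can be driven by any of them at sampling time.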
Have you considered using a pyramid during the training process?
This is definitely worth more exploration!
There are some prior works that did something similar, such as AR-diffusion and rolling diffusion, so it's definitely possible. However, what we found is that people often want different sampling schemes or uncertainty scales at test time. For example, many video generation companies want to support key framing, not just first-frame conditioning. Or you may want to use our other sampling schemes for different purposes, so independent noise seems the most robust choice across applications.
Update: I reimplemented AR-diffusion myself, and the preliminary results are not very good, although the videos look visually okay. Rolling diffusion with a non-causal architecture also seems worse in numbers, which kinda aligns with their only non-toy dataset result, on Kinetics-600.
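As a rough illustration of why key framing fits independent noise, the sketch below builds a sampling schedule in which given frames stay clean while all other frames are denoised together. This is an assumption-laden example, not the repo's implementation: `keyframe_schedule` is a hypothetical helper, and it assumes an observed frame is represented by noise level 0.

```python
import numpy as np

def keyframe_schedule(num_tokens, max_level, keyframes):
    # Rows are denoising steps, columns are tokens (frames).
    steps = np.arange(max_level + 1)[:, None]
    schedule = np.broadcast_to(max_level - steps, (max_level + 1, num_tokens)).copy()
    # Assumption: observed key frames are treated as clean (level 0)
    # at every step, wherever they sit in the sequence.
    schedule[:, list(keyframes)] = 0
    return schedule

if __name__ == "__main__":
    # Frames 0 and 4 are given; frames 1 to 3 are generated between them.
    print(keyframe_schedule(5, 3, [0, 4]))
```

A model trained only on a fixed pyramid would never have seen a clean frame to the right of a noisy one, which is the pattern this schedule relies on.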
I would also like to know if using an independent noise level during training could potentially disrupt the modeling of causal uncertainty. What if, when sampling the noise level, the noise level of the past tokens is higher than that of the future tokens?