buoyancy99 / diffusion-forcing

code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"

Context frames and generating videos from scratch with DiffForcing #29

Open michael-fuest opened 1 day ago

michael-fuest commented 1 day ago

Hi @buoyancy99 ,

Really enjoyed reading the paper.

I have a question regarding the sampling process in your video experiments. As far as I can tell, you always use n_context_frames > 0, i.e. video prediction only: sampling starts from n_context_frames clean, ground-truth frames taken from a validation video, with the remaining frames initialized as pure noise. Have you also run experiments with a context length of 0? That is, generating videos with either full-sequence or pyramid sampling, without conditioning on any previous frames, using a model trained with diffusion forcing dynamics? If not, would you expect Diffusion Forcing to work well in that scenario as well?
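To make the two sampling regimes in the question concrete, here is a minimal sketch of how the initial sequence could be assembled in each case. The function name and signature are mine for illustration, not from the repo's code:

```python
import torch

def init_sampling_sequence(context_frames, n_total_frames, frame_shape, device="cpu"):
    """Build the initial frame sequence handed to the sampler.

    context_frames: tensor of shape (n_context, *frame_shape) holding clean
    ground-truth frames (video prediction), or None for generation from
    scratch. All remaining frames start as pure Gaussian noise.
    """
    n_context = 0 if context_frames is None else context_frames.shape[0]
    noise = torch.randn(n_total_frames - n_context, *frame_shape, device=device)
    if n_context == 0:
        # context length 0: every frame begins at the maximum noise level
        return noise
    # video prediction: clean context frames followed by pure-noise frames
    return torch.cat([context_frames.to(device), noise], dim=0)
```

With `context_frames=None` this is the "from scratch" setting being asked about; with a non-empty context tensor it matches the video-prediction setup described above.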

Thank you!

buoyancy99 commented 1 day ago

yes it works

michael-fuest commented 13 hours ago

Thanks for your feedback. One more question: how important is v-prediction vs. x0 prediction? In the paper you mention that both v-prediction and your custom SNR reweighting were very important for getting good results in the video prediction experiments. What were the ablation results like when removing the SNR reweighting and using x0 prediction, for instance?
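For reference, the two parameterizations are linearly related, so switching between them is a reparameterization of the same target rather than a different model output. A minimal sketch, assuming the standard variance-preserving convention `x_t = alpha_t * x0 + sigma_t * eps` with `alpha_t**2 + sigma_t**2 == 1` (helper names are mine):

```python
import torch

def v_from_x0_eps(x0, eps, alpha_t, sigma_t):
    # v-parameterization: v = alpha * eps - sigma * x0
    return alpha_t * eps - sigma_t * x0

def x0_from_v(v, x_t, alpha_t, sigma_t):
    # Invert using x_t = alpha * x0 + sigma * eps and alpha^2 + sigma^2 = 1:
    # alpha * x_t - sigma * v = (alpha^2 + sigma^2) * x0 = x0
    return alpha_t * x_t - sigma_t * v
```

The practical difference is in training dynamics: the effective loss weighting over noise levels changes with the parameterization, which is presumably why it interacts with the SNR reweighting mentioned above.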

I am working with Diffusion Forcing but am currently unable to get good results on the FaceForensics dataset with x0 prediction. With a constant noise level across frames it works, but when I switch to diffusion forcing dynamics (an independent random noise level per frame), sample quality breaks down substantially. Any ideas on what I should be tuning? Is v-prediction perhaps essential?
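The distinction between the two training regimes described above can be sketched as follows; the function name and the `independent` flag are mine, not from the repo:

```python
import torch

def sample_frame_noise_levels(batch, n_frames, n_timesteps, independent=True):
    """Draw a diffusion timestep for each frame in each sequence.

    independent=True mimics diffusion-forcing-style training, where every
    frame gets its own random noise level; independent=False uses one
    shared level per sequence, as in standard full-sequence diffusion.
    """
    if independent:
        # (batch, n_frames): each frame is noised to a different level
        return torch.randint(0, n_timesteps, (batch, n_frames))
    # (batch, 1) broadcast to all frames: one shared level per sequence
    t = torch.randint(0, n_timesteps, (batch, 1))
    return t.expand(batch, n_frames)
```

The independent case is a strictly harder denoising task, since neighboring frames can carry very different amounts of signal, which may be part of why the constant-noise setting degrades less with x0 prediction.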