Open michael-fuest opened 1 day ago
yes it works
Thanks for your feedback. One more question: how important is v prediction compared to x0 prediction? In the paper you mention that both v prediction and your custom SNR reweighting were very important for getting good results in the video prediction experiments. What did the ablation results look like when removing the SNR reweighting and working with x0 prediction, for instance?
I am working with DiffForcing, but am currently unable to get good results on the FaceForensics dataset with x0 prediction. When I use the same noise level for every frame, training works, but when I switch to diffusion forcing dynamics (an independent random noise level per frame), sample quality breaks down substantially. Any ideas on what I should be tuning? Maybe v_prediction is essential?
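To make the two setups concrete, here is a minimal NumPy sketch of per-frame versus shared noise levels and of the x0 and v targets. It assumes a variance-preserving schedule and the standard v-prediction definition `v = alpha * eps - sigma * x0`; the function name, arguments, and schedule are hypothetical and not taken from the Diffusion Forcing codebase.

```python
import numpy as np

def make_training_targets(x0, alphas_cumprod, rng, independent=True):
    """Noise a clean video batch and return both candidate regression targets.

    x0: array of shape (batch, frames, ...) holding clean frames.
    alphas_cumprod: 1-D array of cumulative alpha products (assumed VP schedule).
    independent=True draws one noise level per frame (diffusion forcing style);
    independent=False shares a single noise level across all frames of a video.
    """
    b, f = x0.shape[:2]
    num_steps = len(alphas_cumprod)
    if independent:
        t = rng.integers(0, num_steps, size=(b, f))
    else:
        t = np.repeat(rng.integers(0, num_steps, size=(b, 1)), f, axis=1)

    # Reshape the per-frame coefficients so they broadcast over the frame dims.
    bshape = (b, f) + (1,) * (x0.ndim - 2)
    alpha = np.sqrt(alphas_cumprod[t]).reshape(bshape)
    sigma = np.sqrt(1.0 - alphas_cumprod[t]).reshape(bshape)

    eps = rng.standard_normal(x0.shape)
    x_t = alpha * x0 + sigma * eps   # noisy network input
    v = alpha * eps - sigma * x0     # v-prediction target
    return x_t, t, {"x0": x0, "v": v}
```

Under these assumptions the two parameterizations carry the same information, since `x0 = alpha * x_t - sigma * v`; they differ in how the loss is implicitly weighted across noise levels, which is where an SNR reweighting would interact with the choice of target.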
Hi @buoyancy99,
Really enjoyed reading the paper.
I have a question regarding the sampling process in your video experiments. As far as I can tell, you always use `n_context_frames > 0`, i.e. you do video prediction only: sampling starts from `n_context_frames` clean, ground-truth frames taken from a validation video, followed by pure-noise frames for the rest of the sequence. Have you also run experiments with a context length of 0? That is, generating videos with either full-sequence or pyramid sampling, without conditioning on any previous frames at all, using a model trained with diffusion forcing dynamics? If not, would you expect diff forcing to work well in that scenario as well?

Thank you!
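For clarity, the initial sampling state described in this question can be sketched as follows. This is only an illustration of the setup being asked about: `init_sampling_state` and its arguments are hypothetical names, not the repository's API, and noise levels are assumed to be integers with 0 meaning clean and `num_steps - 1` meaning pure noise.

```python
import numpy as np

def init_sampling_state(context, n_frames, num_steps, rng):
    """Build the starting frames and per-frame noise levels for sampling.

    context: array of shape (n_context_frames, ...) with clean frames;
             pass an array with zero frames for unconditional generation.
    Returns (frames, levels), where levels[i] is the noise level of frame i.
    """
    n_context = context.shape[0]
    frame_shape = context.shape[1:]

    # Frames after the context start as pure Gaussian noise.
    noise = rng.standard_normal((n_frames - n_context,) + frame_shape)
    frames = np.concatenate([context, noise], axis=0)

    # Context frames are marked clean, all remaining frames fully noisy.
    levels = np.full(n_frames, num_steps - 1)
    levels[:n_context] = 0
    return frames, levels
```

With `n_context_frames > 0` this corresponds to the video prediction case; passing a context with zero frames gives the unconditional case, where every frame starts at the maximum noise level.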