buoyancy99 / diffusion-forcing

code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"

Questions about pyramid sampling and full-sequence sampling. #12

Open Perkins729 opened 3 months ago

Perkins729 commented 3 months ago

Thank you for the latest version of the code release. When I trained the model and tried different sampling strategies, I found that pyramid sampling performs worse than full_sequence sampling. However, the paper claims that modeling causal uncertainty is better than full-sequence diffusion (for example, Diffuser; I am not entirely clear on how Diffuser is actually implemented). What could be the possible reasons for my experimental results? Thanks for your time!

buoyancy99 commented 3 months ago

Hi,

For anything related to reproducing the paper, please use the paper branch instead of the main branch. We made multiple changes in the v1.5 code, so there is no guarantee that every conclusion carries over. In particular, pyramid sampling matters only in certain tasks: you will see it is important for improving the consistency of the planned trajectory when actions are jointly diffused causally, reflected by the higher uncertainty_scale needed in that setting. It is also needed for MCTG, which is not implemented in my v1.5 code yet. In the video code, you will see we use autoregressive sampling by default. We found that pyramid sampling can achieve comparable or slightly better FVD in the non-causal case, but it is not as good as autoregressive sampling in causal cases in the video domain.
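To make the terminology concrete: a sampling scheme here is just a schedule of per-frame noise levels over the denoising steps. Below is a minimal sketch of that idea (an illustration for this thread only, not the scheduling code actually used in the repo; the function names and exact offsets are made up):

```python
import numpy as np

def full_sequence_schedule(num_frames: int, num_steps: int) -> np.ndarray:
    """Full-sequence diffusion: every frame shares one noise level per step."""
    # Rows are denoising steps (noisiest first), columns are frames.
    levels = np.arange(num_steps - 1, -1, -1)
    return np.repeat(levels[:, None], num_frames, axis=1)

def pyramid_schedule(num_frames: int, num_steps: int) -> np.ndarray:
    """Pyramid sampling: later frames stay noisier than earlier ones."""
    num_rows = num_steps + num_frames - 1
    rows = []
    for m in range(num_rows):
        # Frame k lags the first frame by k denoising steps,
        # clipped to the valid range of noise levels.
        rows.append([int(np.clip(num_steps - 1 - m + k, 0, num_steps - 1))
                     for k in range(num_frames)])
    return np.array(rows)

print(full_sequence_schedule(num_frames=3, num_steps=4))
print(pyramid_schedule(num_frames=3, num_steps=4))
```

Autoregressive sampling would instead fully denoise frame k before touching frame k+1; which of these wins depends on the task, as described above.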

Perkins729 commented 3 months ago

My intuition is that planning tasks like Maze2D require modeling causal uncertainty, so pyramid sampling should theoretically be better than full-sequence diffusion (but the experiments have not been satisfactory; could it be that it only beats full-sequence diffusion when MCTG is used)? We acknowledge that Diffusion Forcing is a flexible training/sampling strategy (we are amazed by your idea). However, in your v1.5 experiments with the more powerful Transformer-based code, does Diffusion Forcing, as an approach lying between full-sequence diffusion and teacher forcing, actually model causal uncertainty effectively? (We may care more about the actual effect than about the flexibility of the training/sampling strategy.)

buoyancy99 commented 3 months ago

Yes, your intuition is likely correct, and it's likely that we need MCTG & diffusing actions - if we only generate one sample, causal uncertainty doesn't really make sense. Outside maze planning, in our development of Diffusion Forcing v2 we saw some strong results from pyramid sampling, so I am confident about its importance across domains.
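To illustrate why multiple samples matter: even something as simple as best-of-N selection (a much-simplified stand-in for MCTG; `sample_plan` and `score_plan` below are hypothetical placeholders for a sampling rollout and a return estimate) only benefits from causal uncertainty when N > 1:

```python
import torch

def best_of_n_plans(sample_plan, score_plan, n_samples: int = 16):
    """Draw several candidate plans and keep the highest-scoring one.

    `sample_plan` and `score_plan` are placeholder callables, e.g. a
    pyramid-sampling rollout and an estimated return. With n_samples == 1
    the causal uncertainty modeled during sampling cannot be exploited.
    """
    plans = [sample_plan() for _ in range(n_samples)]
    scores = torch.tensor([float(score_plan(p)) for p in plans])
    return plans[int(scores.argmax())]
```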

I will try to support v1.5 actively in GitHub issues, but since rigorous answers (as opposed to bug fixes) require me to code and run experiments, please give the authors some time to take a break after the NeurIPS rebuttal, and I will do my best to support you on these v1.5 issues afterward.

Perkins729 commented 3 months ago

Of course, thank you very much for your answers and for such meaningful work. I am just curious why, even though we agree pyramid sampling should make sense, the v1.5 experimental results did not show it - or rather, for which tasks pyramid sampling actually helps. I hope you get some good rest, and we can discuss further afterwards!

buoyancy99 commented 3 months ago

Hi,

I am wondering whether you individually tuned the guidance_scale variable for this comparison. It occurred to me recently that the effective guidance strength depends on the overall signal-to-noise ratio (SNR), and since these sampling schemes have quite different SNRs, the guidance scale should be tuned separately for each of them to make the comparison conclusive.
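Concretely, what I have in mind is a small per-scheme sweep rather than reusing one guidance_scale everywhere. A rough sketch - the script path and config keys below are placeholders, so check the actual option names in the branch you are using:

```python
import itertools

# Placeholder names: adjust the entry point and config keys to match
# the branch you are running (they may differ between paper and v1.5).
schemes = ["full_sequence", "pyramid", "autoregressive"]
guidance_scales = [0.5, 1.0, 2.0, 4.0]

for scheme, scale in itertools.product(schemes, guidance_scales):
    print(f"python main.py sampling.scheme={scheme} sampling.guidance_scale={scale}")
```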

Another note for anyone who finds this page: try the advice above - it may make your general Diffusion Forcing setup work better.