Closed: Dawn-LX closed this issue 2 days ago.
Just FYI, I uploaded my debug script to a public temp repo, here: https://github.com/Dawn-LX/temp/blob/b1ecdd323e845dedf3188f8a9c9a8b68b0c7fc64/df_debug_demo.py#L6 and
https://github.com/Dawn-LX/temp/blob/b1ecdd323e845dedf3188f8a9c9a8b68b0c7fc64/df_debug_demo.py#L84
and here is my debug config:

```python
self.frame_stack = 1      # configurations/algorithm/df_base.yaml
self.chunk_size = 1       # configurations/algorithm/df_base.yaml
self.n_frames = 16        # for dmlab video dataset, configurations/dataset/video_dmlab.yaml
self.context_frames = 2   # for dmlab video dataset, configurations/dataset/video_dmlab.yaml
self.n_tokens = self.n_frames // self.frame_stack
self.x_shape = (3, 128, 128)  # refer to configurations/dataset/base_video.yaml
self.x_stacked_shape = list(self.x_shape)
self.x_stacked_shape[0] *= self.frame_stack
self.clip_noise = 6.0     # configurations/algorithm/df_video.yaml
self.device = torch.device("cpu")
```
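For context, here is a minimal sketch of this kind of schedule-printing check. `sampling_timesteps = 10` and the autoregressive matrix construction below are assumptions I made for illustration, not necessarily what df_debug_demo.py or the repo actually do:

```python
import numpy as np

sampling_timesteps = 10   # assumed number of denoising steps per token (hypothetical value)
n_tokens = 16             # = n_frames // frame_stack with the config above

# My guess at the "autoregressive" schedule: token t only starts denoising once
# token t-1 has reached noise level 0 (a higher entry means a noisier token).
height = sampling_timesteps * n_tokens + 1
scheduling_matrix = np.clip(
    sampling_timesteps * (np.arange(n_tokens)[None, :] + 1) - np.arange(height)[:, None],
    0,
    sampling_timesteps,
)

for m in range(scheduling_matrix.shape[0] - 1):
    from_noise_levels = scheduling_matrix[m]
    to_noise_levels = scheduling_matrix[m + 1]
    print(m, from_noise_levels, to_noise_levels)  # zeros accumulate on the left over time
```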
Hi, teacher forcing / diffusion forcing refers to different training techniques. The `to_noise_level` and `from_noise_level` instead control the sampling method, be it pyramid, autoregressive, full-sequence, etc. Teacher forcing supports autoregressive sampling, while diffusion forcing supports all of them. Therefore, there isn't really any contradiction in using autoregressive sampling for both cases.
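Schematically (toy numbers only, not the exact code in the repo), the sampling methods differ just in how the scheduling matrix staggers noise levels across tokens:

```python
import numpy as np

n_tokens, steps = 4, 3  # toy sizes; a higher entry means a noisier token

def make_schedule(uncertainty_scale):
    # noise level of token t at denoising row m: clip(steps + t * uncertainty_scale - m, 0, steps)
    height = steps + (n_tokens - 1) * uncertainty_scale + 1
    t = np.arange(n_tokens)[None, :]
    m = np.arange(height)[:, None]
    return np.clip(steps + t * uncertainty_scale - m, 0, steps)

print("full sequence:\n", make_schedule(0))       # all tokens share one noise level per row
print("pyramid:\n", make_schedule(1))             # later tokens lag behind by a fixed offset
print("autoregressive:\n", make_schedule(steps))  # a token starts only after the previous one is clean
```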
So the diagram in Figure 2 only represents pyramid sampling? (I haven't carefully read the code that builds the `scheduling_matrix` for pyramid sampling 😂)
And for autoregressive sampling (`cfg.scheduling_matrix == "autoregressive"`), the correct approach should be to denoise the next frame conditioned on clean preceding frames, right?
I.e., if we use autoregressive sampling, teacher forcing and diffusion forcing have the same procedure. (This means that in the paper's experiments section, the inference procedure of diffusion forcing and of the teacher-forcing baseline are the same.)
By the way, from my understanding, I believe `horizon` corresponds to the T in Algorithm 2 in the paper. Also, the outer for-loop in Algorithm 2 is `for m in range(scheduling_matrix.shape[0] - 1):`, the inner for-loop (for t = 1, 2, ..., T) is inside `self.diffusion_model.sample_step(...)`, and finally the loop `while curr_frame < n_frames:` stands for the autoregressive rollout, which is not depicted in Algorithm 2.
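In other words, my mental model of the sampling code is roughly the following runnable toy paraphrase, not the actual df_base.py; `max_horizon`, `generate_scheduling_matrix` and `sample_step` are placeholders I made up:

```python
import numpy as np

n_frames, max_horizon, sampling_timesteps = 16, 8, 3  # toy values

def generate_scheduling_matrix(horizon):
    # placeholder autoregressive-style schedule (same toy construction as above)
    height = sampling_timesteps * horizon + 1
    return np.clip(
        sampling_timesteps * (np.arange(horizon)[None, :] + 1) - np.arange(height)[:, None],
        0, sampling_timesteps,
    )

def sample_step(xs, from_levels, to_levels):
    return xs  # stand-in for self.diffusion_model.sample_step(...), i.e. the inner t = 1..T loop

xs = np.zeros(n_frames)
curr_frame = 0
while curr_frame < n_frames:                           # autoregressive rollout (not in Algorithm 2)
    horizon = min(n_frames - curr_frame, max_horizon)  # horizon == T in Algorithm 2 (my reading)
    scheduling_matrix = generate_scheduling_matrix(horizon)
    for m in range(scheduling_matrix.shape[0] - 1):    # outer loop of Algorithm 2 over schedule rows
        from_noise_levels = scheduling_matrix[m]
        to_noise_levels = scheduling_matrix[m + 1]
        xs = sample_step(xs, from_noise_levels, to_noise_levels)
    curr_frame += horizon
```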
If so, since diffusion forcing uses an independent noise level per token, while at inference time the denoising of the next token is conditioned on clean preceding tokens: intuitively, this case (all preceding tokens are clean or have small noise levels) should be encountered very rarely during training (by chance). So, intuitively, does diffusion forcing have a lower convergence speed than teacher forcing?
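(As a rough back-of-the-envelope check of "very rarely", assuming noise levels are drawn independently and uniformly per token; the 1000 timesteps and the 5% "nearly clean" cutoff are arbitrary numbers I picked, not the repo's settings:)

```python
import numpy as np

timesteps = 1000      # assumed number of training noise levels (arbitrary)
n_prev_tokens = 7     # preceding tokens when denoising, e.g., the 8th token
clean_frac = 0.05     # call the bottom 5% of noise levels "nearly clean" (arbitrary cutoff)

# analytic: independent uniform levels => P(all preceding tokens nearly clean) = clean_frac ** n_prev_tokens
print(f"analytic: {clean_frac ** n_prev_tokens:.2e}")

# Monte Carlo sanity check
levels = np.random.randint(0, timesteps, size=(1_000_000, n_prev_tokens))
print("monte carlo:", (levels < clean_frac * timesteps).all(axis=1).mean())
```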
Thank you for your quick response and the highly helpful answers. 👍
Thank you for the easy-to-follow code, but I have some questions about the differences between "Teacher Forcing" and "Diffusion Forcing" at inference (denoising) time.

After I investigated and printed the diffusion schedule (i.e., the `scheduling_matrix`) during the denoising loop, I found that the denoising schedule seems to be doing teacher forcing as well. I'm a bit confused and not sure if my understanding is correct. Here is the .py file I investigated: https://github.com/buoyancy99/diffusion-forcing/blob/e2c4da10d3fe35105b24edbb3eaba7ba099361d7/algorithms/diffusion_forcing/df_base.py#L142

I'm interested in the application of video generation, so I printed the debug info aligned with the configs in `video_dmlab.yaml`. I mainly printed `curr_frame`, `start_frame`, `from_noise_levels` and `to_noise_levels`, as follows...

My finding is that during the denoising process, all the preceding frames are clean, as `from_noise_levels` and `to_noise_levels` have many zeros in their left part. In this case, how is this different from teacher forcing? For diffusion forcing (in the above debug info) and teacher forcing (in Figure 2 from the paper), both denoise the next frame conditioned on all clean preceding frames. If there's any misunderstanding on my part or any problems in my debug, I would greatly appreciate your guidance.
I look forward to your response.