buoyancy99 / diffusion-forcing

code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"

Questions about the differences between "Teacher Forcing" vs "Diffusion Forcing" at inference time #24

Closed Dawn-LX closed 2 days ago

Dawn-LX commented 1 month ago

Thank you for the easy-to-follow code, but I have some questions about the differences between "Teacher Forcing" and "Diffusion Forcing" at inference (denoising) time.

After I investigated and printed the diffusion schedule (i.e., the scheduling_matrix) during the denoising loop, it seems that the denoising schedule is actually also doing teacher forcing. I'm a bit confused and not sure whether my understanding is correct.

Here is the .py file I investigated: https://github.com/buoyancy99/diffusion-forcing/blob/e2c4da10d3fe35105b24edbb3eaba7ba099361d7/algorithms/diffusion_forcing/df_base.py#L142 I'm interested in video generation, so I printed the debug info aligned with the configs in video_dmlab.yaml. I mainly printed curr_frame, start_frame, from_noise_levels, and to_noise_levels, as follows:

curr_frame=2,horizon=1
    {'start_frame': 0, 'end_frame=curr_frame + horizon': 3}
    xs_pred.shape=torch.Size([3, 1, 3, 128, 128]); xs_pred[start_frame:].shape=torch.Size([3, 1, 3, 128, 128])
    from_noise_levels[:,0] :[0, 0, 10]   to_noise_levels[:,0] :[0, 0, 9]
    from_noise_levels[:,0] :[0, 0, 9]   to_noise_levels[:,0] :[0, 0, 8]
    from_noise_levels[:,0] :[0, 0, 8]   to_noise_levels[:,0] :[0, 0, 7]
    from_noise_levels[:,0] :[0, 0, 7]   to_noise_levels[:,0] :[0, 0, 6]
    from_noise_levels[:,0] :[0, 0, 6]   to_noise_levels[:,0] :[0, 0, 5]
    from_noise_levels[:,0] :[0, 0, 5]   to_noise_levels[:,0] :[0, 0, 4]
    from_noise_levels[:,0] :[0, 0, 4]   to_noise_levels[:,0] :[0, 0, 3]
    from_noise_levels[:,0] :[0, 0, 3]   to_noise_levels[:,0] :[0, 0, 2]
    from_noise_levels[:,0] :[0, 0, 2]   to_noise_levels[:,0] :[0, 0, 1]
    from_noise_levels[:,0] :[0, 0, 1]   to_noise_levels[:,0] :[0, 0, 0]
curr_frame=3,horizon=1
    {'start_frame': 0, 'end_frame=curr_frame + horizon': 4}
    xs_pred.shape=torch.Size([4, 1, 3, 128, 128]); xs_pred[start_frame:].shape=torch.Size([4, 1, 3, 128, 128])
    from_noise_levels[:,0] :[0, 0, 0, 10]   to_noise_levels[:,0] :[0, 0, 0, 9]
    from_noise_levels[:,0] :[0, 0, 0, 9]   to_noise_levels[:,0] :[0, 0, 0, 8]
    from_noise_levels[:,0] :[0, 0, 0, 8]   to_noise_levels[:,0] :[0, 0, 0, 7]
    from_noise_levels[:,0] :[0, 0, 0, 7]   to_noise_levels[:,0] :[0, 0, 0, 6]
    from_noise_levels[:,0] :[0, 0, 0, 6]   to_noise_levels[:,0] :[0, 0, 0, 5]
    from_noise_levels[:,0] :[0, 0, 0, 5]   to_noise_levels[:,0] :[0, 0, 0, 4]
    from_noise_levels[:,0] :[0, 0, 0, 4]   to_noise_levels[:,0] :[0, 0, 0, 3]
    from_noise_levels[:,0] :[0, 0, 0, 3]   to_noise_levels[:,0] :[0, 0, 0, 2]
    from_noise_levels[:,0] :[0, 0, 0, 2]   to_noise_levels[:,0] :[0, 0, 0, 1]
    from_noise_levels[:,0] :[0, 0, 0, 1]   to_noise_levels[:,0] :[0, 0, 0, 0]

...

curr_frame=15,horizon=1
    {'start_frame': 0, 'end_frame=curr_frame + horizon': 16}
    xs_pred.shape=torch.Size([16, 1, 3, 128, 128]); xs_pred[start_frame:].shape=torch.Size([16, 1, 3, 128, 128])
    from_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10]   to_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9]
    from_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9]   to_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8]
    from_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8]   to_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7]
    from_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7]   to_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6]
    from_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6]   to_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5]
    from_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5]   to_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4]
    from_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4]   to_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3]
    from_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3]   to_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
    from_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]   to_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    from_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   to_noise_levels[:,0] :[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

My finding is that during the denoising process, all the preceding frames are clean, as from_noise_levels and to_noise_levels contain many zeros on the left. In this case, how is this different from teacher forcing? Both diffusion forcing (in the debug info above) and teacher forcing (in Figure 2 of the paper) denoise the next frame conditioned on all clean preceding frames.
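For reference, the left-zero pattern above is exactly what falls out of pinning already-generated frames at noise level 0 while a fresh chunk steps down from the top level. A tiny sketch (hypothetical helper, not the repo's code) that reproduces the printed trace:

```python
def rollout_noise_levels(curr_frame: int, horizon: int, T: int):
    # Frames already generated (indices < curr_frame) are held at noise
    # level 0; the new chunk of `horizon` frames is denoised from T down to 0.
    rows = []
    for level in range(T, 0, -1):
        from_levels = [0] * curr_frame + [level] * horizon
        to_levels = [0] * curr_frame + [level - 1] * horizon
        rows.append((from_levels, to_levels))
    return rows
```

Here `rollout_noise_levels(2, 1, 10)` yields `([0, 0, 10], [0, 0, 9])` through `([0, 0, 1], [0, 0, 0])`, matching the `curr_frame=2` block above.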

If there is any misunderstanding on my part, or a problem in my debugging, I would greatly appreciate your guidance.

I look forward to your response.

Dawn-LX commented 1 month ago

Just FYI, I uploaded my debug script to a public temp repo, here: https://github.com/Dawn-LX/temp/blob/b1ecdd323e845dedf3188f8a9c9a8b68b0c7fc64/df_debug_demo.py#L6 and

https://github.com/Dawn-LX/temp/blob/b1ecdd323e845dedf3188f8a9c9a8b68b0c7fc64/df_debug_demo.py#L84

and here is my debug config:

        self.frame_stack = 1 # configurations/algorithm/df_base.yaml
        self.chunk_size = 1 # configurations/algorithm/df_base.yaml
        self.n_frames = 16 # for dmlab video dataset, configurations/dataset/video_dmlab.yaml
        self.context_frames = 2 # for dmlab video dataset, configurations/dataset/video_dmlab.yaml
        self.n_tokens = self.n_frames // self.frame_stack

        self.x_shape = (3,128,128) # refer to configurations/dataset/base_video.yaml
        self.x_stacked_shape = list(self.x_shape)
        self.x_stacked_shape[0] *= self.frame_stack

        self.clip_noise = 6.0 # configurations/algorithm/df_video.yaml
        self.device = torch.device("cpu")
buoyancy99 commented 1 month ago

Hi, teacher forcing / diffusion forcing refer to different training techniques.

The to_noise_levels and from_noise_levels instead control the sampling method, be it pyramid, autoregressive, full-sequence, etc.

Teacher forcing supports autoregressive sampling, while diffusion forcing supports all of them. Therefore, there isn't really any contradiction in using autoregressive sampling for both cases.
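To make the three sampling modes concrete, here is a minimal NumPy sketch (illustrative only, not the repo's exact scheduling-matrix code) for a window of `horizon` tokens and `T` noise levels; each row is one denoising step, and column t holds token t's noise level:

```python
import numpy as np

def full_sequence_schedule(horizon: int, T: int) -> np.ndarray:
    # All tokens share one noise level per step: ordinary full-sequence diffusion.
    return np.repeat(np.arange(T, -1, -1)[:, None], horizon, axis=1)

def pyramid_schedule(horizon: int, T: int) -> np.ndarray:
    # Each later token lags one noise level behind its predecessor,
    # producing a staircase of noise levels across the window.
    m = np.arange(T + horizon)[:, None]   # denoising step index
    t = np.arange(horizon)[None, :]       # token index within the window
    return np.clip(T + t - m, 0, T)

def autoregressive_schedule(horizon: int, T: int) -> np.ndarray:
    # Token t stays fully noised until all earlier tokens are clean,
    # then is denoised one level per step: next-token-style sampling.
    m = np.arange(T * horizon + 1)[:, None]
    t = np.arange(horizon)[None, :]
    return np.clip(T * (t + 1) - m, 0, T)
```

For example, `pyramid_schedule(3, 4)` starts at `[4, 4, 4]` and cleans token 0 while tokens 1 and 2 still carry noise, whereas `autoregressive_schedule(3, 4)` holds tokens 1 and 2 at level 4 until token 0 reaches 0, matching the debug trace above.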

Dawn-LX commented 1 month ago

> Hi, teacher forcing / diffusion forcing refer to different training techniques.
>
> The to_noise_levels and from_noise_levels instead control the sampling method, be it pyramid, autoregressive, full-sequence, etc.
>
> Teacher forcing supports autoregressive sampling, while diffusion forcing supports all of them. Therefore, there isn't really any contradiction in using autoregressive sampling for both cases.

So the diagram in Figure 2 only represents pyramid sampling? (I haven't carefully read the code that builds the scheduling_matrix for pyramid sampling 😂) And for autoregressive sampling (cfg.scheduling_matrix == "autoregressive"), the correct approach is to denoise the next frame conditioned on clean preceding frames, right? That is, if we use autoregressive sampling, teacher forcing and diffusion forcing have the same procedure, which means that in the paper's experiments section, the inference procedure of diffusion forcing and the teacher-forcing baseline are the same.

Dawn-LX commented 1 month ago

By the way, from my understanding, horizon corresponds to the T in Algorithm 2 of the paper. Also, the outer for-loop in Algorithm 2 is `for m in range(scheduling_matrix.shape[0] - 1):`, the inner for-loop (for t = 1, 2, ..., T) is inside `self.diffusion_model.sample_step(...)`, and finally the loop `while curr_frame < n_frames:` is the autoregressive rollout, which is not depicted in Algorithm 2.
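The mapping described above can be sketched as a loop skeleton (hypothetical names; only the control flow is meant to mirror df_base.py, with the per-token denoising of Algorithm 2's inner loop hidden inside `sample_step`):

```python
import numpy as np

def sample_video(n_frames, horizon, scheduling_matrix, sample_step):
    # Outer while-loop: autoregressive rollout over chunks (not in Algorithm 2).
    # Inner for-loop over matrix rows: Algorithm 2's outer loop.
    xs_pred, curr_frame = [], 0
    while curr_frame < n_frames:
        chunk = object()  # stands in for a freshly noised chunk of frames
        for m in range(scheduling_matrix.shape[0] - 1):
            chunk = sample_step(chunk, scheduling_matrix[m], scheduling_matrix[m + 1])
        xs_pred.append(chunk)
        curr_frame += horizon
    return xs_pred
```

With a 3-row scheduling matrix and horizon=2 over 4 frames, `sample_step` is called twice per chunk, for two chunks.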

Dawn-LX commented 1 month ago

> Hi, teacher forcing / diffusion forcing refer to different training techniques.
>
> The to_noise_levels and from_noise_levels instead control the sampling method, be it pyramid, autoregressive, full-sequence, etc.
>
> Teacher forcing supports autoregressive sampling, while diffusion forcing supports all of them. Therefore, there isn't really any contradiction in using autoregressive sampling for both cases.

If so: diffusion forcing uses an independent noise level per token during training, while at inference time the next token is denoised conditioned on clean preceding tokens. Intuitively, this inference-time case (all preceding tokens clean or at small noise levels) should be encountered only rarely during training (by chance). So, intuitively, does diffusion forcing converge more slowly than teacher forcing?
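To make the concern concrete, here is a NumPy sketch (hypothetical helpers, not the repo's code) of how the two training regimes would assign per-token noise levels; under independent uniform sampling, an all-clean prefix of length k occurs with probability (1/(T+1))^k per sequence:

```python
import numpy as np

rng = np.random.default_rng(0)

def df_noise_levels(batch: int, n_tokens: int, T: int) -> np.ndarray:
    # Diffusion Forcing training: every token draws its own noise level
    # independently and uniformly from {0, ..., T}.
    return rng.integers(0, T + 1, size=(batch, n_tokens))

def tf_noise_levels(batch: int, n_tokens: int, T: int) -> np.ndarray:
    # Teacher-forcing-style training: context tokens are clean (level 0);
    # only the final token carries a random noise level.
    levels = np.zeros((batch, n_tokens), dtype=int)
    levels[:, -1] = rng.integers(0, T + 1, size=batch)
    return levels
```

Under `df_noise_levels`, a batch row whose first 15 tokens are all 0 (the pattern autoregressive inference relies on) appears with probability (1/11)^15 for T=10, which is essentially never.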

Dawn-LX commented 1 month ago

> Hi, teacher forcing / diffusion forcing refer to different training techniques.
>
> The to_noise_levels and from_noise_levels instead control the sampling method, be it pyramid, autoregressive, full-sequence, etc.
>
> Teacher forcing supports autoregressive sampling, while diffusion forcing supports all of them. Therefore, there isn't really any contradiction in using autoregressive sampling for both cases.

Thank you for your quick response and the highly helpful answers. 👍