alimama-creative / M3DDM-Video-Outpainting

Official repo for Hierarchical Masked 3D Diffusion Model for Video Outpainting
https://fanfanda.github.io/M3DDM/
Apache License 2.0

about the cfg in training #4

Closed: xiangweifeng closed this issue 9 months ago

xiangweifeng commented 9 months ago

Hi Fan, the paper (Section 3.2.3) mentions: "One is the context information of the video $c_1$, and the other is the global video clip $c_2$. We jointly train the unconditional and conditional models by randomly setting $c_1$ and $c_2$ to a fixed null value $\emptyset$ with probabilities $p_1$ and $p_2$." I cannot find the values of $p_1$ and $p_2$; could you provide the reference values?

In Section 3.2.1, the paper mentions: "The 'mask all' strategy enables the model to perform unconditional generation, which allows us to adopt the classifier-free guidance [20] technique during the inference phase." Is the reason CFG can be used at inference the training strategy described in Section 3.2.3, or the "mask all" strategy?

fanfanda commented 9 months ago

Sorry, the details here were not clearly described in the paper. We used $p_1=0.35$, $p_2=0.1$ during training.

The reason that classifier-free guidance can be used in the inference stage is that we employed joint training during the training process (unconditional and conditional). Section 3.2.3 describes our inference strategy.
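For concreteness, here is a minimal sketch of what this joint conditional/unconditional training could look like in PyTorch. The names `null_context` / `null_global` (standing in for the fixed null value $\emptyset$), the tensor shapes, and the `drop_conditions` helper are illustrative assumptions, not the repo's actual code:

```python
# Sketch: randomly drop each condition with probability p1 / p2, per sample.
import torch

P1, P2 = 0.35, 0.10  # drop probabilities for c1 (context) and c2 (global clip)

def drop_conditions(c1, c2, null_context, null_global):
    """Independently replace each condition with its null value.

    c1: context frames, shape (B, ...); c2: global video clip, shape (B, ...).
    null_context / null_global: tensors broadcastable to c1 / c2.
    """
    batch = c1.shape[0]
    drop1 = torch.rand(batch, device=c1.device) < P1  # "mask all" for context
    drop2 = torch.rand(batch, device=c2.device) < P2  # drop the global clip
    c1 = torch.where(drop1.view(-1, *([1] * (c1.dim() - 1))), null_context, c1)
    c2 = torch.where(drop2.view(-1, *([1] * (c2.dim() - 1))), null_global, c2)
    return c1, c2
```

With $p_1 = 0.35$ and $p_2 = 0.1$ applied independently, a fraction of each batch sees neither condition, which is what lets the same network serve as the unconditional model for CFG.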

xiangweifeng commented 9 months ago

Hi Fan, in your mask strategy you designed "mask all" for some frames for CFG. Then during training, as mentioned before, do you "mask all" the whole video for CFG again?

fanfanda commented 9 months ago

> Hi Fan, in your mask strategy you designed "mask all" for some frames for CFG. Then during training, as mentioned before, do you "mask all" the whole video for CFG again?

In our inference stage, we employ unconditional generation (as part of classifier-free guidance); therefore, during the training phase, a certain proportion of batches must not receive context information, i.e., we apply the "mask all" strategy.
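For illustration, a minimal sketch of how those unconditional predictions are combined with conditional ones at inference, assuming the model predicts noise from the latent `x_t`, timestep `t`, and the two conditions. The guidance scale `w` and the function signature are assumptions; M3DDM's exact formulation may differ:

```python
# Sketch: standard classifier-free guidance step.
import torch

@torch.no_grad()
def guided_eps(model, x_t, t, c1, c2, null_context, null_global, w=2.0):
    # Conditional prediction (context frames + global clip given).
    eps_cond = model(x_t, t, c1, c2)
    # Unconditional prediction: both conditions set to the null value,
    # which the "mask all" training strategy teaches the model to handle.
    eps_uncond = model(x_t, t, null_context, null_global)
    # Push the prediction away from the unconditional one by scale w.
    return eps_uncond + w * (eps_cond - eps_uncond)
```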

xiangweifeng commented 8 months ago

> Sorry, the details here were not clearly described in the paper. We used $p_1=0.35$, $p_2=0.1$ during training.
>
> The reason that classifier-free guidance can be used in the inference stage is that we employed joint training during the training process (unconditional and conditional). Section 3.2.3 describes our inference strategy.

Hi Fan, is $p_1=0.35$ for the context information of the video $c_1$, and $p_2=0.9$ for the global video?

fanfanda commented 8 months ago

We used $p_1=0.35$ (context information not given), $p_2=0.1$ (global video not given) during training.