Why does the triangle CFG scheme work for driving videos?

Two things combine here. First, our model uses the last few frames of each clip as condition frames to predict long-term futures autoregressively. Second, classifier-free guidance (CFG) tends to produce slightly over-saturated outputs. If we use a constant CFG scale for long-horizon prediction, each round conditions on frames that are already slightly saturated, so the over-saturation accumulates rapidly, as shown in the paper. The triangle CFG scheme mitigates this by assigning only moderate guidance scales to the frames that will be used as conditions in the next prediction round. Thanks to sufficient temporal interaction, the high-quality intermediate frames can also improve the last few frames, even though those frames have relatively low guidance scales.
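To make the idea concrete, here is a minimal sketch of a triangle-shaped per-frame guidance schedule. It is not taken from the Vista codebase: the function names `triangle_cfg_scales` and `apply_cfg`, the symmetric peak at mid-clip, and the example values (`min_scale=1.0`, `max_scale=2.5`, 25 frames) are all assumptions chosen for illustration, and the real schedule may differ in shape and range.

```python
def triangle_cfg_scales(num_frames, min_scale=1.0, max_scale=2.5):
    """Per-frame guidance scales that rise to a mid-clip peak and fall back.

    Hypothetical helper: the default values and the symmetric peak
    are illustrative assumptions, not the project's settings.
    """
    if num_frames < 2:
        return [min_scale] * num_frames
    scales = []
    for i in range(num_frames):
        t = i / (num_frames - 1)          # frame position in [0, 1]
        wave = 1.0 - abs(2.0 * t - 1.0)   # 0 at both ends, 1 at the middle
        scales.append(min_scale + (max_scale - min_scale) * wave)
    return scales


def apply_cfg(uncond, cond, scales):
    """Standard classifier-free guidance with one scale per frame.

    `uncond` and `cond` stand in for per-frame model outputs; plain
    floats are used here so the sketch runs without dependencies.
    """
    return [u + s * (c - u) for u, c, s in zip(uncond, cond, scales)]


if __name__ == "__main__":
    scales = triangle_cfg_scales(25)
    # The last frames, reused as conditions in the next round,
    # get scales close to min_scale.
    print([round(s, 2) for s in scales[-3:]])
```

With a constant schedule every entry would equal `max_scale`, so the frames carried into the next round would be the most strongly guided ones. Under the triangle shape, only the middle of the clip receives the full scale.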

