OpenDriveLab / Vista

A Generalizable World Model for Autonomous Driving
https://vista-demo.github.io
Apache License 2.0

Why does the triangle CFG scheme work for driving videos? #18

xjixzz closed 2 weeks ago

xjixzz commented 3 weeks ago

SV3D adopts the triangle CFG scheme because its camera trajectory loops around a 3D object. But the trajectory of a driving video is not a loop, so why does it still work?

Little-Podi commented 3 weeks ago

Our model uses the last few frames of each clip as condition frames to predict long-term futures autoregressively. Classifier-free guidance, however, tends to produce slightly over-saturated frames. With a constant CFG scale in long-horizon prediction, that over-saturation therefore accumulates rapidly from one prediction round to the next, as shown in the paper. The triangle CFG mitigates this problem by assigning moderate guidance scales to the frames that will serve as conditions in the next prediction round. Thanks to sufficient temporal interaction, the high-quality intermediate frames can also enhance the last few frames, which have relatively low guidance scales.
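
To make the idea concrete, here is a minimal sketch of a per-frame triangle guidance schedule. It is not taken from the Vista codebase: the function names `triangle_cfg_scales` and `apply_cfg`, the linear ramp, and the `min_scale`/`max_scale` values are all assumptions chosen for illustration. The only standard piece is the classifier-free guidance formula itself.

```python
def triangle_cfg_scales(num_frames, min_scale=1.0, max_scale=2.5):
    """Per-frame guidance scales that rise linearly to a mid-clip peak and fall back.

    Hypothetical helper: the scale values are placeholders, not Vista's settings.
    For an even num_frames the two middle frames sit just below max_scale.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    if num_frames == 1:
        return [min_scale]
    mid = (num_frames - 1) / 2
    scales = []
    for i in range(num_frames):
        # Distance from the clip centre, normalised to [0, 1].
        t = abs(i - mid) / mid
        scales.append(max_scale - (max_scale - min_scale) * t)
    return scales


def apply_cfg(uncond, cond, scale):
    """Standard classifier-free guidance, shown here on scalar predictions."""
    return uncond + scale * (cond - uncond)


scales = triangle_cfg_scales(num_frames=5)
print(scales)  # [1.0, 1.75, 2.5, 1.75, 1.0]
```

In this sketch the last frames of a clip, which would be reused as condition frames in the next round, get the lowest scales, so less guidance-induced saturation is carried forward than with a constant scale.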