OpenDriveLab / Vista

A Generalizable World Model for Autonomous Driving
https://vista-demo.github.io
Apache License 2.0

Question about long-horizon generation #14

Closed · woodfrog closed this 4 days ago

woodfrog commented 1 week ago

Hi, great work and thank you for open-sourcing! On your visualization page, you show results with 50 frames (5 s at 10 FPS) and 160 frames (16 s at 10 FPS), and I have some questions about the long-horizon generation:

(1) Which parameters need to change to generate these variable-length videos? Is it only n_rounds? I tried n_rounds=6 following the doc, but the average video quality looks noticeably worse than the ones on the visualization page. Is there anything else we should set up?

(2) Neither 50 nor 160 equals 25 + 22 * (n_rounds - 1), so what settings produce a 16 s video at 10 FPS? Do you set n_rounds=7 and take the first 160 frames, or do you change n_frames as well?

Thank you!

Little-Podi commented 5 days ago

Hi there, thanks for your interest. Sorry for the late reply; I had an extremely busy week.

> Which parameters need to change to generate these variable-length videos? Is it only n_rounds? I tried n_rounds=6 following the doc, but the average video quality looks noticeably worse than the ones on the visualization page. Is there anything else we should set up?

Yes, the prediction length is controlled by n_rounds; there is no other special argument for it. Failure cases always exist. I am planning to release a version that has been trained for longer, but please do not expect too much. That said, it is worth noting that if the same process is applied to SVD, its results are much worse than ours, especially in terms of content continuity.

> Neither 50 nor 160 equals 25 + 22 * (n_rounds - 1), so what settings produce a 16 s video at 10 FPS? Do you set n_rounds=7 and take the first 160 frames, or do you change n_frames as well?

Sorry for the confusion. On our demo page, the videos in the first section contain 25 + 22 * (2 - 1) = 47 frames, and the videos in the second section contain 25 + 22 * (7 - 1) = 157 frames. Hope this answers your questions.
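
To make the arithmetic concrete, here is a minimal sketch (an illustration only, not code from this repo). Note that the 3-frame overlap reading in the comments is inferred from the formula, not quoted from the docs:

```python
# Frame-count arithmetic for multi-round generation, as described above:
# the first round yields 25 frames and every additional round adds 22 new
# frames. The formula is consistent with each round re-using the last 3
# frames of the previous round as conditioning (25 - 3 = 22), though that
# overlap is an inference here, not a stated detail.

FIRST_ROUND_FRAMES = 25    # frames produced by the first round
NEW_FRAMES_PER_ROUND = 22  # net new frames added by each later round
FPS = 10                   # playback rate used on the demo page

def total_frames(n_rounds: int) -> int:
    """Total frames after n_rounds of autoregressive generation."""
    return FIRST_ROUND_FRAMES + NEW_FRAMES_PER_ROUND * (n_rounds - 1)

for n in (1, 2, 6, 7):
    frames = total_frames(n)
    print(f"n_rounds={n}: {frames} frames, ~{frames / FPS:.1f} s at {FPS} FPS")

# n_rounds=2 -> 47 frames (~4.7 s, shown as "5s" on the demo page)
# n_rounds=7 -> 157 frames (~15.7 s, shown as "16s")
```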

woodfrog commented 4 days ago

Thank you for the answers!