OpenDriveLab / Vista

[NeurIPS 2024] A Generalizable World Model for Autonomous Driving
https://vista-demo.github.io
Apache License 2.0

Questions on evaluation experiments in nuScenes validation dataset #32

Open Fengtao22 opened 4 weeks ago

Fengtao22 commented 4 weeks ago

Hi, first of all, thanks for open-sourcing your work! I have three questions about your video generation on the nuScenes validation dataset:

  1. In your nuScenes_val.json file, there are 5369 samples in total (each sample contains 25 frames). This number does not match the number of validation samples in nuScenes (6019 frames). Is it because you filter out the frames that do not have 2 seconds of future frames? By my count, there are 5951 unique validation frames in your JSON file, and it seems that 17 video clips among the 150 scenes do not have future frames (17 × 4 = 6019 − 5951).

  2. In your nuScenes_val.json file, the provided traj contains 10 elements, which I believe are the ego trajectory points for the next 2 seconds, including the current position, expressed relative to the first frame of that sample. In other words, the traj list is laid out as [x(t), y(t), x(t+0.5), y(t+0.5), x(t+1.0), y(t+1.0), x(t+1.5), y(t+1.5), x(t+2.0), y(t+2.0)], with times in seconds (see the parsing sketch after this list). Correct me if I am wrong. Why don't we use 2.5 seconds of future predictions, since the default frame number is 25 and the frequency is 10 Hz?

  3. To get the L2 vs. reward relationship, how did you calculate the L2 (by comparing the ground-truth future trajectory with a randomly sampled trajectory)? How many samples (25 frames per sample) did you use to get the average reward at a fixed L2 error (Figure 10, left, of your paper)?
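
For concreteness, here is how I am reading that layout; treat the key names and top-level structure as my assumptions about nuScenes_val.json:

```python
import json

# My assumed layout: 10 floats = 5 (x, y) waypoints at t, t+0.5, ..., t+2.0 s,
# expressed relative to the first frame of the sample.
with open("nuScenes_val.json") as f:
    samples = json.load(f)  # assuming a top-level list of sample dicts

traj = samples[0]["traj"]                      # "traj" key as I read the file
waypoints = list(zip(traj[0::2], traj[1::2]))  # [(x, y), ...] every 0.5 s
for i, (x, y) in enumerate(waypoints):
    print(f"t = {i * 0.5:.1f} s: x = {x:.2f}, y = {y:.2f}")
```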

Thanks!

Little-Podi commented 8 hours ago

Hi @Fengtao22, thanks for your interest!

In your nuScenes_val.json file, there are 5369 samples in total (each sample contains 25 frames). This number does not match the number of validation samples in nuScenes (6019 frames). Is it because you filter out the frames that do not have 2 seconds of future frames?

Yes, I did some filtering to omit invalid samples. I just published the complete processing script in #30, which results in 5369 validation samples.
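
For intuition, a minimal sketch of the idea with the nuscenes-devkit is below; the script in #30 remains the authoritative filter, and the version string and data root here are placeholders:

```python
from nuscenes.nuscenes import NuScenes
from nuscenes.utils.splits import create_splits_scenes

# Placeholders: point these at your local nuScenes installation.
nusc = NuScenes(version="v1.0-trainval", dataroot="/data/nuscenes")
val_scenes = set(create_splits_scenes()["val"])

def has_future(sample, steps=4):
    # Keyframes are logged at 2 Hz, so 2 s of future = 4 more keyframes
    # reachable through the `next` pointers.
    token = sample["next"]
    for _ in range(steps):
        if token == "":
            return False
        token = nusc.get("sample", token)["next"]
    return True

kept = [s for s in nusc.sample
        if nusc.get("scene", s["scene_token"])["name"] in val_scenes
        and has_future(s)]
print(len(kept))  # the full filtering in #30 is what yields the 5369 samples
```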

Why don't we use 2.5 seconds of future predictions, since the default frame number is 25 and the frequency is 10 Hz?

You are right: the nuScenes samples are 2 seconds long, and our model predicts 25 frames at 10 Hz. However, nuScenes (including sweeps) is logged at 12 Hz, so 2 seconds of video yields 24 frames. Therefore, to align exactly with the model input, we take one initial frame plus 2 seconds of video (1 + 24 = 25 frames) to build the sequence.
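
As a quick sanity check of the arithmetic:

```python
# One conditioning frame + 2 s of 12 Hz sweeps = 25 frames.
SWEEP_HZ = 12    # nuScenes camera logging rate (including sweeps)
HORIZON_S = 2    # future horizon of each validation sample
n_frames = 1 + SWEEP_HZ * HORIZON_S
assert n_frames == 25  # matches the model's 25-frame input sequence
```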

To get the L2 vs. reward relationship, how did you calculate the L2 (by comparing the ground-truth future trajectory with a randomly sampled trajectory)? How many samples (25 frames per sample) did you use to get the average reward at a fixed L2 error (Figure 10, left, of your paper)?

Sorry for the confusion; here is how I get each point in Figure 10:

For each data point, I go through 1500 samples from the Waymo validation set. For each sample, I generate a random trajectory with that fixed L2 deviation and perform reward estimation. In short, it takes 1500 × 5 × 10 = 75,000 denoising steps to get each data point. The most expensive figure I have ever drawn!
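
In rough pseudo-Python, the loop per data point looks like the sketch below. `estimate_reward` stands in for Vista's reward estimator, and treating the factor 5 as repeated estimations and the factor 10 as denoising steps is my reading of the 1500 × 5 × 10 budget, not something taken from the released code:

```python
import numpy as np

def perturb(gt_traj, l2_dev, rng):
    # Placeholder: a random trajectory at a fixed L2 distance from ground truth.
    noise = rng.normal(size=np.shape(gt_traj))
    noise *= l2_dev / np.linalg.norm(noise)   # scale so the L2 error is exact
    return np.asarray(gt_traj) + noise

def reward_at_l2(samples, l2_dev, estimate_reward, n_runs=5, n_denoise=10):
    # One point on the L2-vs-reward curve, averaged over all samples.
    rng = np.random.default_rng(0)
    rewards = []
    for sample in samples:                     # 1500 Waymo validation samples
        traj = perturb(sample["gt_traj"], l2_dev, rng)
        for _ in range(n_runs):                # repeated reward estimations
            rewards.append(estimate_reward(sample["frames"], traj,
                                           denoise_steps=n_denoise))
    return float(np.mean(rewards))             # averaged into one data point
```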