Fengtao22 opened this issue 2 months ago
Hi @Fengtao22, thanks for your interest!
In your nuScenes_val.json file, there are 5369 samples in total (each sample contains 25 frames). This number does not match the number of validation samples in nuScenes (6019 frames). Is it because you filter out frames that do not have 2 seconds of future frames?
Yes, I did some filtering to omit invalid samples. I just published the complete processing script in #30, which results in 5369 validation samples.
Why don't we use 2.5 seconds of future prediction, since the default frame number is 25 and the frequency is 10 Hz?
You are right: the nuScenes clips are 2 seconds long, and our model predicts 25 frames at 10 Hz. However, nuScenes (including sweeps) is logged at 12 Hz. Therefore, to align exactly with the model input, we take one initial frame plus 2 seconds of video (24 frames at 12 Hz), which yields a sequence of 25 frames.
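The arithmetic above can be checked with a tiny sketch (the function name and signature are illustrative, not from the released code):

```python
def clip_length_frames(duration_s: float, log_hz: int) -> int:
    """Frames in a clip: one initial frame plus `duration_s` seconds
    of video logged at `log_hz` frames per second."""
    return 1 + round(duration_s * log_hz)

# 1 initial frame + 2 s at 12 Hz = 1 + 24 = 25 frames, matching the
# model's 25-frame input. The model's 25 frames at 10 Hz span 2.5 s of
# prediction, but the source video itself is 12 Hz, hence 2-second clips.
assert clip_length_frames(2.0, 12) == 25
```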
For the L2-vs-reward relationship, how did you calculate the L2 (between the ground-truth future trajectory and a randomly sampled trajectory)? How many samples (25 frames per sample) do you use to obtain the average reward for a fixed L2 error (Figure 10, left, in your paper)?
Sorry for the confusion; here is how I get each point in Figure 10:
- Collect a subset by uniformly sampling from each command category on the Waymo validation set. The resulting subset has 1500 samples in total.
- Create a list of increasing trajectory deviations, using the correlating strategy specified in the paper.
- For each trajectory deviation, generate random offsets to perturb the ground truth trajectory, then perform reward estimation by denoising 10 steps with an ensemble size of 5.
- Go through the created subset, computing the L2 errors and estimated rewards simultaneously. Once finished, average all errors and rewards to obtain one data point. The L2 errors in Figure 10 come from a pre-defined list of trajectory deviations; that is why the intervals between them are not strictly uniform.
Thus, for each data point in Figure 10, I go through 1500 samples from the Waymo validation set. For each sample, I generate a random trajectory with the same deviation and perform reward estimation. In short, it takes 1500 x 5 x 10 = 75000 denoising steps to produce each data point. The most expensive figure I have ever drawn!
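The procedure above can be sketched as follows. Helper names like `estimate_reward` and `figure10_point` are illustrative stand-ins, not the released code, and the reward estimator here is a random placeholder that only tracks the denoising-step budget:

```python
import random

def estimate_reward(sample, traj, ensemble_size=5, denoise_steps=10):
    """Placeholder for the diffusion-based reward estimator: denoise
    `denoise_steps` steps with an ensemble of `ensemble_size`, returning
    (reward, denoising_steps_spent)."""
    return random.random(), ensemble_size * denoise_steps  # stand-in

def l2_error(gt_traj, traj):
    """Mean Euclidean distance between matching waypoints."""
    return sum(((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
               for (x1, y1), (x2, y2) in zip(gt_traj, traj)) / len(gt_traj)

def figure10_point(samples, deviation):
    """One Figure-10 data point: perturb each sample's ground-truth
    trajectory by `deviation`, estimate its reward, and average both
    the L2 errors and the rewards over the whole subset."""
    errors, rewards, steps = [], [], 0
    for sample in samples:
        gt = sample["traj"]
        perturbed = [(x + random.gauss(0, deviation),
                      y + random.gauss(0, deviation)) for x, y in gt]
        reward, spent = estimate_reward(sample, perturbed)
        errors.append(l2_error(gt, perturbed))
        rewards.append(reward)
        steps += spent
    return sum(errors) / len(errors), sum(rewards) / len(rewards), steps
```

With 1500 samples, an ensemble of 5, and 10 denoising steps, `steps` comes out to 1500 x 5 x 10 = 75000 per data point, matching the count quoted above.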
Thanks for your detailed response! When you calculate the L2 errors, do you accumulate the norms over the four future points, or average the norm over the four values?
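For reference, the two conventions this question contrasts look like this (hypothetical helper names; the thread does not say which one the paper uses):

```python
def l2_per_waypoint(gt, pred):
    """Euclidean distance at each future waypoint."""
    return [((gx - px) ** 2 + (gy - py) ** 2) ** 0.5
            for (gx, gy), (px, py) in zip(gt, pred)]

def l2_summed(gt, pred):
    """Convention 1: accumulate the norms over the future points."""
    return sum(l2_per_waypoint(gt, pred))

def l2_averaged(gt, pred):
    """Convention 2: average the norms over the future points."""
    dists = l2_per_waypoint(gt, pred)
    return sum(dists) / len(dists)

gt   = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
pred = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (4.0, 1.0)]
assert l2_summed(gt, pred) == 4.0    # four points, each 1 m off
assert l2_averaged(gt, pred) == 1.0
```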
Hi, first of all, thanks for open sourcing your work! I have three questions about your video generation on the nuScenes validation dataset:
In your nuScenes_val.json file, there are 5369 samples in total (each sample contains 25 frames). This number does not match the number of validation samples in nuScenes (6019 frames). Is it because you filter out frames that do not have 2 seconds of future frames? I did a calculation: there are 5951 unique validation samples in your json file, and it seems 17 video clips among the 150 scenes do not have future frames (17 * 4 = 6019 - 5951).
In your nuScenes_val.json file, the provided traj contains 10 elements, which I believe are the future 2-second ego trajectory points, including the current position, expressed with respect to the first frame of that sample. In other words, the traj list is [x_t, y_t, x_{t+0.5}, y_{t+0.5}, x_{t+1}, y_{t+1}, x_{t+1.5}, y_{t+1.5}, x_{t+2}, y_{t+2}], with the subscripts in seconds. Correct me if I am wrong. Why don't we use 2.5 seconds of future predictions, since the default frame number is 25 and the frequency is 10 Hz?
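Under this reading (five (x, y) pairs at 0.5 s spacing, the first being the current position), the flat list can be unpacked like this (a sketch of my assumption, not the repo's code):

```python
def unpack_traj(traj, dt=0.5):
    """Turn a flat list [x0, y0, x1, y1, ...] into (timestamp, (x, y))
    waypoints spaced `dt` seconds apart, starting at t = 0."""
    assert len(traj) % 2 == 0, "expected interleaved x/y coordinates"
    points = list(zip(traj[0::2], traj[1::2]))
    return [(i * dt, p) for i, p in enumerate(points)]

traj = [0.0, 0.0, 1.0, 0.1, 2.1, 0.2, 3.3, 0.3, 4.6, 0.4]
waypoints = unpack_traj(traj)
assert len(waypoints) == 5
assert waypoints[0] == (0.0, (0.0, 0.0))  # current position at t = 0
assert waypoints[-1][0] == 2.0            # last point is 2 s ahead
```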
To get the L2 vs reward relationship, how did you calculate the L2 (using the ground truth future trajectory with a random sampled trajectory?) How many samples (25 frames a sample) do you use for getting average reward for a fixed L2 error (in Figure 10 left of your paper)?
Thanks!