OpenDriveLab / DriveAGI

[Incl. GenAD, CVPR 2024 Highlight] Embracing Foundation Models into Autonomous Agent and System
https://arxiv.org/abs/2403.09630
Apache License 2.0

Clarification on Diffusion Timestep and Feature Extraction Process when using GenAD for Planning #15

Closed ABaldrati closed 3 weeks ago

ABaldrati commented 3 weeks ago

Dear Authors,

First of all, thank you for your excellent work on the paper. I have studied your approach and have some questions about the feature extraction process when the GenAD encoder is used for planning, as described in the following paragraph:

Tab. 5 shows the planning results on nuScenes where ground truth poses are available for the ego vehicle. By freezing the GenAD encoder and only optimizing an additional MLP on top of it, the model can effectively learn to plan. In particular, by pre-extracting image features using GenAD's UNet encoder, the entire learning process for plan adaptation takes only 10 minutes on a single NVIDIA Tesla V100 device, which is 3400 times more efficient than training the UniAD planner.

I am particularly interested in understanding the following points:

  1. Since GenAD is a diffusion denoising network that takes a diffusion timestep as input, could you specify which diffusion timestep is used during the feature extraction process?
  2. Can you confirm that cleaned frames are used as input during this feature extraction process?
  3. In addition, I would appreciate it if you could provide more details on the following:
    • From which GenAD encoder layer (or layers) are the features extracted?
    • Could you please provide more details on how the features are processed in the MLP?

Thank you very much for your time and help.

Best regards,

Little-Podi commented 3 weeks ago

Hi Alberto,

Since GenAD is a diffusion denoising network that takes a diffusion timestep as input, could you specify which diffusion timestep is used during the feature extraction process?

We use timestep 0, which corresponds to the lowest noise level, with no noise added to the input. Note that GenAD's UNet takes $\sigma$ as the condition that indicates the timestep. The translation from timestep to $\sigma$ follows this implementation.
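For readers unfamiliar with $\sigma$-conditioned UNets, here is a minimal sketch of one common way a discrete timestep is mapped to $\sigma$: the DDPM-style relation $\sigma_t = \sqrt{(1 - \bar{\alpha}_t) / \bar{\alpha}_t}$. This is an assumption for illustration, not the implementation linked above. The function name `ddpm_sigmas`, the scaled-linear beta schedule, and the values `beta_start=0.00085` and `beta_end=0.012` (typical Stable Diffusion style defaults) are not taken from this thread.

```python
import math

def ddpm_sigmas(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Return sigma_t for t = 0..num_steps-1 (assumed scaled-linear beta schedule)."""
    sigmas = []
    alpha_bar = 1.0  # running product of (1 - beta_s) for s <= t
    sqrt_start, sqrt_end = math.sqrt(beta_start), math.sqrt(beta_end)
    for t in range(num_steps):
        # Scaled-linear schedule: interpolate sqrt(beta) linearly, then square.
        frac = t / (num_steps - 1)
        beta = (sqrt_start + frac * (sqrt_end - sqrt_start)) ** 2
        alpha_bar *= 1.0 - beta
        # DDPM-style conversion from cumulative alpha to a noise level sigma.
        sigmas.append(math.sqrt((1.0 - alpha_bar) / alpha_bar))
    return sigmas

sigmas = ddpm_sigmas()
sigma_0 = sigmas[0]  # noise level that would be passed as the condition at timestep 0
print(f"sigma at timestep 0: {sigma_0:.5f}")
```

Under these assumed settings, the $\sigma$ at timestep 0 is the smallest value in the schedule and is small rather than exactly zero. Whether GenAD's own conversion behaves the same way should be checked against the linked implementation.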

Can you confirm that cleaned frames are used as input during this feature extraction process?

Yes, the original frames are used as input, without noise augmentation.

From which GenAD encoder layer (or layers) are the features extracted?

Our feature extraction process ends at the middle block of GenAD's UNet (see the illustration below); none of the upsampling blocks are used for this task.

[image: illustration of GenAD's UNet]

Could you please provide more details on how the features are processed in the MLP?

We simply flatten the feature map after it is extracted by the UNet. The feature sequence is then sent to an MLP that regresses the planning waypoints.
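The pipeline described in this reply (run the encoder up to the middle block, flatten, regress waypoints with an MLP) can be sketched roughly as follows. This is a shape-level illustration under stated assumptions, not the authors' code: the stand-in encoder blocks, the input size, the hidden width, the number of waypoints, and all function names are hypothetical, and the real UNet blocks also receive the $\sigma$ condition, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample(x):
    """Stand-in encoder block: 2x2 average pooling over a (C, H, W) array."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def extract_features(x, down_blocks, middle_block):
    """Run the downsampling path and the middle block; no upsampling block is called."""
    for block in down_blocks:
        x = block(x)
    return middle_block(x)

def mlp(v, params):
    """Plain MLP: ReLU between layers, linear output layer."""
    for i, (weight, bias) in enumerate(params):
        v = v @ weight + bias
        if i < len(params) - 1:
            v = np.maximum(v, 0.0)
    return v

# Hypothetical input: one clean (noise-free) frame representation of shape (C, H, W).
frame = rng.standard_normal((4, 16, 16))

# Frozen feature extraction that stops at the middle block (identity stand-in here).
features = extract_features(frame, [downsample, downsample], lambda x: x)

# Flatten the feature map into a single vector.
flat = features.reshape(-1)

# Hypothetical head: regress 6 future (x, y) waypoints.
num_waypoints = 6
sizes = [flat.size, 32, num_waypoints * 2]
params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
waypoints = mlp(flat, params).reshape(num_waypoints, 2)
print(features.shape, flat.shape, waypoints.shape)
```

Because the encoder is frozen, `flat` can be computed once per sample and cached, so only the MLP parameters need to be optimized. That matches the pre-extraction setup quoted from the paper earlier in this thread.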

ABaldrati commented 3 weeks ago

Great! Thank you so much for your detailed explanation!