Nightmare-n / UniPAD

UniPAD: A Universal Pre-training Paradigm for Autonomous Driving (CVPR 2024)
https://arxiv.org/abs/2310.08370
Apache License 2.0

A few questions on details #2

Closed: Divadi closed this issue 6 months ago

Divadi commented 11 months ago

Thank you for releasing this amazing work! I just had a couple of questions on some of the camera-only outdoor details @Nightmare-n

  1. Are the same 6 images used for generating the 3D voxel grid and rendering (as is mentioned to be done for ScanNet in PonderV2)?
  2. Were the ConvNeXt (V1?) backbones trained from scratch or initialized with IN1k weights?
  3. Was any data augmentation used for the 2D training stage, besides regular MAE masking?
  4. When using the proposed depth-aware sampling, are the same 512 rays, sampled from the pixels with available LiDAR points, used for both color & depth rendering?
  5. In Table 8g, there appear to be trainable weights associated with the view transformation stage. Does the view transformation generally follow UVTR, with multi-scale sampling and depth weighting? Or is it single-scale?
  6. PonderV2 mentions supplementary material. Would it be possible for this to be made public?
  7. What (total) batch size was used?

Apologies for the list of questions, but I'm really interested in the work. Again, thank you so much in advance!

Nightmare-n commented 11 months ago

Hi, thanks for your interest and questions!

  1. Yes, we use six multi-view images in nuScenes to generate 3D voxel grids.
  2. We use ConvNeXt-V1 pre-trained on IN1k.
  3. We only use MAE masking, as we did not observe obvious improvements from other data augmentations such as flipping and resizing.
  4. Yes, we sample the same 512 rays for both color and depth rendering (a minimal illustration follows this list).
  5. The view transformation contains multi-scale sampling and depth weighting, consistent with UVTR.
  6. This seems to be a typo that was copied directly from PonderV1.
  7. We use a batch size of 1 per GPU with 4 GPUs for the camera-based model, the same as UVTR.
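
For item 4, here is a minimal illustration of depth-aware sampling, i.e., drawing the same 512 rays only at pixels with a projected LiDAR depth and reusing them for both the color and depth targets. The function and tensor names are illustrative, not the exact released implementation:

```python
import torch

def sample_rays_depth_aware(depth_map, rgb, num_rays=512):
    """Sample ray pixels restricted to locations with projected LiDAR depth.

    depth_map: (H, W) sparse depth, 0 where no LiDAR point projects.
    rgb:       (3, H, W) image providing the color targets.
    Returns pixel coords plus color/depth targets for the same set of rays.
    """
    valid = (depth_map > 0).nonzero(as_tuple=False)        # (N, 2) pixels with LiDAR depth
    choice = torch.randperm(valid.shape[0])[:num_rays]     # random subset of valid pixels
    coords = valid[choice]                                  # (num_rays, 2) as (y, x)
    depth_target = depth_map[coords[:, 0], coords[:, 1]]    # supervises depth rendering
    color_target = rgb[:, coords[:, 0], coords[:, 1]].t()   # supervises color rendering
    return coords, color_target, depth_target
```
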
Divadi commented 10 months ago

@Nightmare-n Thank you so much for your earlier response. I think this is strong work, so I've been trying to reproduce it! Unfortunately, my results are not as good, and I wanted to get your guidance on some implementation details.

Specifically, I am trying to reproduce this result: [image of the reported results]

Here is what I've tried.

Reproducing UVTR-C (Baseline)

First, I tried loading ConvNeXt-S weights as in mmdetection, which come from the official repo. Training for 12 epochs with load_interval=2, it achieves 20.9/22.4 NDS/mAP. Removing augmentations gives 22.7/25.1, which is still a lower NDS than reported. I based my code on the UVTR camera-base config, adding layer-wise LR decay (see the sketch below).
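
For reference, this is roughly how I build the layer-wise LR decay parameter groups in plain PyTorch; the `stages` attribute, the decay rate, and the head name in the usage comment are my own choices, not taken from your code:

```python
import torch

def build_llrd_param_groups(backbone, base_lr=2e-4, decay=0.9, weight_decay=0.01):
    """Give deeper ConvNeXt stages a larger LR (layer-wise LR decay).

    Assumes `backbone.stages` is an iterable of the 4 ConvNeXt stages; any
    remaining parameters (stem, norms) can be appended with the smallest LR.
    """
    num_stages = len(backbone.stages)
    groups = []
    for i, stage in enumerate(backbone.stages):
        scale = decay ** (num_stages - i)  # stage 0 (earliest) gets the smallest LR
        groups.append({
            "params": [p for p in stage.parameters() if p.requires_grad],
            "lr": base_lr * scale,
            "weight_decay": weight_decay,
        })
    return groups

# optimizer = torch.optim.AdamW(
#     build_llrd_param_groups(model.backbone)
#     + [{"params": model.pts_bbox_head.parameters(), "lr": 2e-4}])
```
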

Next, since you reference SparK pre-training, I tried their pre-trained ConvNeXt-S weights (loaded roughly as sketched below). I also increased the unified volume size to 180x180x5 (the volume size you use for UVTR is not mentioned). This now gets 24.8/28.3: NDS is still worse than reported, but mAP is now higher than the reported baseline...
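
This is how I load the SparK checkpoint into the backbone; the file name and the key prefixes are guesses about the checkpoint layout, so I rely on `strict=False` and check the missing/unexpected keys:

```python
import torch

def load_spark_convnext(backbone, ckpt_path="convnext_small_spark.pth"):
    """Load SparK-pretrained ConvNeXt-S weights into the detector backbone."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # The checkpoint may nest the weights under 'state_dict' or 'module'.
    state_dict = ckpt.get("state_dict", ckpt.get("module", ckpt))
    # Strip a possible encoder/wrapper prefix so keys line up with the backbone.
    cleaned = {k.split("encoder.", 1)[-1]: v for k, v in state_dict.items()}
    missing, unexpected = backbone.load_state_dict(cleaned, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return backbone
```
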

Could I get guidance on the exact settings you used for this?

Reproducing UniPAD

With this, pre-training for 6 epochs (I know this should be 12, but I wanted faster experiments) with load_interval=1 and no augmentations, then fine-tuning UVTR (removing the 3 conv-bn-relu 3D CNNs; I filter the checkpoint roughly as sketched below), I get 26.9/31.7. Compared to the 24.8/28.3 baseline that is a +2.1/+3.4 improvement, but it is far from the +7.7/+9.6 reported in the paper. Notably, NDS does not improve much...

EDIT: Trained for 12 epochs, achieves just 27.2/32.1
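
For completeness, this is how I filter the pre-training checkpoint before fine-tuning, dropping the rendering decoder and the extra 3D convs; the prefixes are the module names in my own reimplementation, not yours:

```python
import torch

def strip_pretrain_only_weights(ckpt_path, drop_prefixes=("render_head.", "voxel_convs.")):
    """Return a state dict without pre-training-only modules.

    `drop_prefixes` lists the rendering decoder / extra 3D-conv modules that
    are discarded before fine-tuning; only backbone + view transform remain.
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)
    kept = {k: v for k, v in state_dict.items()
            if not any(k.startswith(p) for p in drop_prefixes)}
    return kept

# finetune_model.load_state_dict(
#     strip_pretrain_only_weights("unipad_pretrain_epoch12.pth"), strict=False)
```
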

Do you have any ideas on what I can fix or what I may be missing?

Thank you so much in advance! If you are okay with having a more in-depth discussion, please feel free to email me directly as well.

Nightmare-n commented 10 months ago

Hi, the code has been released!