A few questions on details

Divadi commented 11 months ago

Thank you for releasing this amazing work! I just had a couple questions on some of the camera-only outdoor details @Nightmare-n

Are the same 6 images used for generating the 3D voxel grid and rendering (as is mentioned to be done for ScanNet in PonderV2)?
Were the used ConvNeXt(V1?) backbones trained from scratch or with IN1k?
Was any data augmentation used for the 2D training stage, besides regular MAE masking?
When using the proposed depth-aware sampling, are the same 512 rays, sampled from the pixels with available LiDAR points, used for both color & depth rendering?
In Table 8g, there appear to be trainable weights associated with the view transformation stage. Does the view transformation generally follow UVTR(?) with multi-scale sampling and depth weighting? Or perhaps single-scale?
In PonderV2, supplementary is mentioned. Would it be possible for this to be made public?
What (total) batch size was used?

Apologies for the list of questions, but I'm really interested in the work. Again, thank you so much in advance!

Nightmare-n commented 11 months ago

Hi, thanks for your interest ans question!

Yes, we use six multi-view images in nuScenes to generate 3D voxel grids.
We use ConvNeXt-V1 pre-trained on IN1k.
We only use MAE masking, as we do not observe obvious improvements when using other data augmentation like flip and resize.
Yes, we sample 512 rays for both color and depth rendering.
The view transformation contains multi-scale sampling and depth weighting, consistent with UVTR.
This seems to be a typo, which is copied directly from PonderV1.
We use 1 batch size per GPU and 4 GPUs for the camera-based model, which is the same as UVTR.

Divadi commented 10 months ago

@Nightmare-n Thank you so much for your earlier response. I think this is a strong work, so I've been trying to reproduce it! Unfortunately my results are not as good, and I wanted to get your guidance on some implementation details.

Specifically, I am trying to create this result:

Here is what I've tried.

Reproducing UVTR-C (Baseline)

First, I tried loading in ConvNeXt-S weights as in mmdetection, which are from the official repo. Training for 12 epochs on load_interval=2, it achieves 20.9/22.4 for NDS/mAP. I tried removing augmentations and got 22.7/25.1, which is still lower NDS than shown. I based my code on UVTR camera base, adding layer-wise LR decay.

Next, I saw you reference SparK pre-training, so I tried their pre-trained ConvNeXt-S weights. I also increased unified volume size to 180x180x5 (it's not mentioned what volume size you use for UVTR). This now gets 24.8/28.3 - NDS is still worse, but mAP is now too good...

Could I get guidance on the exact settings you used for this?

Reproducing UniPAD

I load in the SparK ConvNeXt-S checkpoint, then mask 0.3 and do 2D masked backbone as in SparK.
Then, aggregate the 6 image views into the 180x180x5 volume as in Camera-UVTR (no lidar info). 3 conv-bn-relu 3D CNNs, then CNN to 32 channels.
For each image view, I sample 512 pixels with valid depth (from 10 past LiDAR sweeps).
Ray sample 96 points uniformly from 1m to 64m. (No importance sampling)
Grid sample features with align_corners, input into NeuS SDF with 10-freq XYZ (-1 ~ 1 norm) posemb, 6 layers, 32 hidden, get SDF + 32 feats.
Put XYZ, volume feats, 32 SDF feats, 4-freq view, SDF gradient into 4-layer color network to get RGB.
Render RGB & Depth, supervise w/ RGB weight 10, Depth weight 10. Notably, I just directly do: between consecutive points and don't do the special NeuS gradient * viewdir thing, because I felt that without eikonal loss, there are less guarantees on gradient magnitude.
For points outside of volume, their density & RGB are just set to 0.

With this, training 6 epochs (I know this shouuld be 12, but for faster experiments) with load_interval=1, no augmentations, then fine-tuning UVTR (removing the 3 conv-bn-relu 3D CNNs), I get 26.9/31.7. Compared to 24.8/28.3 baseline, it's a +2.1/+3.4 improvement, but it's really far from the +7.7/+9.6 in paper. Notably, NDS doesn't improve too much...

EDIT: Trained for 12 epochs, achieves just 27.2/32.1

Do you have any ideas on what I can fix or what I may be missing?

Thank you so much in advance! If you are okay with having a more in-depth discussion, please feel free to email me directly as well.

Nightmare-n commented 10 months ago

Hi, the code is released!

Nightmare-n / UniPAD

A few questions on details #2

Reproducing UVTR-C (Baseline)

Reproducing UniPAD