Nightmare-n / UniPAD

UniPAD: A Universal Pre-training Paradigm for Autonomous Driving (CVPR 2024)
https://arxiv.org/abs/2310.08370
Apache License 2.0

Question about fusion pretrain? #12

Closed hottruong98 closed 3 months ago

hottruong98 commented 3 months ago

Hi, thank you for your amazing work.

As indicated in your project config files, the weights for fine-tuning the fusion-based model are obtained by "merging the weights of uvtr_lidar and uvtr_cam". From that statement, uvtr_lidar and uvtr_cam are each trained individually in the pretraining phase, and you simply combine the state_dicts of those two pretrained models to initialize the fusion-based model for fine-tuning.
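
For concreteness, I imagine the merge step is roughly like the following (just my own sketch, not your actual script; the checkpoint file names are placeholders):

```python
import torch

# Placeholder checkpoint paths; the actual file names in the repo may differ.
cam_ckpt = torch.load("uvtr_cam_pretrain.pth", map_location="cpu")
lidar_ckpt = torch.load("uvtr_lidar_pretrain.pth", map_location="cpu")

# The camera and LiDAR branches use different module prefixes (e.g. img_* vs pts_*),
# so their state_dict keys should be disjoint and a plain union should suffice.
merged = dict(cam_ckpt["state_dict"])
merged.update(lidar_ckpt["state_dict"])

# Save the merged weights as the initialization for the fusion model (uvtr-m).
torch.save({"state_dict": merged}, "uvtr_fusion_init.pth")
```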

But when I first read your paper, I thought the fusion weights were obtained by training the LiDAR and camera branches simultaneously (as indicated in your framework figure), which is quite different from the implementation above.

This confuses me, so I want to check that I am understanding your idea correctly. Is "camera-based pretrained weights + LiDAR-based pretrained weights = fusion-based pretrained weights" also what you mean in the paper?

Thank you so much in advance.

Nightmare-n commented 3 months ago

Hi, thanks for your interest.

The official training pipeline of uvtr-m is: scratch uvtr-c, scratch uvtr-l -> finetune uvtr-m, where uvtr-m uses the trained weights of scratch uvtr-c and scratch uvtr-l for initialization.

For our pre-training setting, the pipeline is: pretrain uvtr-c, pretrain uvtr-l -> finetune uvtr-c, finetune uvtr-l -> finetune uvtr-m, where uvtr-m likewise uses the trained weights of the fine-tuned uvtr-c and uvtr-l for initialization.

There do exist other pretraining pipelines, e.g. pretrain uvtr-m -> finetune uvtr-m, or pretrain uvtr-m -> finetune uvtr-c, finetune uvtr-l -> finetune uvtr-m. However, they would cost more training memory and computation, so we did not carefully verify them (these pipelines can be implemented straightforwardly by merging the configs of uvtr_lidar_vs0.075_pretrain.py and uvtr_cam_vs0.075_pretrain.py).
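
As a rough illustration only (this file does not exist in the repo, and the exact keys must be copied from the two original configs), such a merged multi-modal pretrain config could look like:

```python
# uvtr_multimodal_vs0.075_pretrain.py -- hypothetical merged config, not a file in the repo.
# Inherit the LiDAR pretrain config and graft the camera-side components on top;
# the model keys below (img_backbone, img_neck, ...) are illustrative and should be
# copied from uvtr_cam_vs0.075_pretrain.py.
_base_ = ['./uvtr_lidar_vs0.075_pretrain.py']

model = dict(
    img_backbone=dict(type='ResNet', depth=101),  # placeholder; copy from the camera config
    img_neck=dict(type='FPN', in_channels=[256, 512, 1024, 2048],
                  out_channels=256, num_outs=4),  # placeholder; copy from the camera config
)
```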

hottruong98 commented 3 months ago

Thank you for your response. You are right, pretraining uvtr-m is quite expensive. May I ask one last question? Given your framework, do you think we would ultimately get similar performance using the [pretrain uvtr-m -> finetune uvtr-m] pipeline instead of your current setting [pretrain uvtr-c, pretrain uvtr-l -> finetune uvtr-c, finetune uvtr-l -> finetune uvtr-m]?

Nightmare-n commented 3 months ago

For the baseline uvtr model, scratch uvtr-c, scratch uvtr-l -> finetune uvtr-m performs better than scratch uvtr-m, so I think pretrain uvtr-c, pretrain uvtr-l -> finetune uvtr-c, finetune uvtr-l -> finetune uvtr-m would also perform better than pretrain uvtr-m -> finetune uvtr-m.

hottruong98 commented 3 months ago

Okay, got it. Thank you again for your explanation. Have a nice day ~