chaytonmin / UniScene

Official implementation of our RAL'24 paper: Multi-Camera Unified Pre-training for Autonomous Driving
MIT License

Have you tried Occ-BEV pretraining in BEVDet-OCC? #6

Open SeaBird-Go opened 1 year ago

SeaBird-Go commented 1 year ago

Hi, thanks for sharing this work.

When I pretrained BEVDet-OCC by predicting binary occupancy and then finetuned the model, the results showed almost no improvement over finetuning the model initialized with the pretrained ResNet-50 backbone.

I don't know why. I noticed that you ran the occupancy prediction experiments on BEVStereo with a stronger backbone and a larger image size (256x704 in my setting), so I wonder whether the backbone and image size could have a decisive effect on the pretraining performance.

chaytonmin commented 1 year ago

  1. The backbone and image size probably do not affect the pretraining performance. The Occ labels must be obtained by fusing point clouds from multiple frames; a single-frame point cloud is too sparse to bring an improvement.
  2. BEVDet-OCC uses the pre-trained model from BEVDet. For a fair comparison, you should also initialize from this pre-trained model when training binary occupancy prediction.
  3. Focal loss works better than cross-entropy.
  4. Semantic Occ is more conducive to pre-training.
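To illustrate the focal loss suggestion, here is a minimal sketch of a binary focal loss next to plain binary cross-entropy, written in pure Python for readability. The function names and the defaults `alpha=0.25` and `gamma=2.0` are assumptions for illustration; the thread does not say which settings were used in Occ-BEV.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over a flat list of voxel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        p_t = p if t == 1 else 1.0 - p   # probability assigned to the true class
        total += -math.log(p_t)
    return total / len(probs)


def binary_focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t)**gamma.

    alpha and gamma are illustrative defaults, not values from this thread.
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        p_t = p if t == 1 else 1.0 - p
        alpha_t = alpha if t == 1 else 1.0 - alpha
        # Well-classified voxels (p_t close to 1) are down-weighted.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


# Toy example: mostly free voxels (label 0) that are already easy to classify,
# plus one occupied voxel (label 1) that the model gets wrong.
probs = [0.05, 0.02, 0.10, 0.03, 0.20]
targets = [0, 0, 0, 0, 1]
print("cross-entropy:", binary_cross_entropy(probs, targets))
print("focal loss:   ", binary_focal_loss(probs, targets))
```

Because most voxels in a driving scene are free, the easy free voxels dominate a plain cross-entropy average. The `(1 - p_t)**gamma` factor shrinks their contribution, so the hard occupied voxels account for a larger share of the loss.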

SeaBird-Go commented 1 year ago

Thanks for your detailed explanation, and sorry for the late reply; I have been busy these days.

  1. I understand that the occupancy GT should be obtained from multi-sweep point clouds. BEVDet-OCC uses the semantic occupancy GT from the CVPR 2023 occupancy challenge, so I treated every voxel that does not belong to the free category as occupied. That is how I obtained the binary occupancy GT.
  2. I know that BEVDet-OCC uses the pre-trained BEVDet model for initialization. For the sake of fairness, I loaded only the pre-trained ResNet-50 model to initialize the backbone and then finetuned on semantic occupancy prediction. This gave an mIoU of 34.01.
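The conversion in item 1 can be sketched as follows. This is a minimal pure-Python illustration, not the code used in the experiment: `FREE_LABEL = 17` is an assumption about the index of the free category in the challenge labels, and `semantic_to_binary` is a hypothetical helper name. In practice the labels would be a NumPy array or tensor rather than a flat list.

```python
# Assumption: the free category has index 17 in the semantic occupancy labels.
FREE_LABEL = 17


def semantic_to_binary(semantic_labels, free_label=FREE_LABEL):
    """Map each semantic voxel label to 1 (occupied) or 0 (free)."""
    return [0 if label == free_label else 1 for label in semantic_labels]


# Toy example: a flattened run of voxels with mixed semantic classes.
semantic = [17, 0, 5, 17, 16, 17]
binary = semantic_to_binary(semantic)
occupied_ratio = sum(binary) / len(binary)
print(binary, occupied_ratio)
```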

For the pretraining case, I also initialized the backbone with the pre-trained ResNet-50 model, pretrained BEVDet-OCC with binary occupancy prediction, and then finetuned on semantic occupancy prediction. This gave an mIoU of 34.79.

The performance improved only slightly (+0.78 mIoU). I'm not sure what the problem is.

chaytonmin commented 1 year ago

This result is normal. Occ-BEV does not improve scene completion tasks as much as it improves 3D detection, and the first version of my paper reported a similar result.

zhanghm1995 commented 1 year ago

@chaytonmin So what key changes did you make to achieve the 3.14% improvement in the latest version of your paper?

chaytonmin commented 1 year ago

TTA (test-time augmentation).
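For readers unfamiliar with the term, the general idea of test-time augmentation can be sketched as below. The thread does not say which augmentations were used, so the horizontal flip here is only an assumption, and `toy_model`, `hflip` and `predict_with_flip_tta` are hypothetical names on a toy 2D grid, not part of the UniScene code.

```python
def hflip(grid):
    """Flip a 2D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]


def predict_with_flip_tta(model, grid):
    """Average the prediction on the input with the prediction on its flipped copy.

    The second prediction is flipped back so that both are spatially aligned.
    """
    pred = model(grid)
    pred_from_flipped = hflip(model(hflip(grid)))
    return [
        [(a + b) / 2.0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(pred, pred_from_flipped)
    ]


def toy_model(grid):
    """Stand-in for a network: deliberately depends on the column position."""
    return [[v * (0.5 + 0.1 * j) for j, v in enumerate(row)] for row in grid]


grid = [[1.0, 1.0], [2.0, 0.0]]
print(predict_with_flip_tta(toy_model, grid))
```

Averaging over augmented views reduces the position-dependent bias of a single forward pass, at the cost of one extra inference per augmentation.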