Several questions on your backbone implementation(`BaseLSSFPN` and `FusionLSSFPN`)

Francis777 commented 1 year ago

Hello BEVDepth authors, I have the following questions when looking at your backbone implementation:

1. It looks like the only major difference between `FusionLSSFPN` and `BaseLSSFPN` is how depth is obtained: `BaseLSSFPN` predicts depth from image features plus camera calibration input, while `FusionLSSFPN` takes lidar GT depth as input. Instead of using that depth directly, though, it is further processed by `depth_gt_conv` -> `depth_conv` -> `aspp` -> `depth_pred` (https://github.com/Megvii-BaseDetection/BEVDepth/blob/d9d9639794539dcddbf3c1defd3d5f043d4a20e5/bevdepth/layers/backbones/fusion_lss_fpn.py#L64). What is the reason behind this? More generally, using lidar GT depth as input means the model is no longer camera-only, so what is this backbone used for?
2. The paper mentions adding an SE-like layer for camera-aware depth prediction, but in `BaseLSSFPN` a similar layer is also applied to the image features (https://github.com/Megvii-BaseDetection/BEVDepth/blob/d9d9639794539dcddbf3c1defd3d5f043d4a20e5/bevdepth/layers/backbones/base_lss_fpn.py#L249). What is the benefit of making image features "camera-aware"? I guess it probably doesn't affect the result very much, but it would still be good to know whether it has some influence.
3. Could you elaborate on the `DepthAggregation` module? It is not mentioned in the paper, so I would appreciate a reference or an explanation of why adding it helps.

Look forward to your reply!

p.s. The "Delving into Depth Prediction in Lift-splat" analysis in your paper v2 is super interesting, thanks for making the update!

yinchimaoliang commented 1 year ago

Hello, thanks for your attention. For the first question, it is more of an experimental setting. We use the fusion experiments to see how much of a performance boost we get by introducing a lidar depth prior. It also shows that, with little modification, BEVDepth can be turned from a camera-only model into a fusion model.

For the second question, it indeed doesn't have much influence. However, since context and depth are multiplied together later in the pipeline, it makes sense to make the context feature camera-aware as well.
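To make the point about context and depth being multiplied concrete, here is a minimal, framework-free sketch for a single pixel. It is not the BEVDepth implementation: the function names (`se_gate`, `lift_pixel`), the toy vectors, and the idea of gating the depth values directly are all illustrative assumptions, and the real backbone has further convolution layers between the gate and the depth logits. It only shows that an SE-style, camera-derived gate can be applied to both branches before their outer product.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    # Numerically stable softmax over the depth bins.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def se_gate(features, cam_embedding):
    # SE-style gating: scale each channel by a sigmoid of a
    # camera-derived embedding (hypothetical values here).
    return [f * sigmoid(g) for f, g in zip(features, cam_embedding)]


def lift_pixel(context, depth_feat, cam_ctx, cam_depth):
    # Gate both branches with camera-aware weights (assumed layout).
    context = se_gate(context, cam_ctx)
    depth_prob = softmax(se_gate(depth_feat, cam_depth))
    # Outer product: every context channel is spread over all depth bins.
    return [[c * d for d in depth_prob] for c in context]


# Toy inputs: 3 context channels, 4 depth bins.
context = [0.5, -1.0, 2.0]
depth_feat = [0.1, 2.0, -0.3, 0.7]
cam_ctx = [0.0, 1.0, -1.0]
cam_depth = [0.5, 0.5, 0.5, 0.5]

frustum = lift_pixel(context, depth_feat, cam_ctx, cam_depth)
```

Because the depth probabilities sum to 1, each row of `frustum` sums to the gated context value for that channel, so any camera-dependent scaling of the context carries through to every depth bin of the lifted feature.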

For the third question, `img_feat_with_depth` is not robust enough on its own, so we add some extra layers to extract features from it further.

Zhangwenyao1 commented 1 year ago

Hi, have you solved this problem? If I want to compare the methods, which Python file should I run (the one in fusion or the one in mv)?

Several questions on your backbone implementation(`BaseLSSFPN` and `FusionLSSFPN`) #120