TRAILab / CaDDN

Categorical Depth Distribution Network for Monocular 3D Object Detection (CVPR 2021 Oral)
Apache License 2.0

Multiple camera fusion #78

Closed taylover-pei closed 2 years ago

taylover-pei commented 2 years ago

Thanks for your great work!

I want to feed four camera images (front, left, right, back) into the network at the same time to obtain full BEV features based on your code. How should I modify your code to implement a multi-camera fusion module? Can you give me some suggestions?

Thank you very much!

codyreading commented 2 years ago

Sure, this is technically feasible, but it will require a lot of compute for CaDDN in its current form. You will likely have to reduce the resolution of the frustum/voxel grids in order to fit this on a GPU.

You will need to run the frustum feature network on each camera view to generate independent frustum features per view. You will likely want a separate network for each view, but you can start by sharing weights across the frustum feature networks.
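A minimal sketch of the shared-weight variant of this step. `FrustumFeatureNet` here is a hypothetical stand-in for CaDDN's frustum feature network (which predicts image features and a categorical depth distribution); the point is only that one module is applied independently to every view:

```python
import torch
import torch.nn as nn

class FrustumFeatureNet(nn.Module):
    """Toy stand-in for the frustum feature network: maps an image to a
    frustum feature grid of shape (B, C, D, H, W)."""
    def __init__(self, channels=8, depth_bins=4):
        super().__init__()
        self.channels = channels
        self.depth_bins = depth_bins
        self.conv = nn.Conv2d(3, channels * depth_bins, kernel_size=1)

    def forward(self, image):
        b, _, h, w = image.shape
        feat = self.conv(image)
        # Unflatten channels into (feature channels, depth bins).
        return feat.view(b, self.channels, self.depth_bins, h, w)

# One shared-weight network applied to each camera view.
net = FrustumFeatureNet()
views = {k: torch.rand(1, 3, 16, 16) for k in ("front", "left", "right", "back")}
frustums = {k: net(img) for k, img in views.items()}
```

Swapping to separate per-view networks later is then just a matter of instantiating one `FrustumFeatureNet` per view instead of reusing `net`.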

There will only be one voxel grid; for each voxel, you project its center into all frustum views to extract the relevant frustum features. In most cases, each voxel will project into only one frustum view (within the FOV of that camera), so you can simply extract that feature. When a voxel projects into two different views, you can average the two features to populate the voxel feature.
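The projection-and-average step could look something like the sketch below. The `project_fns` interface is an assumption standing in for CaDDN's frustum-to-voxel projection (camera intrinsics/extrinsics plus depth-bin discretization); it is hypothetical, not the repo's actual API:

```python
import torch

def fuse_voxel_features(voxel_centers, frustum_feats, project_fns):
    """Populate each voxel by projecting its center into every camera
    frustum and averaging the features from all views that see it.

    voxel_centers: (N, 3) voxel centers in the world frame.
    frustum_feats: dict view -> (C, D, H, W) frustum features.
    project_fns:   dict view -> fn mapping (N, 3) points to (N, 3) long
                   frustum indices (d, v, u) and an (N,) bool validity
                   mask (hypothetical interface).
    """
    n = voxel_centers.shape[0]
    c = next(iter(frustum_feats.values())).shape[0]
    fused = torch.zeros(n, c)
    hits = torch.zeros(n)
    for view, feats in frustum_feats.items():
        idx, valid = project_fns[view](voxel_centers)
        d, v, u = idx[valid].unbind(dim=1)
        fused[valid] += feats[:, d, v, u].t()   # gather (K, C) features
        hits[valid] += 1
    # Average where multiple views see the voxel; voxels seen by no view
    # stay zero.
    fused[hits > 0] /= hits[hits > 0].unsqueeze(1)
    return fused
```

This accumulates features from every view whose FOV contains the voxel center, so the single-view case and the overlapping-view (averaged) case fall out of the same code path.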

Once you've constructed the voxel grid, the collapse to the BEV grid is unchanged.
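For completeness, the unchanged collapse step amounts to folding the vertical axis of the voxel grid into the channel dimension (CaDDN then runs convolutions over the result); the shapes here are illustrative:

```python
import torch

# Voxel grid of shape (B, C, Z, Y, X); Z is the vertical axis.
voxel_grid = torch.rand(1, 8, 4, 16, 16)
b, c, z, y, x = voxel_grid.shape

# Collapse to BEV by stacking the Z slices along channels: (B, C*Z, Y, X).
bev = voxel_grid.view(b, c * z, y, x)
```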