[ECCV 2022] This is the official implementation of BEVFormer, a camera-only framework for autonomous driving perception, e.g., 3D object detection and semantic map segmentation.
Hi, Zhiqi.
To save GPU memory, you divide the BEV queries (e.g., 50×50) into num_camera (e.g., 6) groups with a shape such as (6, 610, 256), and then feed them into the sampling_offsets function. https://github.com/zhiqi-li/BEVFormer/blob/5d42632256c64742f74d8b1c68a3407dd2f81305/projects/mmdet3d_plugin/bevformer/modules/spatial_cross_attention.py#L249 This means each BEV query produces num_heads * num_levels * num_points (x, y) offsets (e.g., 8 heads, 3 levels, 8 points). For most reference points, which project onto only one camera, this is fine. However, in the overlap regions some BEV queries (or their reference points) hit two views, and for these queries the offsets produced by the linear layer (i.e., sampling_offsets) are identical for both views. I think the attention weights are also identical, because we do not predict separate offsets or weights per camera for each BEV query. Does that make sense, or is there a special reason for doing it this way?
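To make the point concrete, here is a minimal sketch in plain PyTorch (not the repo's exact code; the shapes and the queries_rebatch name just follow the example numbers above). Because the offsets depend only on the query feature and not on any camera index, a query that falls into two camera groups gets identical offsets in both:

```python
import torch
import torch.nn as nn

num_cams, max_len, embed_dims = 6, 610, 256
num_heads, num_levels, num_points = 8, 3, 8

# One linear layer predicts all offsets from the query feature alone:
# num_heads * num_levels * num_points (x, y) pairs per query.
sampling_offsets = nn.Linear(embed_dims, num_heads * num_levels * num_points * 2)

# queries_rebatch: the per-camera groups of BEV queries described above (assumed shape).
queries_rebatch = torch.randn(num_cams, max_len, embed_dims)

# A BEV query in an overlap region appears in two camera slots; emulate that by
# copying one query feature into both camera 0 and camera 1.
queries_rebatch[1, 0] = queries_rebatch[0, 0]

offsets = sampling_offsets(queries_rebatch).view(
    num_cams, max_len, num_heads, num_levels, num_points, 2)

# The two views receive identical sampling offsets for this query.
print(torch.allclose(offsets[0, 0], offsets[1, 0]))  # True
```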
In fact, in my earlier reproduction I predicted offsets on each camera for every BEV query. To save GPU memory, I also selected only the mapped BEV queries for attention, but the offsets and weights were predicted for all views; I then used only the values (offsets and weights) corresponding to the views each query maps onto for the attention computation. My final performance was not good. I am not sure whether there are other bugs, but I am curious whether the point above is a contributing factor.
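For reference, this is roughly what I mean, sketched under assumed names (per_cam_offsets and bev_mask are hypothetical, not identifiers from the repo): a per-camera offset head whose outputs are then gathered only for the views each query actually maps onto.

```python
import torch
import torch.nn as nn

num_cams, num_query, embed_dims = 6, 2500, 256
num_heads, num_levels, num_points = 8, 3, 8

# Output dimension is multiplied by num_cams so every view gets its own offsets.
per_cam_offsets = nn.Linear(
    embed_dims, num_cams * num_heads * num_levels * num_points * 2)

bev_queries = torch.randn(num_query, embed_dims)
offsets = per_cam_offsets(bev_queries).view(
    num_query, num_cams, num_heads, num_levels, num_points, 2)

# bev_mask[c, q] == True when query q projects into camera c (assumed name).
bev_mask = torch.rand(num_cams, num_query) > 0.7

# For each camera, keep only the offsets of the queries mapped onto it,
# mirroring the memory-saving selection described above.
for cam in range(num_cams):
    hit = bev_mask[cam].nonzero(as_tuple=True)[0]
    cam_offsets = offsets[hit, cam]  # (num_hit, num_heads, num_levels, num_points, 2)
    print(cam, cam_offsets.shape)
```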
Thanks.