fundamentalvision / BEVFormer

[ECCV 2022] This is the official implementation of BEVFormer, a camera-only framework for autonomous driving perception, e.g., 3D object detection and semantic map segmentation.
https://arxiv.org/abs/2203.17270
Apache License 2.0

Questions about BEV query and object query. #2

Closed BB88Lee closed 2 years ago

BB88Lee commented 2 years ago

Thanks for sharing this great work!

I have questions about the BEV queries and the object queries (for detection). As mentioned in the paper, the number of BEV queries is 200 x 200, and the number of object queries is 900. Also, one spatial cross-attention and six temporal self-attention layers are the attention layers that operate on BEV queries to generate the current frame's BEV queries.

Where are the cross-attention layers between the BEV queries and the object queries? Are they in the detection head, and are there also 6 cross-attention layers for the interaction between BEV queries and object queries?

zhiqi-li commented 2 years ago

Thanks for your question.

BEVFormer is more concerned with the design of the BEV encoder, so we omit the structure of the detection head in Figure 2. Specifically, our detection head is almost the same as the decoder of Deformable DETR, and the object queries and BEV features interact through cross-attention.
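The interaction described above can be sketched as a DETR-style decoder layer in which learnable object queries cross-attend to the flattened BEV feature map. This is only a minimal illustration, not the actual BEVFormer head: it uses plain multi-head attention as a stand-in for the deformable attention of the real decoder, invented module/variable names, and a toy 50 x 50 BEV grid in place of the 200 x 200 grid mentioned in the thread.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration (the thread mentions a 200 x 200 BEV grid
# and 900 object queries; we shrink the grid here to keep memory small).
BEV_H, BEV_W, EMBED_DIM, NUM_OBJ_QUERIES = 50, 50, 256, 900


class ToyDecoderLayer(nn.Module):
    """One DETR-style decoder layer (hypothetical sketch):
    object queries self-attend, then cross-attend to the flattened
    BEV features. Plain attention stands in for deformable attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, obj_q: torch.Tensor, bev: torch.Tensor) -> torch.Tensor:
        # Self-attention among object queries.
        q = self.norms[0](obj_q + self.self_attn(obj_q, obj_q, obj_q)[0])
        # Cross-attention: object queries (query) vs. BEV features (key/value).
        q = self.norms[1](q + self.cross_attn(q, bev, bev)[0])
        return self.norms[2](q + self.ffn(q))


# Learnable object queries, as in DETR-style heads.
obj_queries = nn.Embedding(NUM_OBJ_QUERIES, EMBED_DIM)
# Stand-in for the BEV encoder output, flattened to (1, H*W, C).
bev_feats = torch.randn(1, BEV_H * BEV_W, EMBED_DIM)

layer = ToyDecoderLayer(EMBED_DIM)
out = layer(obj_queries.weight.unsqueeze(0), bev_feats)
print(out.shape)  # one refined query embedding per object query
```

The output keeps the shape `(1, 900, 256)`: each of the 900 object queries is refined by attending to the BEV features, and a classification/regression head on top of these embeddings would produce the 3D boxes.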

BB88Lee commented 2 years ago

Got it, thanks!

exiawsh commented 2 years ago

> Thanks for your question.
>
> BEVFormer is more concerned with the design of the BEV encoder. Therefore, we omit the structure of the detection head in Figure 2. Specifically, our detection head is almost the same as the decoder of Deformable DETR, and the object query and BEV features interact through cross-attention.

Hello, could you please provide some details of the detection head? Are the queries in the detection head also learnable, just like in one-stage Deformable DETR? And how many queries are there in the detection head?