BB88Lee closed this issue 2 years ago
Thanks for your question.
BEVFormer is more concerned with the design of the BEV encoder. Therefore, we omit the structure of the detection head in Figure 2. Specifically, our detection head is almost the same as the decoder of deformable DETR, and the object query and BEV features interact through cross-attention.
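To make the interaction concrete, here is a minimal NumPy sketch of dense single-head cross-attention between object queries and flattened BEV features. This is an illustration of the mechanism only, not BEVFormer's actual implementation: the real head uses a Deformable-DETR-style decoder with sparse, deformable sampling rather than dense attention, multi-head projections, and learnable query embeddings. Shapes follow the paper (900 object queries, embedding dim 256), but the BEV grid is downscaled here from 200 x 200 to 50 x 50 to keep the example fast.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(obj_queries, bev_feats):
    """Single-head dense cross-attention: queries come from the detection
    head, keys/values come from the BEV features the encoder produced.
    (BEVFormer itself uses deformable attention, which samples only a few
    reference points instead of attending to every BEV cell.)"""
    d = obj_queries.shape[-1]
    scores = obj_queries @ bev_feats.T / np.sqrt(d)  # (900, H*W)
    attn = softmax(scores, axis=-1)                  # rows sum to 1
    return attn @ bev_feats                          # (900, d)

rng = np.random.default_rng(0)
object_queries = rng.normal(size=(900, 256))     # learnable in practice
bev_features = rng.normal(size=(50 * 50, 256))   # flattened BEV grid
out = cross_attention(object_queries, bev_features)
print(out.shape)  # (900, 256): one updated feature per object query
```

Each object query thus aggregates BEV features from the whole grid (or, with deformable attention, from a few sampled locations), and a feed-forward head then regresses boxes from the updated queries.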
Got it, thanks!
Hello, could you please provide some details of the detection head? Are the queries in the detection head also learnable, as in one-stage Deformable DETR? And how many queries are in the detection head?
Thanks for sharing this great work!
I have questions about the BEV queries and the object queries (for detection). As mentioned in the paper, the number of BEV queries is 200 x 200, and the number of object queries is 900. Also, 1x spatial cross-attention and 6x temporal self-attention are the attention layers that work with the BEV queries to generate the current frame's BEV queries.
Where are the cross-attention layers between the BEV queries and the object queries? Are they in the detection head, and are there still 6 cross-attention layers for the interaction between the BEV queries and the object queries?