Hi, thank you for the nice work and for sharing your code!
I believe that your implementation of BEVFormer has a small bug: https://github.com/aharley/simple_bev/blob/be46f0ef71960c233341852f3d9bc3677558ab6d/nets/bevformernet.py#L296
It looks like the values for the parameters `n_heads` and `n_points` have been swapped compared to the normal initialization: https://github.com/aharley/simple_bev/blob/be46f0ef71960c233341852f3d9bc3677558ab6d/nets/ops/modules/ms_deform_attn.py#L31

See also the original implementation of BEVFormer:
`def __init__(self, embed_dims=256, num_heads=8, num_levels=4, num_points=4,`
https://github.com/fundamentalvision/BEVFormer/blob/20923e66aa26a906ba8d21477c238567fa6285e9/projects/mmdet3d_plugin/bevformer/modules/decoder.py#L160-L164

as well as the Deformable DETR paper:

> M = 8 and K = 4 are set for deformable attentions by default.
> M: number of attention heads
> K: number of sampled keys in each feature level for each attention head
I am not sure how much of a difference it is going to make, but I just wanted to warn other people.
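For anyone who wants to patch this locally, here is a minimal sketch of what I believe the corrected construction would look like, assuming the `MSDeformAttn` signature from the `ms_deform_attn.py` file linked above (`d_model, n_levels, n_heads, n_points`); the `d_model` and `n_levels` values here are placeholders and depend on the model config. Passing keyword arguments makes a positional swap like this impossible:

```python
# Sketch of the corrected construction (not the author's code), assuming
# MSDeformAttn(d_model, n_levels, n_heads, n_points) as in
# nets/ops/modules/ms_deform_attn.py.
from nets.ops.modules.ms_deform_attn import MSDeformAttn

deform_attn = MSDeformAttn(
    d_model=256,  # embedding dimension (placeholder; match the model config)
    n_levels=1,   # number of feature levels (placeholder; match the model config)
    n_heads=8,    # M = 8 attention heads (Deformable DETR default)
    n_points=4,   # K = 4 sampled keys per level per head (Deformable DETR default)
)
```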