czczup / ViT-Adapter

[ICLR 2023 Spotlight] Vision Transformer Adapter for Dense Predictions
https://arxiv.org/abs/2205.08534
Apache License 2.0

In deformable attention, why is sampling_offsets.bias initialized as an arithmetic progression and set to no gradient? #111

Open peterant330 opened 1 year ago

peterant330 commented 1 year ago

Hi, this is really cool work, but I have some difficulty understanding this code:

https://github.com/czczup/ViT-Adapter/blob/968f6b008bdc4f84e2a637c986acc139b38e8083/detection/ops/modules/ms_deform_attn.py#L66-L72
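For reference, the initialization in those lines looks roughly like this (a paraphrased sketch following the reference Deformable DETR `MSDeformAttn._reset_parameters`; the dimensions below are placeholders, and the exact code in this repo may differ slightly):

```python
import math
import torch
import torch.nn as nn
from torch.nn.init import constant_

# Placeholder hyperparameters, for illustration only.
d_model, n_heads, n_levels, n_points = 256, 8, 4, 4
sampling_offsets = nn.Linear(d_model, n_heads * n_levels * n_points * 2)

constant_(sampling_offsets.weight.data, 0.)
# One angle per head, evenly spaced over 2*pi.
thetas = torch.arange(n_heads, dtype=torch.float32) * (2.0 * math.pi / n_heads)
# Unit direction per head, scaled so the larger coordinate is 1.
grid_init = torch.stack([thetas.cos(), thetas.sin()], -1)
grid_init = (grid_init / grid_init.abs().max(-1, keepdim=True)[0]) \
    .view(n_heads, 1, 1, 2).repeat(1, n_levels, n_points, 1)
# The i-th sampling point of each head is pushed out to radius (i + 1).
for i in range(n_points):
    grid_init[:, :, i, :] *= i + 1
with torch.no_grad():
    sampling_offsets.bias = nn.Parameter(grid_init.view(-1))
```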

I am curious about the mechanism behind how you initialize sampling_offsets.bias and why it is frozen during training.

czczup commented 1 year ago

sampling_offsets.bias is not frozen during training, because the no_grad here will not take effect.

About the initialization: in simple terms, it places the initial sampling points on a circle around the query point.

You can watch this video for more information about deformable attention.
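If it helps, here is a quick check (my own illustration, not code from this repo) of why the `no_grad` context does not freeze the bias: it only disables autograd tracking for the assignment itself, while the newly assigned `nn.Parameter` still defaults to `requires_grad=True` and keeps receiving gradients afterwards.

```python
import torch
import torch.nn as nn

linear = nn.Linear(4, 4)
with torch.no_grad():
    # Re-assign the bias as a fresh Parameter inside no_grad, as in the init above.
    linear.bias = nn.Parameter(torch.arange(4, dtype=torch.float32))

print(linear.bias.requires_grad)  # True -> the bias is still a trainable parameter
loss = linear(torch.randn(2, 4)).sum()
loss.backward()
print(linear.bias.grad)           # non-None gradient, so the optimizer will update it
```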

peterant330 commented 1 year ago

> sampling_offsets.bias is not frozen during training, because the no_grad here will not take effect.
>
> About the initialization: in simple terms, it places the initial sampling points on a circle around the query point.
>
> You can watch this video for more information about deformable attention.

Thanks for your explanation. I guess you want to make the sampling points form a circle around the query. However, I don't understand why the length of thetas is n_heads rather than n_points, or what the function of the for loop is. If you only have one head but multiple sampling points, then I guess you will have n points that form a line starting from the reference point.
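To illustrate that guess with a small script (my own sketch under the initialization above, using a hypothetical helper and a single feature level): each head gets exactly one direction from `thetas`, and the `for` loop only rescales that direction per point, so with one head the points are collinear, while across heads the points at a fixed index spread around the query.

```python
import math
import torch

def init_offsets(n_heads: int, n_points: int) -> torch.Tensor:
    """Hypothetical helper reproducing the bias pattern for a single level."""
    thetas = torch.arange(n_heads, dtype=torch.float32) * (2.0 * math.pi / n_heads)
    grid = torch.stack([thetas.cos(), thetas.sin()], -1)
    grid = grid / grid.abs().max(-1, keepdim=True)[0]
    grid = grid.view(n_heads, 1, 2).repeat(1, n_points, 1)
    for i in range(n_points):
        grid[:, i, :] *= i + 1
    return grid

# One head, four points: every offset is a multiple of the same direction -> a line.
print(init_offsets(1, 4).squeeze(0))
# Eight heads, one point each: one offset per head, spread around the query point.
print(init_offsets(8, 1).squeeze(1))
```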