It is a trick to stabilize the training of attention modules. It has little influence on the final performance.
Thanks for your reply.
Hi~ Thanks for your excellent work! I'm confused about an operation in the attention weight calculation.
In the implementation of the attention, there is a small modification which I could not find in the paper.
The code is:
Does this procedure come from a previous study that I haven't read? Does it improve performance?
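
For reference, a minimal sketch of the kind of stabilization trick usually meant here, assuming it is the common max-subtraction before softmax (the function name `stable_softmax_attn` and the exact form are illustrative, not the repository's actual code):

```python
import torch

def stable_softmax_attn(attn_logits: torch.Tensor) -> torch.Tensor:
    # Softmax is shift-invariant, so subtracting the per-row maximum
    # leaves the result mathematically unchanged, but it keeps exp()
    # from overflowing when logits grow large, which stabilizes
    # training of the attention module.
    attn_logits = attn_logits - attn_logits.max(dim=-1, keepdim=True)[0]
    return attn_logits.softmax(dim=-1)

# Example: scaled dot-product attention weights for random q/k.
q = torch.randn(2, 8, 16, 64)  # (batch, heads, queries, head_dim)
k = torch.randn(2, 8, 16, 64)
attn = stable_softmax_attn(q @ k.transpose(-2, -1) / 64 ** 0.5)
```

Because the subtraction cancels inside the softmax, such a trick changes only numerical behavior, not the learned function, which would be consistent with the reply that it has little influence on final performance.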