Hi, basically, in the original PyTorch multi-head attention, the query, key, and value projections are included inside the module. In ConditionalDETR, however, we separate the content and position queries/keys (as mentioned in Section 3.3, first paragraph, of our paper). To achieve this, we remove the query, key, and value projections from the attention module and apply them manually outside it, which gives us the needed flexibility.
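To make this concrete, here is a rough sketch of the idea (not the exact code in models/attention.py; the class and argument names here are made up, and the combination is shown as addition for simplicity, whereas the decoder cross-attention in the repo concatenates the content and spatial parts along the channel dimension): each part gets its own Linear projection outside the attention, and the attention itself is plain scaled dot-product attention with no internal in-projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """Illustrative sketch: projections applied outside the attention core."""

    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.nhead = nhead
        self.head_dim = d_model // nhead
        # separate projections for the content and position parts
        self.q_content_proj = nn.Linear(d_model, d_model)
        self.q_pos_proj = nn.Linear(d_model, d_model)
        self.k_content_proj = nn.Linear(d_model, d_model)
        self.k_pos_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, q_content, q_pos, k_content, k_pos, value):
        # inputs: (batch, seq, d_model); project each part independently,
        # then combine (addition here; the repo concatenates in cross-attention)
        q = self.q_content_proj(q_content) + self.q_pos_proj(q_pos)
        k = self.k_content_proj(k_content) + self.k_pos_proj(k_pos)
        v = self.v_proj(value)

        b, n, d = q.shape
        m = k.shape[1]
        # split into heads: (batch, nhead, seq, head_dim)
        q = q.view(b, n, self.nhead, self.head_dim).transpose(1, 2)
        k = k.view(b, m, self.nhead, self.head_dim).transpose(1, 2)
        v = v.view(b, m, self.nhead, self.head_dim).transpose(1, 2)

        # plain scaled dot-product attention; no in-projection happens here
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.out_proj(out)
```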
Got it. Thanks a lot!
Hi @DeppMeng, could you explain why the content and position embeddings are projected by a Linear layer before being added/concatenated in the decoder? The original Transformer, along with most framework implementations, adds them first and then projects the sum through a Linear layer when computing attention. Is there a paper I could refer to, perhaps?
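For concreteness, here is a toy sketch of the two orderings I mean (the variable names are just for illustration):

```python
import torch
import torch.nn as nn

d_model = 256
content = torch.randn(2, 100, d_model)  # toy content embeddings
pos = torch.randn(2, 100, d_model)      # toy position embeddings

# Original Transformer style: add first, then one shared projection,
# so both parts are multiplied by the same weight matrix
shared_proj = nn.Linear(d_model, d_model)
q_add_first = shared_proj(content + pos)

# ConditionalDETR decoder style: project each part with its own Linear,
# then combine, so the two parts get decoupled weights
content_proj = nn.Linear(d_model, d_model)
pos_proj = nn.Linear(d_model, d_model)
q_proj_first = content_proj(content) + pos_proj(pos)
```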
I see that you re-implement multi-head attention in models/attention.py. Are there any differences from the original implementation? Since the code is quite long, it is hard for me to spot the difference. Could you kindly point it out? Thanks!