Hi, basically, in the original PyTorch multi-head attention, the query, key, and value projections are included inside the module. In ConditionalDETR, however, we separate the content and position queries/keys (as mentioned in Section 3.3, first paragraph, of our paper). To achieve this, we remove the query, key, and value projections from the attention module and apply them manually outside it, which gives us the needed flexibility.
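To make this concrete, here is a rough sketch of the idea (not the exact code in models/attention.py; the class and argument names here are made up, and the combination is shown as addition for simplicity, whereas the decoder cross-attention in the repo concatenates the content and spatial parts along the channel dimension): each part gets its own Linear projection outside the attention, and the attention itself is plain scaled dot-product attention with no internal in-projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """Illustrative sketch: projections applied outside the attention core."""

    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.nhead = nhead
        self.head_dim = d_model // nhead
        # separate projections for the content and position parts
        self.q_content_proj = nn.Linear(d_model, d_model)
        self.q_pos_proj = nn.Linear(d_model, d_model)
        self.k_content_proj = nn.Linear(d_model, d_model)
        self.k_pos_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, q_content, q_pos, k_content, k_pos, value):
        # inputs: (batch, seq, d_model); project each part independently,
        # then combine (addition here; the repo concatenates in cross-attention)
        q = self.q_content_proj(q_content) + self.q_pos_proj(q_pos)
        k = self.k_content_proj(k_content) + self.k_pos_proj(k_pos)
        v = self.v_proj(value)

        b, n, d = q.shape
        m = k.shape[1]
        # split into heads: (batch, nhead, seq, head_dim)
        q = q.view(b, n, self.nhead, self.head_dim).transpose(1, 2)
        k = k.view(b, m, self.nhead, self.head_dim).transpose(1, 2)
        v = v.view(b, m, self.nhead, self.head_dim).transpose(1, 2)

        # plain scaled dot-product attention; no in-projection happens here
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.out_proj(out)
```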
Got it. Thanks a lot!
Hi @DeppMeng, could you explain why the content and position embeddings are projected by a Linear layer before being added/concatenated in the decoder? The original Transformer, along with most framework implementations, adds them first and then projects the sum through a Linear layer when computing attention. Is there a paper I could refer to, perhaps?
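For concreteness, here is a toy sketch of the two orderings I mean (the variable names are just for illustration):

```python
import torch
import torch.nn as nn

d_model = 256
content = torch.randn(2, 100, d_model)  # toy content embeddings
pos = torch.randn(2, 100, d_model)      # toy position embeddings

# Original Transformer style: add first, then one shared projection,
# so both parts are multiplied by the same weight matrix
shared_proj = nn.Linear(d_model, d_model)
q_add_first = shared_proj(content + pos)

# ConditionalDETR decoder style: project each part with its own Linear,
# then combine, so the two parts get decoupled weights
content_proj = nn.Linear(d_model, d_model)
pos_proj = nn.Linear(d_model, d_model)
q_proj_first = content_proj(content) + pos_proj(pos)
```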
I see that you re-implement multi-head attention in models/attention.py. Are there any differences from the original implementation? Since the code is quite long, it is hard for me to spot the difference. Could you kindly point it out? Thanks!