MCG-NJU / AdaMixer

[CVPR 2022 Oral] AdaMixer: A Fast-Converging Query-Based Object Detector

Some questions about Position-aware multi-head self-attentions. #6

Closed fushh closed 2 years ago

fushh commented 2 years ago

Thanks for your impressive work!

It seems that you don't use a residual connection here, and that you use the content query plus positional embedding as the values. Did you find this helpful for performance?

sebgao commented 2 years ago

Oops, this is a minor model bug: we use query_content + pe as both the attention input and the residual, but we should use query_content alone as the residual here. We used the former form in all experiments in the paper.

We have not yet experimented with the latter form. Theoretically, the latter form should have more expressive power, since it leaves the following operations free to decide whether to incorporate the PE into the content vector.
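To make the two forms concrete, here is a minimal PyTorch sketch (not the actual AdaMixer code; the names `query_content`, `pe`, and `self_attn` and all shapes are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes; AdaMixer's real configuration may differ.
embed_dim, num_heads, num_queries, batch = 256, 8, 100, 2

self_attn = nn.MultiheadAttention(embed_dim, num_heads)  # expects (L, N, E)

query_content = torch.randn(num_queries, batch, embed_dim)  # content queries
pe = torch.randn(num_queries, batch, embed_dim)             # positional embeddings

# Form used in all experiments of the paper: PE is baked into q, k, and v,
# and the residual branch also carries the PE.
x = query_content + pe
out_paper = x + self_attn(x, x, x)[0]

# Intended form: identical attention inputs, but the residual is the plain
# content query, so the PE is not accumulated into the content vector.
out_intended = query_content + self_attn(x, x, x)[0]
```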

fushh commented 2 years ago

Thanks for your reply. Do you mean it is still proper to use query_content + pe as the values in multi-head self-attention? To the best of my knowledge, PE is usually added only to the queries and keys, so I am not sure whether it is okay here.

sebgao commented 2 years ago

Yes, but only empirically: the experiments in our paper show strong results with query_content + pe as values. Adding PE only to the queries and keys is a preferred design choice in DETR-like models, but in other models, such as ViT, PE is also added to the values.

That said, in my opinion, adding the PE to only the queries and keys would be the better choice from a theoretical point of view in this case (disclaimer: I have not tested it yet due to limited GPUs 😢).
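For comparison, the DETR-style choice discussed above would inject the PE only into the queries and keys, leaving both the values and the residual as the plain content query. A minimal sketch under the same assumed names as before:

```python
import torch
import torch.nn as nn

embed_dim, num_heads, num_queries, batch = 256, 8, 100, 2

self_attn = nn.MultiheadAttention(embed_dim, num_heads)

query_content = torch.randn(num_queries, batch, embed_dim)
pe = torch.randn(num_queries, batch, embed_dim)

# DETR-style: PE only affects the attention weights through q and k;
# the values and the residual branch stay PE-free.
q = k = query_content + pe
out_detr = query_content + self_attn(q, k, query_content)[0]
```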

fushh commented 2 years ago

Thanks for your help (。^▽^)

fushh commented 1 year ago

Hello, Fu Shenghao has received your email and will handle it as soon as possible~