PKU-MARL / Multi-Agent-Transformer


Question about causal attention in decoder #31

Open Porthoos opened 7 months ago

Porthoos commented 7 months ago

Hello, I'm interested in your work, and I have a question that isn't explained in the paper. I noticed that the causal attention in the decoder (the second attention block) is structured differently from a normal transformer's cross-attention:

  1. MAT's causal attention uses the encoder output as Query and the decoder self-attention output as Key and Value, while a normal transformer's cross-attention uses the encoder output as Key and Value and the decoder self-attention output as Query.
  2. MAT's residual connection after the causal attention adds the encoder output, while a normal transformer's adds the decoder self-attention output.

I have circled this in the attached figure. Is there any reason to change the structure like this?
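To make the comparison concrete, here is a minimal PyTorch sketch of the two wirings as I understand them; `nn.MultiheadAttention` is only a stand-in for the repo's attention module, and `obs_rep` / `act_rep` are dummy placeholders rather than the actual tensors in the code:

```python
import torch
import torch.nn as nn

d_model, n_head, n_agent = 64, 4, 3
cross_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)

obs_rep = torch.randn(1, n_agent, d_model)   # encoder output
act_rep = torch.randn(1, n_agent, d_model)   # decoder self-attention output

# Normal transformer cross-attention:
# Query from the decoder stream, Key/Value from the encoder output,
# residual connection around the decoder stream.
std_out, _ = cross_attn(query=act_rep, key=obs_rep, value=obs_rep)
std_x = act_rep + std_out

# MAT-style second attention block (as I read the figure):
# Query from the encoder output, Key/Value from the decoder stream,
# residual connection around the encoder output.
mat_out, _ = cross_attn(query=obs_rep, key=act_rep, value=act_rep)
mat_x = obs_rep + mat_out
```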

morning9393 commented 2 months ago

Hiya, thanks for your attention. Here we exchange the usage of obs_rep, which is different from the classical transformer. The intuitive reason is that, for MARL problems, the obs_rep from the encoder contains more information than the act_rep produced by the first attention block in the decoder (unlike traditional NLP tasks such as translation, where the two carry roughly the same amount of information). That said, the second attention block in the decoder could be removed, and the results should be theoretically the same with respect to the advantage decomposition.
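For illustration only, a rough sketch of what a decoder block might look like with the second attention block removed. This is not the code in this repo; `nn.MultiheadAttention` stands in for the repo's masked attention module, and folding obs_rep into the input is just one possible way to keep the observation information while dropping the cross-attention:

```python
import torch
import torch.nn as nn

class SimplifiedDecodeBlock(nn.Module):
    """One reading of "the second attention block could be removed":
    keep only masked self-attention over the action stream and inject
    obs_rep additively at the input. Purely a sketch, not the repo's code."""

    def __init__(self, d_model, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, act_emb, obs_rep, causal_mask):
        # Fold the (richer) observation representation into the action stream,
        # then apply masked self-attention so agent i only sees agents 1..i-1.
        x = act_emb + obs_rep
        a, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.ln1(x + a)
        x = self.ln2(x + self.mlp(x))
        return x

# Usage with a causal mask over the agent dimension.
d_model, n_head, n_agent = 64, 4, 3
block = SimplifiedDecodeBlock(d_model, n_head)
mask = torch.triu(torch.ones(n_agent, n_agent, dtype=torch.bool), diagonal=1)
out = block(torch.randn(1, n_agent, d_model), torch.randn(1, n_agent, d_model), mask)
```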