Open Porthoos opened 7 months ago
Hiya, thanks for your attention. Here we swap the usage of obs_rep, which differs from the classical transformer. The intuitive reason is that, for the MARL problems here, the obs_rep from the encoder contains more information than the act_rep from the first attention block in the decoder (unlike traditional NLP tasks such as translation, where the two carry roughly the same amount of information). That said, the second attention block in the decoder could be removed, and the result should be theoretically the same with respect to the advantage decomposition.
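To make the swap concrete, here is a minimal sketch of a decoder block where the cross-attention queries come from obs_rep (the encoder output) instead of from the decoder-side act_rep, which is what the reply above describes. This is an illustration built on standard `torch.nn.MultiheadAttention`, not the repository's actual module; names like `SwappedDecoderBlock`, `act_rep`, and `obs_rep` are assumptions taken from this discussion.

```python
import torch
import torch.nn as nn


class SwappedDecoderBlock(nn.Module):
    """Illustrative decoder block: cross-attention uses obs_rep as the query
    and act_rep as key/value (the reverse of a classical transformer)."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ln3 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, act_rep, obs_rep, causal_mask):
        # 1) masked (causal) self-attention over the shifted action embeddings
        h, _ = self.self_attn(act_rep, act_rep, act_rep, attn_mask=causal_mask)
        act_rep = self.ln1(act_rep + h)

        # 2) cross-attention: a classical transformer would use
        #    query=act_rep, key/value=obs_rep; here the roles are swapped
        #    because obs_rep carries more information in the MARL setting.
        h, _ = self.cross_attn(obs_rep, act_rep, act_rep, attn_mask=causal_mask)
        out = self.ln2(obs_rep + h)

        # 3) position-wise feed-forward with residual connection
        out = self.ln3(out + self.mlp(out))
        return out


# Usage sketch: batch of 4, 3 agents, embedding size 64
if __name__ == "__main__":
    n_agent, embed_dim = 3, 64
    block = SwappedDecoderBlock(embed_dim, num_heads=4)
    act_rep = torch.randn(4, n_agent, embed_dim)
    obs_rep = torch.randn(4, n_agent, embed_dim)
    mask = torch.triu(torch.ones(n_agent, n_agent, dtype=torch.bool), diagonal=1)
    print(block(act_rep, obs_rep, mask).shape)  # torch.Size([4, 3, 64])
```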
Hello, I'm interested in your work, and I have a question that isn't explained in the paper. I noticed that the causal attention in the decoder uses a different structure from normal transformers:
I have circled this in the figure. Is there any reason for changing the structure like this?