Open Porthoos opened 7 months ago
Hiya, thanks for your attention. Here we swap the usage of obs_rep, which differs from the classical transformer. The intuitive reason is that, for the MARL problems here, the obs_rep from the encoder contains more information than the act_rep from the first attention block in the decoder (unlike traditional NLP tasks such as translation, where the two carry roughly the same amount of information). That said, the second attention block in the decoder could be removed, and the result should be theoretically the same with respect to the advantage decomposition.
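To make the swap concrete, here is a minimal sketch of a decoder block where the cross-attention queries come from obs_rep (the encoder output) instead of from the decoder-side act_rep, which is what the reply above describes. This is an illustration built on standard `torch.nn.MultiheadAttention`, not the repository's actual module; names like `SwappedDecoderBlock`, `act_rep`, and `obs_rep` are assumptions taken from this discussion.

```python
import torch
import torch.nn as nn


class SwappedDecoderBlock(nn.Module):
    """Illustrative decoder block: cross-attention uses obs_rep as the query
    and act_rep as key/value (the reverse of a classical transformer)."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ln3 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, act_rep, obs_rep, causal_mask):
        # 1) masked (causal) self-attention over the shifted action embeddings
        h, _ = self.self_attn(act_rep, act_rep, act_rep, attn_mask=causal_mask)
        act_rep = self.ln1(act_rep + h)

        # 2) cross-attention: a classical transformer would use
        #    query=act_rep, key/value=obs_rep; here the roles are swapped
        #    because obs_rep carries more information in the MARL setting.
        h, _ = self.cross_attn(obs_rep, act_rep, act_rep, attn_mask=causal_mask)
        out = self.ln2(obs_rep + h)

        # 3) position-wise feed-forward with residual connection
        out = self.ln3(out + self.mlp(out))
        return out


# Usage sketch: batch of 4, 3 agents, embedding size 64
if __name__ == "__main__":
    n_agent, embed_dim = 3, 64
    block = SwappedDecoderBlock(embed_dim, num_heads=4)
    act_rep = torch.randn(4, n_agent, embed_dim)
    obs_rep = torch.randn(4, n_agent, embed_dim)
    mask = torch.triu(torch.ones(n_agent, n_agent, dtype=torch.bool), diagonal=1)
    print(block(act_rep, obs_rep, mask).shape)  # torch.Size([4, 3, 64])
```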
Hello, I'm interested in your work, and I have a question that isn't explained in the paper. I noticed that the causal attention in the decoder uses a different structure from normal transformers:
I have circled this in the figure. Is there any reason for changing the structure like this?