PKU-MARL / Multi-Agent-Transformer


Some questions about MAT-dec #16

Open lingxiao-guo opened 1 year ago

lingxiao-guo commented 1 year ago

I am very interested in this series of work. The experimental results of MAT are impressive, but the method requires autoregressive action output, which can cause decision delay in practical applications. An interesting improvement would be to combine it with non-autoregressive transformer methods from natural language processing.

The paper also presents the MAT-dec method, which does not require autoregressive output and still outperforms the baselines in many scenarios. However, the MAT-dec algorithm is not described in detail in the paper. After reading the code, my understanding is that MAT-dec replaces the decoder with an MLP and trains it directly with PPO. If this is the case, I think the comparison between MAT and MAT-dec in the paper is not fair: MAT-dec has two fewer attention blocks than MAT, and its algorithm does not use the idea of multi-agent sequential decision-making. Even so, MAT-dec still achieves performance close to MAT in many scenarios.

To explore the potential of MAT-dec and demonstrate the effectiveness of decoder autoregression, a more appropriate comparison might be to replace the encoder of MAT-dec with three attention blocks followed by an MLP, and train it with an algorithm similar to HAPPO. Did the authors have any similar considerations during the experiments? What are your thoughts on this? Thank you very much for your reply.
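For concreteness, here is a rough PyTorch sketch of the decoding difference I am describing; the names and shapes are placeholders of mine, not the actual classes in this repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only: contrasts MAT-dec's parallel decoding with
# MAT's agent-by-agent autoregressive decoding.
EMB, N_AGENTS, N_ACTIONS = 64, 3, 5
obs_rep = torch.randn(N_AGENTS, EMB)          # per-agent encoder outputs

# MAT-dec style head: one MLP, all agents decoded in a single parallel pass.
mlp_head = nn.Sequential(nn.Linear(EMB, EMB), nn.GELU(), nn.Linear(EMB, N_ACTIONS))
actions_parallel = torch.distributions.Categorical(logits=mlp_head(obs_rep)).sample()

# MAT style decoding: agent i conditions on the (one-hot) action of the
# previous agent, so acting requires N_AGENTS sequential forward passes.
decoder = nn.Linear(EMB + N_ACTIONS, N_ACTIONS)   # stand-in for the attention decoder
prev_action = torch.zeros(N_ACTIONS)              # "start token" for the first agent
actions_ar = []
for i in range(N_AGENTS):
    logits = decoder(torch.cat([obs_rep[i], prev_action]))
    a = torch.distributions.Categorical(logits=logits).sample()
    actions_ar.append(a)
    prev_action = F.one_hot(a, N_ACTIONS).float()

print(actions_parallel, torch.stack(actions_ar))
```

The sequential loop is exactly where the decision delay comes from at execution time, while the MLP head answers in one pass.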

morning9393 commented 1 year ago

Hiya, thanks a lot for your interest in our work!

Indeed, you are right that the autoregressive process can be a bottleneck in scenarios that require quick responses, and it is a good direction for further optimization. I had not noticed this problem before your suggestion, since I am not familiar with non-autoregressive transformers. After reading the paper on non-autoregressive neural machine translation (NAT, Gu et al. 2018), I am a bit concerned that the conditional independence assumption vanilla NAT makes on the output sequence (P(y_{t+1} | x) is independent of P(y_t | x)) may conflict with the advantage decomposition theorem, where P(y_{t+1} | x, y_t) depends on y_t. That said, if some state-of-the-art NAT variants have resolved this conflict, this direction is still worth exploring.
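To make the concern concrete, here is roughly how I understand the two factorizations (notation loosely follows the MAT/HAPPO papers, with i_1, ..., i_n an arbitrary fixed agent ordering):

```latex
% Autoregressive factorization used by MAT, which matches the
% multi-agent advantage decomposition theorem:
\[
  \pi\!\left(a^{i_{1:n}} \mid o\right)
    = \prod_{m=1}^{n} \pi\!\left(a^{i_m} \mid o,\, a^{i_{1:m-1}}\right),
  \qquad
  A_{\pi}\!\left(s,\, a^{i_{1:n}}\right)
    = \sum_{m=1}^{n} A_{\pi}^{\,i_m}\!\left(s,\, a^{i_{1:m-1}},\, a^{i_m}\right).
\]
% Vanilla NAT instead assumes the outputs are conditionally independent
% given the input, dropping the a^{i_{1:m-1}} conditioning above:
\[
  P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid x).
\]
```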

Regarding MAT-dec, at first we simply intended to explore the role of the autoregressive process as a simple ablation, without giving enough consideration to how to exploit MAT-dec's full potential. Thank you so much for your valuable suggestion (replacing the encoder of MAT-dec with three attention blocks followed by an MLP, and using an algorithm similar to HAPPO); I believe this has great potential to further improve MAT-dec's performance. I will try this idea later!
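If it helps the discussion, below is a minimal sketch of how I read that suggestion; the class name and hyper-parameters are just placeholders rather than code from this repo, and the HAPPO-style sequential agent-by-agent update would live in the training loop rather than in the policy itself:

```python
import torch
import torch.nn as nn

# Hypothetical deeper MAT-dec: three self-attention blocks in the encoder,
# then a per-agent MLP head; decoding stays fully parallel.
class DeepMATDecPolicy(nn.Module):
    def __init__(self, obs_dim, emb_dim, n_actions, n_heads=4, n_blocks=3):
        super().__init__()
        self.embed = nn.Linear(obs_dim, emb_dim)
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_blocks)
        self.head = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.GELU(),
                                  nn.Linear(emb_dim, n_actions))

    def forward(self, obs):                      # obs: (batch, n_agents, obs_dim)
        rep = self.encoder(self.embed(obs))      # agents attend to each other
        return torch.distributions.Categorical(logits=self.head(rep))

policy = DeepMATDecPolicy(obs_dim=10, emb_dim=64, n_actions=5)
dist = policy(torch.randn(2, 3, 10))
print(dist.sample().shape)                       # torch.Size([2, 3])
```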

lingxiao-guo commented 1 year ago

Thank you very much for your email response! I hadn't checked GitHub for a long time and only saw your reply today, sorry about that haha. Regarding the conditional independence assumption in NAT, which you mentioned may conflict with the advantage decomposition theorem, some papers have indeed alleviated this contradiction.

I highly recommend [1], which provides a concise but elegant theoretical analysis of NAT methods and gives readers a unified, in-depth understanding of the various NAT approaches. I also recommend [2] and [3], especially [2], which reports that when the sentence length is under 20 tokens, NAT's performance even surpasses AT, because NAT can consider both preceding and following contextual information during decoding, rather than only the preceding context as AT does. Perhaps this is an interesting exploration direction for MAT as well; if you are interested, it could be worth trying.

[1] Huang, Fei, et al. "On the learning of non-autoregressive transformers." International Conference on Machine Learning. PMLR, 2022.
[2] Qian, Lihua, et al. "Glancing transformer for non-autoregressive neural machine translation." arXiv preprint arXiv:2008.07905 (2020).
[3] Huang, Chenyang, et al. "Non-autoregressive translation with layer-wise prediction and deep supervision." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. No. 10. 2022.

MPHarryZhang commented 1 year ago


Thank you very much for your message. We are doing similar research on making MAT respond more quickly. Would it be possible to discuss this further with you?