Open pengzhenghao opened 3 years ago
Hi, thanks for the question and my apologies for the late reply.
It would have been clearer if I had put the subscript i on y, because these targets are indeed individual to each Q network. You are definitely right about that.
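For readers following along, a per-agent target with the subscript written out might look like the sketch below. This is not taken from the paper: it assumes the usual soft Bellman backup from SAC applied to a centralized critic, and the symbols ($\bar{Q}_i$ for agent i's target critic, $x'$ for the joint next observation, $o'_j$ for agent j's next observation, $\alpha$ for the entropy temperature) are illustrative choices.

```latex
y_i = r_i + \gamma \left( \bar{Q}_i\left(x', a'_1, \dots, a'_N\right)
      - \alpha \log \pi_i\left(a'_i \mid o'_i\right) \right),
\qquad a'_j \sim \pi_j\left(\cdot \mid o'_j\right) \ \text{for } j = 1, \dots, N
```

Under these assumptions each agent i has its own reward, its own target critic, and its own entropy term, which is why the target carries the subscript.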
About your comparison with independent Q-learning: I think there are a few more differences. Because of its centralized critics, MASAC is a centralized-learning, decentralized-execution algorithm, whereas I believe independent Q-learning is a purely independent-learners algorithm. In addition, I don't think independent Q-learning usually uses actors or can handle continuous action spaces. I also think the use of centralized critics is a fairly important distinction, as it
I believe that the algorithm closest to MASAC would perhaps be MADDPG, since MASAC is simply a maximum-entropy variant of the MADDPG algorithm.
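To make the MADDPG comparison concrete, here is a minimal sketch of how the two critic targets could differ. It is not code from the repo: the function names, the default `alpha` and `gamma` values, and the toy numbers are all hypothetical, and it assumes the standard SAC-style entropy bonus. The critic value `next_q` stands in for a centralized target critic evaluated on the joint next observation and all agents' next actions.

```python
def maddpg_target(reward, next_q, gamma=0.99, done=False):
    """MADDPG-style target: r + gamma * Q_target(x', a'_1..a'_N)."""
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * next_q


def masac_target(reward, next_q, next_log_prob, alpha=0.2, gamma=0.99, done=False):
    """Assumed MASAC-style target: same backup plus an entropy bonus.

    next_log_prob is log pi_i(a'_i | o'_i) for the agent that owns this critic.
    """
    not_done = 0.0 if done else 1.0
    soft_value = next_q - alpha * next_log_prob
    return reward + gamma * not_done * soft_value


# Hypothetical quantities for one transition with two agents.
rewards = [1.0, 0.5]            # r_i
next_qs = [2.0, 3.0]            # target critic i on the joint next state/actions
next_log_probs = [-1.2, -0.7]   # log-probability of agent i's own next action

# One target per agent: each critic gets its own y_i.
y_maddpg = [maddpg_target(r, q) for r, q in zip(rewards, next_qs)]
y_masac = [
    masac_target(r, q, lp)
    for r, q, lp in zip(rewards, next_qs, next_log_probs)
]

for i, (y_d, y_s) in enumerate(zip(y_maddpg, y_masac)):
    print(f"agent {i}: MADDPG-style target {y_d:.4f}, MASAC-style target {y_s:.4f}")
```

With `alpha` set to 0 the second function reduces to the first, which is the sense in which one can be read as a maximum-entropy variant of the other.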
Hi Daniel! Thanks for this excellent repo! I enjoy reading this paper too!
I have a small question about the baseline MASAC in your paper.
In the above equation, you do not give the details of the Q target y, so I guess it is just the usual target:
right? But you don't have a subscript "i" in y, so maybe I am wrong?
If that is true, can I consider MASAC to be a maximum-entropy variant of independent Q-learning with a centralized critic for each agent?
Thanks!!