danielwillemsen / MAMBPO

DecentralizedLearning

Question about MASAC #3

Open pengzhenghao opened 3 years ago

pengzhenghao commented 3 years ago

Hi Daniel! Thanks for this excellent repo! I enjoy reading this paper too!

Here is a small question about the baseline MASAC in your paper.

[image: screenshot of the MASAC update equation from the paper]

In the above equation, you do not give details of the Q target y, so I guess it is just the normal bootstrapped target:

y = R(o_t, a_t) + gamma * Q_i(o_{t+1}, a_{t+1})

Is that right? But since y has no subscript "i", maybe I am wrong?
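For reference, in single-agent SAC the bootstrapped target also carries an entropy bonus. A minimal sketch of the per-agent version (function and variable names here are illustrative, not from the repo):

```python
def soft_q_target(r, gamma, q_next, log_pi_next, alpha):
    # Per-agent soft Bellman target, SAC-style:
    #   y_i = r + gamma * (Q_i(o_{t+1}, a_{t+1}) - alpha * log pi(a_{t+1} | o_{t+1}))
    # Note: the target does NOT subtract Q_i(o_t, a_t); that difference is the
    # TD error minimized by the critic loss, not part of the target itself.
    return r + gamma * (q_next - alpha * log_pi_next)

# Toy numbers, purely illustrative.
y = soft_q_target(r=1.0, gamma=0.99, q_next=2.0, log_pi_next=-0.5, alpha=0.2)
print(y)  # ~3.079
```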

If that is true, can I consider MASAC to be a maximum-entropy variant of independent Q-learning with a centralized critic for each agent?

Thanks!!

danielwillemsen commented 3 years ago

Hi, thanks for the question and my apologies for the late reply.

It would have been clearer to use the subscript i on y, because these targets are indeed individual to each Q network. You are definitely right about that.

About your comparison with independent Q-learning: I do think there are a few more differences. Due to the centralized critics, MASAC is a centralized-training, decentralized-execution algorithm, whereas I believe independent Q-learning is a purely independent-learners algorithm. In addition, independent Q-learning does not usually use actors, nor can it handle continuous action spaces. I also think the use of centralized critics is a fairly important distinction, as it lets each agent's critic condition on the joint observations and actions of all agents during training.
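The centralized-training, decentralized-execution distinction can be made concrete by looking at what each network sees. A toy sketch (shapes and names are my own, not the repo's):

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_dim, act_dim = 2, 4, 2
obs = [rng.standard_normal(obs_dim) for _ in range(n_agents)]
acts = [rng.standard_normal(act_dim) for _ in range(n_agents)]

# Decentralized actor of agent 0: only its local observation
# (the execution-time input).
actor_input = obs[0]

# Centralized critic of agent 0: the joint observations and actions of all
# agents, available during training only. A purely independent learner would
# instead condition on just obs[0] and acts[0].
critic_input = np.concatenate(obs + acts)

print(actor_input.shape, critic_input.shape)  # (4,) (12,)
```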

I believe that the algorithm closest to MASAC would perhaps be MADDPG, since MASAC is simply a maximum-entropy variant of the MADDPG algorithm.
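That relationship shows up directly in the critic targets. A sketch comparing the two (variable names are illustrative): with the entropy temperature at zero and a deterministic next action, the MASAC target reduces to MADDPG's.

```python
def maddpg_target(r, gamma, q_next):
    # MADDPG: y = r + gamma * Q(x', a'_1, ..., a'_N),
    # with each a'_i = mu_i(o'_i) given deterministically by the target policy.
    return r + gamma * q_next

def masac_target(r, gamma, q_next, log_pi_next, alpha):
    # MASAC: the same bootstrapped form plus the entropy bonus -alpha * log pi,
    # with next actions sampled from the stochastic policies.
    return r + gamma * (q_next - alpha * log_pi_next)

# With alpha = 0 the two targets coincide:
assert masac_target(1.0, 0.99, 2.0, -0.5, alpha=0.0) == maddpg_target(1.0, 0.99, 2.0)
```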