Closed xiaosly closed 6 months ago
Hi. The HA series algorithms, including HADDPG, HATD3, and HASAC, all require advantage decomposition. The advantage decomposition lemma is used in the proof of Lemma 2 in HAML and Proposition 1 in MEHARL, so the motivation for sequential updates still stems from the advantage decomposition lemma. As for the main difference between MADDPG and HADDPG, it essentially lies in the theoretical guarantees brought by sequential updates compared to simultaneous updates. If you have any other confusion while reading the paper, feel free to discuss it with me.
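To make the sequential vs. simultaneous distinction concrete, here is a minimal toy sketch (not from the HARL codebase; the `ToyAgent` class and its scalar "policy parameter" are purely illustrative):

```python
import random

class ToyAgent:
    """Hypothetical stand-in for a per-agent policy (not the HARL API)."""
    def __init__(self, name):
        self.name = name
        self.theta = 0.0  # scalar "policy parameter" for illustration

    def policy_gradient(self, joint_thetas):
        # Toy gradient that depends on the *current* joint parameters,
        # so later agents in a sequential sweep see teammates' fresh values.
        return 1.0 + 0.1 * sum(joint_thetas)

    def apply(self, grad, lr=0.5):
        self.theta += lr * grad

def simultaneous_update(agents):
    # MADDPG-style: every gradient is computed against one frozen
    # snapshot of the joint policy, then all updates are applied at once.
    grads = [a.policy_gradient([b.theta for b in agents]) for a in agents]
    for a, g in zip(agents, grads):
        a.apply(g)

def sequential_update(agents):
    # HA-style: draw a random permutation and update agents one by one;
    # each later agent's gradient already reflects earlier agents' updates.
    for a in random.sample(agents, len(agents)):
        a.apply(a.policy_gradient([b.theta for b in agents]))
```

Starting from identical parameters, the simultaneous sweep moves all agents identically, while the sequential sweep produces distinct updates because each agent conditions on the teammates already updated in this sweep; that conditioning is what the advantage decomposition lemma exploits.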
Thanks for your reply, and sorry for the naive questions. So does the entire paper aim to prove the theoretical guarantees under sequential updates and extend them to off-policy algorithms? Since sequential updates started from TRPO and PPO, as HATRPO and HAPPO, while few works have focused on off-policy algorithms, do I understand correctly? Thus the main difference of the authors' code from conventional CTDE methods is the sequential policy update. Meanwhile, the authors tried to prove that the proposed method is theoretically solid rather than a tricky modification (sequential updates)? I am new to this area; that is my plain understanding.
Sorry for the late reply. Yes, your understanding is mostly correct. Proving the theoretical guarantees of off-policy algorithms with sequential update is one of the goals of our paper. A more important objective is to incorporate drift functional and neighborhood operators into sequential update, namely the mirror learning framework, which still assures theoretical guarantees. Thus, by selecting appropriate operators, we can derive a broader range of theoretically guaranteed algorithms.
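As a rough illustration of what a drift functional and a neighborhood operator do, here is a hedged sketch for a discrete policy: the update picks, from a candidate set (a stand-in for the neighborhood operator), the policy maximizing the expected advantage minus a KL drift penalty. The function names and the candidate-set construction are my own, not the mirror learning paper's pseudocode:

```python
import math

def kl(p, q):
    # KL divergence between two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mirror_step(old_policy, advantages, candidates, drift_coef=1.0):
    """One mirror-learning-style update for a discrete policy.

    `candidates` plays the role of the neighborhood operator (the set
    of policies the update may move to); the KL term plays the role of
    the drift functional (non-negative, zero at the old policy).
    Illustrative only, not the paper's exact formulation.
    """
    def objective(pi):
        surrogate = sum(p * a for p, a in zip(pi, advantages))
        return surrogate - drift_coef * kl(pi, old_policy)
    return max(candidates, key=objective)
```

With a small drift coefficient the update moves greedily toward high-advantage actions; a larger coefficient keeps the new policy close to the old one, which is the stabilizing effect mentioned below.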
Thanks for your reply. What's the drawback of not applying the drift functional and neighborhood operators? I didn't catch this insight here.
Using appropriate operators may stabilize the updates. You can take a look at the mirror learning paper, where these operators were first introduced.
No further questions so far. Thanks for your help!
My question is about the off-policy series of HA algorithms, such as HADDPG and HASAC. Do they require advantage decomposition? And why does the HASAC paper still present advantage decomposition as a lemma? That is confusing to me. If the decomposition lemma does not hold for the off-policy HA algorithms, how can the sequential update motivation be explained? And the last question: what's the main difference between HADDPG and MADDPG besides the sequential update? The mirror learning framework? That's my main confusion in reading the paper, and I hope you can reply!