PKU-MARL / HARL

Official implementation of HARL algorithms based on PyTorch.

Question about the off-policy HA algorithms #36

Closed. xiaosly closed this issue 6 months ago.

xiaosly commented 6 months ago

I have a question about the off-policy HA algorithms, such as HADDPG and HASAC. Do they require advantage decomposition? If so, why does the HASAC paper still state advantage decomposition as a lemma? That part is confusing to me. And if the decomposition lemma does not hold for the off-policy HA algorithms, how can the motivation for sequential updates be explained? My last question is: what is the main difference between HADDPG and MADDPG besides the sequential update? The mirror learning framework? That is my main confusion in reading the paper, and I hope you can reply!

guazimao commented 6 months ago

Hi. The HA series algorithms, including HADDPG, HATD3, and HASAC, all require advantage decomposition. The advantage decomposition lemma is used in the proof of Lemma 2 in HAML and Proposition 1 in MEHARL, so the motivation for sequential updates still stems from the advantage decomposition lemma. As for the main difference between MADDPG and HADDPG, it essentially lies in the theoretical guarantees brought by sequential updates compared to simultaneous updates. If you have any other questions while reading the paper, feel free to discuss them with me.
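
For reference, the multi-agent advantage decomposition lemma being discussed can be sketched as follows (this is a paraphrase in roughly the HATRPO/HAML notation, not a quote from the papers):

```latex
% Multi-agent advantage decomposition (paraphrased): for any state s,
% any ordered agent subset i_{1:m}, and any joint policy \pi,
A_{\pi}^{\,i_{1:m}}\bigl(s,\, a^{i_{1:m}}\bigr)
  \;=\; \sum_{j=1}^{m} A_{\pi}^{\,i_j}\bigl(s,\, a^{i_{1:j-1}},\, a^{i_j}\bigr)
```

In words, a joint improvement can be obtained by letting each agent i_j improve its own advantage conditioned on the actions already chosen by the previously updated agents i_{1:j-1}, which is what motivates updating the agents one by one. The identity itself is a property of the current joint policy's value functions, so it does not depend on whether the data are collected on-policy or off-policy.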

xiaosly commented 6 months ago

Thanks for your reply. Sorry for the naive questions about this paper. So does the entire paper aim to prove the theoretical guarantees under sequential updates and extend them to off-policy algorithms? Sequential updates started from TRPO and PPO, as HATRPO and HAPPO, while little work has focused on off-policy algorithms; do I understand that right? So the main difference in the authors' code compared with conventional CTDE methods is the sequential policy update, and at the same time the authors try to prove that the proposed method has solid theoretical guarantees rather than being a tricky modification (the sequential update)? I am new to this area; that is my plain understanding.

guazimao commented 6 months ago

Sorry for the late reply. Yes, your understanding is mostly correct. Proving the theoretical guarantees of off-policy algorithms with sequential updates is one of the goals of our paper. A more important objective is to incorporate drift functionals and neighborhood operators into the sequential update, namely the mirror learning framework, which still assures theoretical guarantees. Thus, by selecting appropriate operators, we can derive a broader range of theoretically guaranteed algorithms.
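
If it helps to make the sequential update concrete, below is a minimal, self-contained sketch of an agent-by-agent off-policy actor update in the spirit of HADDPG. It is not the repository's actual code; `Actor`, `Critic`, and the batch layout are simplified stand-ins, and the critic and target-network updates are omitted.

```python
import random
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Deterministic per-agent policy: observation -> action in [-1, 1]."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, act_dim), nn.Tanh(),
        )

    def forward(self, obs):
        return self.net(obs)


class Critic(nn.Module):
    """Centralized action-value function Q(s, a^1, ..., a^n)."""
    def __init__(self, state_dim, joint_act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_act_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state, joint_action):
        return self.net(torch.cat([state, joint_action], dim=-1))


def sequential_actor_update(actors, actor_optims, critic, batch):
    """Update the actors one at a time in a freshly drawn random order.

    Agents updated earlier in the order contribute actions from their
    already-updated policies when later agents query the centralized critic.
    This is the part that differs from MADDPG's simultaneous update, where
    every agent is updated against the other agents' old policies.
    """
    state, obs = batch["state"], batch["obs"]  # obs: list of per-agent observation tensors
    n = len(actors)
    # Actions of the current policies; detached so gradients flow only
    # through the one agent being updated at each step.
    joint_actions = [actors[i](obs[i]).detach() for i in range(n)]

    for i in random.sample(range(n), n):        # random permutation of agents
        new_action = actors[i](obs[i])          # keep the graph for agent i only
        joint = torch.cat(joint_actions[:i] + [new_action] + joint_actions[i + 1:], dim=-1)
        loss = -critic(state, joint).mean()     # ascend the centralized Q-value
        actor_optims[i].zero_grad()
        loss.backward()                         # critic grads are produced but not stepped here
        actor_optims[i].step()
        # Refresh agent i's stored action so that agents later in the order
        # respond to its updated policy.
        joint_actions[i] = actors[i](obs[i]).detach()
```

In a full implementation the critic would first be trained from the replay buffer with target networks, and HATD3/HASAC change the per-agent loss (target policy smoothing, stochastic policies with an entropy term), but the sequential structure stays the same.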

xiaosly commented 6 months ago

Thanks for your reply. What is the drawback of not applying the drift functional and neighborhood operators? I did not catch this insight.

guazimao commented 6 months ago

Using appropriate operators may stabilize the updates. You can take a look at the mirror learning paper, where these operators were first introduced.
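
Roughly, and paraphrasing the mirror learning formulation rather than quoting it, each agent's sequential update can be viewed as a drift-regularized maximization over a neighborhood of its current policy:

```latex
% Paraphrased mirror-learning-style update for agent i_j (not a quote from the paper):
\pi^{i_j}_{\text{new}}
  \;=\; \arg\max_{\pi^{i_j} \in \mathcal{N}\left(\pi^{i_j}_{\text{old}}\right)}
  \; \mathbb{E}\!\left[ A^{i_j}\!\left(s,\, a^{i_{1:j-1}}_{\text{new}},\, a^{i_j}\right) \right]
  \;-\; \mathcal{D}\!\left(\pi^{i_j} \,\middle\|\, \pi^{i_j}_{\text{old}}\right)
```

Choosing the trivial drift (zero) and the whole policy space as the neighborhood recovers the plain sequential improvement step, while a KL-style drift or a clipped neighborhood penalizes large policy jumps, which is where the extra stability comes from.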

xiaosly commented 6 months ago

No further questions so far. Thanks for your help!!