Open zhx0506 opened 3 months ago
Thank you for your interest in our work.
We did explore applying ICES to on-policy algorithms like MAPPO when we were first developing the method. Specifically, we trained two separate critics with MAPPO - one centralized critic guided by the extrinsic reward, and one decentralized critic guided by the intrinsic scaffold we constructed. We then combined the values from both critics (relying more on intrinsic early on and shifting to extrinsic later in training) to train a single actor, following MAPPO's on-policy approach.
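For concreteness, here is a minimal sketch of how such a two-critic value blend could look in PyTorch. It is not the authors' implementation: the network sizes, the linear annealing schedule, and names like `mixed_value` and `anneal_steps` are illustrative assumptions.

```python
# Sketch (illustrative, not the ICES authors' code): blend a centralized
# extrinsic critic with a decentralized intrinsic critic for a MAPPO-style actor.
import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def mixed_value(extrinsic_critic, intrinsic_critic, state, obs, step,
                anneal_steps=1_000_000):
    """Blend the two value estimates: rely on the intrinsic critic early in
    training and shift toward the extrinsic critic later (linear schedule
    assumed here for illustration)."""
    beta = max(0.0, 1.0 - step / anneal_steps)   # intrinsic weight, decays to 0
    v_ext = extrinsic_critic(state)              # centralized: global state
    v_int = intrinsic_critic(obs)                # decentralized: local observation
    return beta * v_int + (1.0 - beta) * v_ext

if __name__ == "__main__":
    state_dim, obs_dim = 32, 16
    ext_critic, int_critic = Critic(state_dim), Critic(obs_dim)
    state = torch.randn(8, state_dim)    # batch of global states
    obs = torch.randn(8, obs_dim)        # batch of per-agent observations
    returns = torch.randn(8)             # placeholder discounted returns
    v = mixed_value(ext_critic, int_critic, state, obs, step=100_000)
    advantages = returns - v.detach()    # baseline for the single actor's PPO update
    print(advantages.shape)
```

The blended value only serves as the baseline for the actor's advantage estimate, so the standard MAPPO clipped-surrogate update is left unchanged.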
In our experiments, ICES did provide faster convergence for MAPPO in sparse reward settings. However, in terms of sample efficiency, on-policy methods were still not as effective as off-policy algorithms where each sample can be reused multiple times. Since sample efficiency was crucial for the sparse reward environments we focused on, we did not include MAPPO results in the paper.
Overall, ICES can bring gains to on-policy methods, but off-policy algorithms remain superior for sample efficiency. Nevertheless, we hope this provides some insight into how ICES could extend to policy gradient approaches like MAPPO. Please let me know if you have any other questions!
First of all, thank you for your team's contribution on ICES. I would like to ask whether it is applicable to policy-gradient-based methods?
For example, in MAPPO the difference is that only actions obtained by sampling are used when interacting with the environment, rather than actions from a greedy policy.