Open · zaksemenov opened this issue 8 months ago
Thanks for your suggestion. We actually plan to add continuous control PPO soon.
Following up on this issue. Do you have an ETA for when this feature might be implemented? I would be interested in contributing to this if possible.
Continuous action PPO is scheduled for sometime towards the end of the year. If you implement it yourself we would be delighted to accept the contribution!
@rodrigodesalvobraz
Awesome! Thanks for the opportunity! I will start working on this and add any development updates/questions to this issue.
BTW, please don't forget to check CONTRIBUTING.md for important information on contributing to the project.
@rodrigodesalvobraz
Quick question, did you mean to close this issue, or does Pearl support PPO with continuous action spaces so it can be closed?
Oops, sorry, I didn't mean to close the issue. Thanks for pointing it out. No, Pearl still does not support PPO with continuous action spaces. Thanks.
@rodrigodesalvobraz
I have spent the last few days getting a better understanding of Pearl and its different modules (replay buffer, policy learner, etc.). I also got PPO for discrete action spaces working in two Gymnasium environments (CartPole-v1 and LunarLander-v2). PPO for continuous action spaces is now coded, and I am currently troubleshooting some bugs. I plan to wrap up this development in early September.
Good to hear of your progress, @kuds.
Questions
- Should PPO for continuous action spaces be broken into its own file like SAC?
Yes, please.
- Why do the baseline models use tanh layers between the fully connected layers instead of ReLU? Just preference?
I could only find it being used in VanillaContinuousActorNetwork. Since that is the last layer, it is probably so the output is normalized (ReLU wouldn't guarantee that).
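For illustration, here is a minimal PyTorch sketch of that idea (not Pearl's actual VanillaContinuousActorNetwork; the layer sizes and action bounds below are made up): a final tanh squashes each output dimension into (-1, 1), which can then be rescaled to the environment's action range, whereas ReLU would leave the output unbounded above.

```python
import torch
import torch.nn as nn

# Toy actor: a tanh on the output layer squashes each action dimension into
# (-1, 1); ReLU would only zero out negatives and leave the output unbounded above.
actor = nn.Sequential(
    nn.Linear(8, 64),
    nn.ReLU(),         # hidden layers can use ReLU (or tanh)
    nn.Linear(64, 2),
    nn.Tanh(),         # final tanh keeps the raw action in (-1, 1)
)

state = torch.randn(1, 8)
raw_action = actor(state)                               # values in (-1, 1)
low, high = -2.0, 2.0                                   # hypothetical action bounds
action = low + (raw_action + 1.0) * 0.5 * (high - low)  # rescale to [low, high]
```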
@rodrigodesalvobraz
I finished working through the bugs for PPO in continuous action spaces. I am cleaning up my changes and adding new unit tests for the ContinuousProximalPolicyOptimization class. I should have the pull request submitted sometime early next week.
Why does PPO for discrete action spaces sum the losses for the actor network instead of taking the mean/average? In pearl/policy_learners/sequential_decision_making/ppo.py, line 131:
loss = torch.sum(-torch.min(r_theta * batch.gae, clip * batch.gae))
As part of this implementation, should I normalize the generalized advantage estimates (GAE) at the batch level before applying them to the clipped loss?
Hi Kuds, I think sum and mean both work if one uses an optimizer that normalizes the gradient, such as Adam or RMSprop, but mean seems to be better if one uses SGD. Ideally, GAE normalization should be provided as an option and applied in the actor loss computation. Thanks for your work!
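As a hedged sketch of how both suggestions could be exposed together, the function below reuses the quantities from the line quoted above (r_theta, the GAE tensor, and the clipping); the function name and the normalize_gae/reduction options are hypothetical, not part of Pearl's current ppo.py.

```python
import torch

def ppo_actor_loss(
    r_theta: torch.Tensor,        # pi_new(a|s) / pi_old(a|s), shape [batch]
    gae: torch.Tensor,            # advantage estimates, shape [batch]
    epsilon: float = 0.2,         # clipping range
    normalize_gae: bool = False,  # hypothetical option discussed above
    reduction: str = "mean",      # "mean" or "sum"
) -> torch.Tensor:
    if normalize_gae:
        # Batch-level normalization of the advantages.
        gae = (gae - gae.mean()) / (gae.std() + 1e-8)
    clipped = torch.clamp(r_theta, 1.0 - epsilon, 1.0 + epsilon)
    # Clipped surrogate objective, negated so it can be minimized.
    surrogate = -torch.min(r_theta * gae, clipped * gae)
    return surrogate.mean() if reduction == "mean" else surrogate.sum()
```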
I noticed that the PPO agent initialization forces is_action_continuous=False, whereas the PPO algorithm and other libraries implementing PPO allow continuous actions. Can this be added to Pearl as well? https://github.com/facebookresearch/Pearl/blob/main/pearl/policy_learners/sequential_decision_making/ppo.py#L99
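For context, a continuous-action PPO actor typically outputs a distribution (commonly a Gaussian) and computes the probability ratio from log-probabilities of sampled actions, rather than indexing into categorical logits. The sketch below is a generic PyTorch illustration of that idea, not Pearl's ContinuousProximalPolicyOptimization; all class and variable names are hypothetical.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianActor(nn.Module):
    """Illustrative continuous-action policy head (not Pearl's implementation)."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        # State-independent log standard deviation, a common PPO choice.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor) -> Normal:
        mean = self.mean_head(self.body(state))
        return Normal(mean, self.log_std.exp())

# The PPO ratio for continuous actions: exp(log pi_new(a|s) - log pi_old(a|s)).
actor = GaussianActor(state_dim=3, action_dim=1)
states = torch.randn(5, 3)
dist = actor(states)
actions = dist.sample()
log_probs_new = dist.log_prob(actions).sum(dim=-1)
log_probs_old = log_probs_new.detach()  # stand-in for log-probs stored at rollout time
r_theta = torch.exp(log_probs_new - log_probs_old)
```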