Open · zaksemenov opened this issue 8 months ago
Thanks for your suggestion. We actually plan to add continuous control PPO soon.
Following up on this issue. Do you have an ETA for when this feature might be implemented? I would be interested in contributing to this if possible.
Continuous action PPO is scheduled for sometime towards the end of the year. If you implement it yourself we would be delighted to accept the contribution!
@rodrigodesalvobraz
Awesome! Thanks for the opportunity! I will start working on this and add any development updates/questions to this issue.
BTW, please don't forget to check CONTRIBUTING.md for important information on contributing to the project.
@rodrigodesalvobraz
Quick question, did you mean to close this issue, or does Pearl support PPO with continuous action spaces so it can be closed?
Oops, sorry, I didn't mean to close the issue. Thanks for pointing it out. No, Pearl still does not support PPO with continuous action spaces. Thanks.
@rodrigodesalvobraz
I have spent the last few days getting a better understanding of Pearl and its different modules (replay buffer, policy learner, etc.). I also got PPO for discrete action spaces working in two Gymnasium environments (CartPole-v1 and LunarLander-v2). PPO for continuous action spaces is now coded, and I am currently troubleshooting some bugs. I plan to wrap up this development in early September.
Good to hear of your progress, @kuds.
Questions
- Should PPO for continuous action spaces be broken into its own file like SAC?
Yes, please.
- Why do the baseline models use tanh layers between the fully connected layers instead of ReLU? Just preference?
I could only find it being used in VanillaContinuousActorNetwork. Since that is the last layer, it is probably so the output is normalized (ReLU wouldn't guarantee that).
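For illustration, here is a minimal PyTorch sketch of that idea (not Pearl's actual VanillaContinuousActorNetwork; the layer sizes and action bounds below are made up): a final tanh squashes each output dimension into (-1, 1), which can then be rescaled to the environment's action range, whereas ReLU would leave the output unbounded above.

```python
import torch
import torch.nn as nn

# Toy actor: a tanh on the output layer squashes each action dimension into
# (-1, 1); ReLU would only zero out negatives and leave the output unbounded above.
actor = nn.Sequential(
    nn.Linear(8, 64),
    nn.ReLU(),         # hidden layers can use ReLU (or tanh)
    nn.Linear(64, 2),
    nn.Tanh(),         # final tanh keeps the raw action in (-1, 1)
)

state = torch.randn(1, 8)
raw_action = actor(state)                               # values in (-1, 1)
low, high = -2.0, 2.0                                   # hypothetical action bounds
action = low + (raw_action + 1.0) * 0.5 * (high - low)  # rescale to [low, high]
```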
@rodrigodesalvobraz
I finished working through the bugs for PPO in continuous action spaces. I am cleaning up my changes and adding new unit tests for the ContinuousProximalPolicyOptimization class. I should have the pull request submitted sometime early next week.
Why does PPO for discrete action spaces sum the losses for the actor network instead of taking the mean/average? In pearl/policy_learners/sequential_decision_making/ppo.py, line 131:
loss = torch.sum(-torch.min(r_theta * batch.gae, clip * batch.gae))
As part of this implementation, should I normalize the generalized advantage estimates (GAE) at the batch level before applying them to the clipped loss?
Hi Kuds, I think sum and mean both work if one uses an optimizer that normalizes the gradient, such as Adam or RMSprop, but mean seems to be better if one uses SGD. Ideally, GAE normalization should be provided as an option and applied in the actor loss computation. Thanks for your work!
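As a hedged sketch of how both suggestions could be exposed together, the function below reuses the quantities from the line quoted above (r_theta, the GAE tensor, and the clipping); the function name and the normalize_gae/reduction options are hypothetical, not part of Pearl's current ppo.py.

```python
import torch

def ppo_actor_loss(
    r_theta: torch.Tensor,        # pi_new(a|s) / pi_old(a|s), shape [batch]
    gae: torch.Tensor,            # advantage estimates, shape [batch]
    epsilon: float = 0.2,         # clipping range
    normalize_gae: bool = False,  # hypothetical option discussed above
    reduction: str = "mean",      # "mean" or "sum"
) -> torch.Tensor:
    if normalize_gae:
        # Batch-level normalization of the advantages.
        gae = (gae - gae.mean()) / (gae.std() + 1e-8)
    clipped = torch.clamp(r_theta, 1.0 - epsilon, 1.0 + epsilon)
    # Clipped surrogate objective, negated so it can be minimized.
    surrogate = -torch.min(r_theta * gae, clipped * gae)
    return surrogate.mean() if reduction == "mean" else surrogate.sum()
```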
I noticed that the PPO agent initialization forces is_action_continuous=False, whereas the PPO algorithm and other libraries implementing PPO allow continuous actions. Can this be added to Pearl as well? https://github.com/facebookresearch/Pearl/blob/main/pearl/policy_learners/sequential_decision_making/ppo.py#L99
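For context, a continuous-action PPO actor typically outputs a distribution (commonly a Gaussian) and computes the probability ratio from log-probabilities of sampled actions, rather than indexing into categorical logits. The sketch below is a generic PyTorch illustration of that idea, not Pearl's ContinuousProximalPolicyOptimization; all class and variable names are hypothetical.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianActor(nn.Module):
    """Illustrative continuous-action policy head (not Pearl's implementation)."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        # State-independent log standard deviation, a common PPO choice.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor) -> Normal:
        mean = self.mean_head(self.body(state))
        return Normal(mean, self.log_std.exp())

# The PPO ratio for continuous actions: exp(log pi_new(a|s) - log pi_old(a|s)).
actor = GaussianActor(state_dim=3, action_dim=1)
states = torch.randn(5, 3)
dist = actor(states)
actions = dist.sample()
log_probs_new = dist.log_prob(actions).sum(dim=-1)
log_probs_old = log_probs_new.detach()  # stand-in for log-probs stored at rollout time
r_theta = torch.exp(log_probs_new - log_probs_old)
```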