alexis-jacq / Pytorch-DPPO

Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286

Question on algorithm itself #8

Open QiXuanWang opened 5 years ago

QiXuanWang commented 5 years ago

Usually PPO is used for continuous actions, but for OpenAI Five, shouldn't the actions be discrete? What technique makes PPO applicable to Dota 2 actions?

alexis-jacq commented 5 years ago

PPO is just a trick to regularize policy updates, and it can be used with any kind of state/action space. If the actions are discrete, just replace the multivariate Gaussian with a softmax distribution over actions. It only changes the way log_probs and entropies are computed. You have an example of a PPO implementation with both options, discrete and continuous actions, here: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/master/a2c_ppo_acktr/model.py
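As a minimal sketch (not code from this repo, class names are illustrative), the only difference between the two cases is the distribution returned by the policy head; the PPO loss itself consumes the same `log_prob` and `entropy` calls either way:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class DiscretePolicyHead(nn.Module):
    """Softmax (categorical) head for discrete action spaces."""
    def __init__(self, num_features, num_actions):
        super().__init__()
        self.logits = nn.Linear(num_features, num_actions)

    def forward(self, x):
        return Categorical(logits=self.logits(x))

class ContinuousPolicyHead(nn.Module):
    """Gaussian head for continuous action spaces."""
    def __init__(self, num_features, action_dim):
        super().__init__()
        self.mu = nn.Linear(num_features, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, x):
        return Normal(self.mu(x), self.log_std.exp())

# Either way, the PPO objective only needs:
# dist = policy_head(features)
# action = dist.sample()
# log_prob = dist.log_prob(action)   # sum over action dims for Normal
# entropy = dist.entropy()
```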

QiXuanWang commented 5 years ago

Thanks for this. Later I found some information that says the same as you. Thanks for the pointer.