aliostad closed this issue 5 years ago
It truly does not converge; I tried other environments for testing.
@Bigpig4396 so have you managed to find a Keras impl of PPO? I am planning to build one soon if none exists.
No, there is no Keras version. I have no idea why it is not converging; I think the code should work, and I did not find any logical error.
@Bigpig4396 I do not understand the idea of using a dummy value and action for part of the training. I have a feeling it might have to do with it. I reviewed OpenAI's canonical PPO code (done in tf) and did not see such a thing.
In his actor model, he needs the advantage value and the action distribution of the old network for training. These two are set up as two extra inputs of the network. In the training phase, the advantage and the old action distribution are fed in for calculating the loss. But in the prediction phase, the actor only needs the observation as input to predict the next action, not the advantage value or the old network's action distribution. At that point these two inputs (advantage value and old action distribution) should be fed zeros, which are the dummy value and action.
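To make this pattern concrete, here is a minimal sketch of an actor with the two extra inputs, assuming Keras 2.x with the TensorFlow backend; the sizes, layer shapes, and the simplified surrogate loss are illustrative assumptions, not the repository's exact code:

```python
import numpy as np
from keras import backend as K
from keras.layers import Input, Dense
from keras.models import Model

NUM_STATE = 8     # assumed observation size
NUM_ACTIONS = 4   # assumed number of discrete actions

def surrogate_loss(advantage, old_prediction):
    # The extra inputs are captured by this closure, so the loss can use the
    # advantage and the old policy's action distribution at training time.
    def loss(y_true, y_pred):
        prob = K.sum(y_true * y_pred, axis=-1)
        old_prob = K.sum(y_true * old_prediction, axis=-1)
        ratio = prob / (old_prob + 1e-10)
        return -K.mean(ratio * K.flatten(advantage))  # PPO clipping omitted for brevity
    return loss

state = Input(shape=(NUM_STATE,))
advantage = Input(shape=(1,))
old_prediction = Input(shape=(NUM_ACTIONS,))

x = Dense(64, activation='tanh')(state)
policy = Dense(NUM_ACTIONS, activation='softmax')(x)

actor = Model(inputs=[state, advantage, old_prediction], outputs=policy)
actor.compile(optimizer='adam', loss=surrogate_loss(advantage, old_prediction))

# Training: feed the real advantages and the old policy's distributions.
# actor.fit([states, advantages, old_predictions], one_hot_actions, ...)

# Prediction: only the observation matters, so the extra inputs get zeros.
DUMMY_ADVANTAGE = np.zeros((1, 1))
DUMMY_OLD_PRED = np.zeros((1, NUM_ACTIONS))
# action_probs = actor.predict([obs, DUMMY_ADVANTAGE, DUMMY_OLD_PRED])
```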
Some other works use two actor models, one for the new policy and one for the old. But this one keeps the sampled trajectory of the old actor model and only uses one new actor model. I think it is clever. I just don't know why it does not converge. It is expected to.
@Bigpig4396 thanks for the explanation, it helped me to understand it better. 👍
Hey, sorry for the late reply. The issue was mostly weak exploration. I've pushed a version that converges on LunarLander-v2 in around 1000 episodes. It is, however, for the continuous case; for the discrete case I would try to add an exploration heuristic such as e-greedy.
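For anyone experimenting with the discrete case in the meantime, a minimal sketch of what e-greedy action selection on top of the actor's softmax output could look like (the function name and epsilon value are assumptions, not part of this repository):

```python
import numpy as np

def select_action(action_probs, epsilon=0.1):
    """With probability epsilon pick a uniformly random action,
    otherwise sample from the policy's softmax output."""
    num_actions = len(action_probs)
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    return np.random.choice(num_actions, p=action_probs)
```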
Hello. Could you please help to write a discrete version? I have an environment whose observation is an integer vector of size (1, 8) and which outputs 5 actions indexed by integers 0-4. If you can provide this version, it will be greatly helpful!!
Same here. I have a discrete environment with a big action space and I would appreciate it if you could get the discrete version working.
I'll give it a go, but for discrete problems DQN and its variants tend to outperform on-policy algorithms like PPO. If it is an option, I would recommend using Rainbow https://arxiv.org/abs/1710.02298 or parts of it. As for the current implementation, it converges correctly on very simple discrete problems (e.g. CartPole) but has issues on more complex ones. I believe it comes down to fine-tuning its hyperparameters (entropy, lr, batch_size) and finding an appropriate exploration scheme. I'll try to play around and find working hyperparams/exploration for discrete lunar lander to have a good discrete baseline.
Thanks. Gentle reminder: I have tried discrete LunarLander-v2 with OpenAI's implementation of PPO (in tf) and it converges pretty quickly - hence it is able to solve more complex discrete problems too. I would love to be able to try it on much more complex problems; sadly their impl does not lend itself to my current setup.
Ah, then I know where to start looking, thank you
I don't really want to pester, but I was wondering if you had the time to look into it.
Gave it a quick look, and I do have problems with convergence on discrete problems. The OpenAI implementation has many tricks that my implementation does not have (advantage normalization and GAE, to name a few). Or maybe there is a bug in my loss function. I don't have time at the moment, but next week I should have time to debug everything and hopefully get convergence.
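For reference, a hedged sketch of the two tricks mentioned above, GAE and advantage normalization, assuming numpy arrays of per-step rewards, value estimates with one extra bootstrap entry appended, and done flags; the gamma and lam defaults are common choices, not this repository's settings:

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation; `values` holds one extra
    bootstrap value for the state after the last step."""
    advantages = np.zeros(len(rewards), dtype=np.float32)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        non_terminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * non_terminal - values[t]
        gae = delta + gamma * lam * non_terminal * gae
        advantages[t] = gae
    return advantages

def normalize_advantages(adv, eps=1e-8):
    # Zero-mean, unit-variance advantages tend to stabilize the surrogate loss.
    return (adv - adv.mean()) / (adv.std() + eps)
```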
@OctThe16th Thank you! I appreciate you taking the time to look into it. I am in no rush now to use it anymore (I have used another algo for now) but I will be waiting patiently to try it whenever you get the time to make it ready.
Considering that there is currently no Keras implementation available, your work will be the first one.
It does not converge on a discrete action problem.
@alexmosc I am also working on porting PPO from tf to keras but @OctThe16th could finish quicker.
Thanks. I will try to debug the loss function. It looks correct, but I noticed that it does not converge even on one example, so there could be an internal problem related to tf or Keras.
@alexmosc a bug in tf or Keras would surprise me greatly, but obviously it is a possibility.
Hey guys, sorry it took so long. There were some dimensionality issues when I was summing the probabilities in the discrete loss function. I have pushed a fix that converges on discrete lunar lander, among other things. Thanks for pointing that issue out!
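For anyone debugging a similar discrete loss: a common dimensionality pitfall is that summing the one-hot-masked probabilities over the action axis produces a tensor of shape (batch,), while the advantage input has shape (batch, 1), so multiplying the two broadcasts to (batch, batch) and silently distorts the loss. A hedged sketch of a clipped surrogate that keeps the shapes consistent (an illustration, not necessarily this repository's exact fix):

```python
from keras import backend as K

CLIP_EPSILON = 0.2  # assumed PPO clipping parameter

def clipped_surrogate(advantage, old_prediction):
    def loss(y_true, y_pred):
        prob = K.sum(y_true * y_pred, axis=-1)               # (batch,)
        old_prob = K.sum(y_true * old_prediction, axis=-1)   # (batch,)
        ratio = prob / (old_prob + 1e-10)
        adv = K.flatten(advantage)                           # (batch, 1) -> (batch,)
        clipped = K.clip(ratio, 1.0 - CLIP_EPSILON, 1.0 + CLIP_EPSILON)
        return -K.mean(K.minimum(ratio * adv, clipped * adv))
    return loss
```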
@OctThe16th AWESOME! Can't wait to test it!
Hi,
I have tried running it, but frankly I do not see it converging even similarly to DQN, let alone OpenAI's PPO.
Not sure if I am doing anything wrong, but I am essentially just running main.py straight with no modification (only adding self.env.render() to view the Lunar Lander).
You're right, my previous fix only made it slightly better. I removed all summation and it converges much faster on CartPole (don't have Box2D currently) than with it. I believe this latest push should have fixed the loss function for good, but I can't test it on lunar lander right now, so let me know.
OK, if you do not mind I will open it again. However, please say so if you believe you will not be able to work on it.
No issues, if it isn't working that's on me. I'll get it going for good after NIPS
Hey, this is my current result on LunarLander-V2. Are the results you obtain with the baseline significantly better?
(I changed the code since last time)
It is definitely better but not as robust as DQN or OpenAI. I am happy to close if that is the intended performance.
Well, I might be able to change the advantage calculation to GAE and make a few more changes to exploration to match the OpenAI implementation more closely, but I feel like the loss function is now correct and the remaining difference is caused by specific implementation details that I don't plan on going over anytime soon.
Sure, thanks for the work. Now that it converges, I believe this issue should be closed.
Hi there!
Thanks for your great work plus discussion.
I have a question: when it comes to discrete actions and a large action space, is there any Keras implementation of PPO for RL that works well? If not, is there any Keras implementation of DQN that works well?
I ask because it seems like this project is the BEST Keras implementation of PPO, and it seems like there is still a performance gap between this one and the original OpenAI code (in TF).
Again, thanks for your great work!
@aliostad @OctThe16th @alexmosc: many thanks!
Hi,
I have started using this on LunarLander-v2, which is not continuous, and I do not yet see convergence. Did this code converge when you were testing? Also, for training these simple cases, did you need minutes, hours, or days of training?
Thanks for the great work.