LuEE-C / PPO-Keras

My implementation of the Proximal Policy Optimisation algorithm using Keras as a backend

Convergence #2

Closed aliostad closed 5 years ago

aliostad commented 6 years ago

Hi,

I have started using this on LunarLander-v2, which is not continuous, and I do not yet see convergence. Did this code converge when you were testing it? Also, for training these simple cases, did you need minutes, hours, or days of training?

Thanks for the great work.

Bigpig4396 commented 6 years ago

It truly does not converge; I tried other environments for testing as well.

aliostad commented 6 years ago

@Bigpig4396 so have you managed to find a Keras implementation of PPO? If not, I am planning to build one soon.

Bigpig4396 commented 6 years ago

No, there is no Keras version. I have no idea why it is not converging; I think the code should work, and I did not find any logical error.

aliostad commented 6 years ago

@Bigpig4396 I do not understand the idea of using a dummy value and action for part of the training. I have a feeling it might have to do with it. I reviewed OpenAI's canonical PPO code (written in tf) and did not see such a thing.

Bigpig4396 commented 6 years ago

In his actor model, he needs the advantage value and the action distribution of the old network for training. These two are set as two extra inputs of the network. In the training phase, the advantage and the old action distribution are fed in to calculate the loss. But in the prediction phase, the actor only needs the observation as input to predict the next action, not the advantage value or the old network's action distribution. At that point these two inputs (advantage value and old action distribution) are fed as zeros, which are the dummy value and action.
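
For readers following along, here is a minimal sketch of the pattern described above, with a simplified (unclipped) surrogate loss; layer sizes, names, and the Keras 2 style API are assumptions for illustration, not code from this repository:

```python
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model
import keras.backend as K

OBS_DIM = 8        # assumed observation size (e.g. LunarLander-v2)
NUM_ACTIONS = 4    # assumed discrete action count

def policy_loss(advantage, old_prediction):
    # The extra Input tensors are captured by closure and only touched here,
    # i.e. only during training; prediction never evaluates this function.
    def loss(y_true, y_pred):
        prob = K.sum(y_true * y_pred, axis=-1, keepdims=True)
        old_prob = K.sum(y_true * old_prediction, axis=-1, keepdims=True)
        ratio = prob / (old_prob + 1e-10)
        return -K.mean(ratio * advantage)
    return loss

obs = Input(shape=(OBS_DIM,))
advantage = Input(shape=(1,))
old_prediction = Input(shape=(NUM_ACTIONS,))

x = Dense(64, activation='tanh')(obs)
policy = Dense(NUM_ACTIONS, activation='softmax')(x)

actor = Model(inputs=[obs, advantage, old_prediction], outputs=policy)
actor.compile(optimizer='adam', loss=policy_loss(advantage, old_prediction))

# Training would feed the real advantages and old action distributions:
# actor.fit([observations, advantages, old_probs], one_hot_actions)

# At prediction time the graph up to `policy` never reads the extra inputs,
# so zero "dummy" tensors are fed purely to satisfy the input list.
dummy_advantage = np.zeros((1, 1))
dummy_old = np.zeros((1, NUM_ACTIONS))
probs = actor.predict([np.zeros((1, OBS_DIM)), dummy_advantage, dummy_old])
```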

Bigpig4396 commented 6 years ago

Some other works use two actor models, one new and one old. This one instead keeps the sampled trajectory of the old actor model and only uses one new actor model. I think it is clever. I just don't know why it does not converge; it is expected to.

aliostad commented 6 years ago

@Bigpig4396 thanks for the explanation, it helped me to understand it better. 👍

LuEE-C commented 6 years ago

Hey, sorry for the late reply, the issue was mostly with weak exploration. I've pushed a version that converges on LunarLander-v2 in around 1000 episodes. It is however for the continuous case; for the discrete case I would try to add an exploration heuristic such as e-greedy.
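
One possible reading of that e-greedy suggestion for the discrete case, as a rough sketch (the helper name and annealing schedule are made up for illustration, not taken from the repository):

```python
import numpy as np

def epsilon_greedy(action_probs, epsilon):
    """With probability epsilon pick a uniformly random action, otherwise
    sample from the policy's softmax output."""
    num_actions = len(action_probs)
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    return np.random.choice(num_actions, p=action_probs)

# Typical usage inside the rollout loop, annealing epsilon over episodes:
# action = epsilon_greedy(probs, epsilon=max(0.05, 1.0 - episode / 500.0))
```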

Bigpig4396 commented 6 years ago

Hello. Could you please help write a discrete version? I have an environment whose observation is an integer vector of shape (1, 8), with 5 output actions indexed by integers 0-4. If you can provide this version, it would be greatly helpful!

aliostad commented 6 years ago

Same here. I have a discrete environment with a big action space and I would appreciate it if you could get the discrete case working.

LuEE-C commented 6 years ago

I'll give it a go, but for discrete problems DQN and its variants tend to outperform on-policy algorithms like PPO. If it is an option, I would recommend using Rainbow https://arxiv.org/abs/1710.02298 or parts of it. As for the current implementation, it converges correctly on very simple discrete problems (e.g. CartPole) but has issues on more complex ones. I believe it comes down to fine-tuning its hyperparameters (entropy, learning rate, batch size) and finding an appropriate exploration scheme. I'll try to play around and find working hyperparameters/exploration for discrete LunarLander to have a good discrete baseline.

aliostad commented 6 years ago

Thanks. Gentle reminder: I have tried discrete LunarLander-v2 with OpenAI's implementation of PPO (in tf) and it converges pretty quickly, so it is able to solve more complex discrete problems too. I would love to try it on much more complex problems; sadly their implementation does not lend itself to my current setup.

LuEE-C commented 6 years ago

Ah, then I know where to start looking, thank you

aliostad commented 6 years ago

I don't really want to pester, but I was wondering if you had had the time to look into it.

LuEE-C commented 6 years ago

Gave it a quick look and I do have problems with convergence on discrete problems. The OpenAI implementation has many tricks that mine does not have (advantage normalization and GAE, to name a few). Or maybe there is a bug in my loss function. I don't have time at the moment, but next week I should have time to debug everything and hopefully get convergence.
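
For reference, GAE (generalized advantage estimation) and advantage normalization, the two tricks named above, roughly look like this; a standalone sketch under assumed conventions, not code from either repository:

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2015).
    `values` must contain one extra bootstrap entry, i.e.
    len(values) == len(rewards) + 1."""
    rewards = np.asarray(rewards, dtype=np.float32)
    values = np.asarray(values, dtype=np.float32)
    advantages = np.zeros_like(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        last_adv = delta + gamma * lam * mask * last_adv
        advantages[t] = last_adv
    returns = advantages + values[:-1]   # targets for the critic
    # Advantage normalization, the other trick mentioned above:
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns
```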

aliostad commented 6 years ago

@OctThe16th Thank you! I appreciate you taking the time to look into it. I am in no rush to use it anymore (I have switched to another algorithm for now), but I will be waiting patiently to try it whenever you get the time to make it ready.

Considering there is currently no Keras implementation available, your work will be the first one.

alexmosc commented 6 years ago

It does not converge on a discrete action problem.

aliostad commented 6 years ago

@alexmosc I am also working on porting PPO from tf to keras but @OctThe16th could finish quicker.

alexmosc commented 6 years ago

> @alexmosc I am also working on porting PPO from tf to keras but @OctThe16th could finish quicker.

Thanks. I will try to debug the loss function. It looks correct, but I noticed that it does not converge even on a single example, so there may be an internal problem related to tf or Keras.

aliostad commented 5 years ago

@alexmosc a bug in tf or Keras would surprise me greatly, but it is obviously a possibility.

LuEE-C commented 5 years ago

Hey guys, sorry it took so long. There were some dimensionality issues when I was summing the probabilities in the discrete loss function. I have pushed a fix that converges on discrete LunarLander, among other things. Thanks for pointing that issue out!
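
To illustrate the kind of dimensionality pitfall being described (a sketch of a clipped discrete loss with explicit axis handling, not the exact code that was pushed): summing the action probabilities without `keepdims` yields a `(batch,)` tensor that broadcasts against a `(batch, 1)` advantage into a `(batch, batch)` matrix, which silently wrecks the loss.

```python
import keras.backend as K

def clipped_discrete_loss(advantage, old_prediction, clip=0.2, entropy_beta=1e-3):
    # `advantage` has shape (batch, 1); keepdims=True keeps the summed
    # probabilities at (batch, 1) too, so ratio * advantage stays element-wise
    # instead of broadcasting to (batch, batch).
    def loss(y_true, y_pred):
        prob = K.sum(y_true * y_pred, axis=-1, keepdims=True)
        old_prob = K.sum(y_true * old_prediction, axis=-1, keepdims=True)
        ratio = prob / (old_prob + 1e-10)
        surrogate = K.minimum(ratio * advantage,
                              K.clip(ratio, 1.0 - clip, 1.0 + clip) * advantage)
        entropy = -K.sum(y_pred * K.log(y_pred + 1e-10), axis=-1, keepdims=True)
        return -K.mean(surrogate + entropy_beta * entropy)
    return loss
```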

aliostad commented 5 years ago

@OctThe16th AWESOME! Can't wait to test it!

aliostad commented 5 years ago

Hi,

I have tried running it, but frankly I do not see it converging similarly to even DQN, let alone OpenAI's PPO.

Not sure if I am doing anything wrong, but I am essentially just running main.py straight with no modification (only adding self.env.render() to view the Lunar Lander).

LuEE-C commented 5 years ago

You're right, my previous fix only made it slightly better. I removed all summation and it converges much faster on CartPole (I don't have Box2D currently) than it did with it. I believe this latest push should have fixed the loss function for good, but I can't test it on Lunar Lander right now; let me know.

aliostad commented 5 years ago

OK, if you do not mind I will open it again. However, let me know if you believe you will not be able to work on it.

LuEE-C commented 5 years ago

No issues, if it isn't working that's on me. I'll get it going for good after NIPS

LuEE-C commented 5 years ago

[image: results on LunarLander-v2]

Hey, this is my current result on LunarLander-v2. Are the results you obtain with the baseline significantly better?

(I changed the code since last time)

aliostad commented 5 years ago

It is definitely better, but not as robust as DQN or OpenAI's implementation. I am happy to close this if that is the intended performance.

LuEE-C commented 5 years ago

Well, I might be able to change the advantage calculation to GAE and make a few more changes to exploration to match the OpenAI implementation more closely, but I feel like the loss function is now correct and the remaining difference is caused by specific implementation details that I don't plan on going over anytime soon.

aliostad commented 5 years ago

Sure. Thanks for the work. I believe that now that it converges, this issue can be closed.

hoangcuong2011 commented 5 years ago

Hi there!

Thanks for your great work plus discussion.

I have a question: when it comes to discrete actions and a large action space, is there any Keras implementation of PPO for RL that works well? If not, is there any Keras implementation of DQN for RL that works well?

I ask because it seems like this project is the best Keras implementation of PPO, and it seems like there is still a gap in performance between this one and the original OpenAI code (in TF).

Again, thanks for your great work!

@aliostad @OctThe16th @alexmosc: many thanks!