Alfredvc / paac

Open source implementation of the PAAC algorithm presented in Efficient Parallel Methods for Deep Reinforcement Learning
https://arxiv.org/abs/1705.04862

Adapting paac for CartPole #1

Closed · zencoding closed this 7 years ago

zencoding commented 7 years ago

Hi, thanks for the great implementation. I am currently learning RL and am trying to adapt paac to a simple use case, CartPole. I modified the paac code to include a new environment for CartPole and also changed the NIPS network to a simple linear network. In essence, I am trying to reproduce the A3C implementation of CartPole from https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py. Running paac on CartPole never seems to converge to higher rewards; the maximum reward I get is around 30. I understand that every environment needs tuning of the hyperparameters, but I don't know what else I can try to make it work for a simple use case like CartPole. The reference A3C implementation at https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py converges to successful rewards after a few thousand steps, but the paac implementation never moves beyond 30. Can you recommend anything else I can do to make it work, or am I missing any fundamental settings? The changes I have already tried are:

  1. Change the learning rate; lower learning rates seem to do better
  2. Change the network model to multiple layers, 128->64->16 (with ReLU), and another configuration, 512->256->128->64->16 (see the sketch after this list)
  3. Run it for a longer duration (more than 30 minutes)
  4. Change the entropy to a higher value (this one actually causes NaNs in the gradients)
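
For reference, a rough sketch of the kind of fully connected network described in item 2 (TF 1.x layers API; the layer sizes come from the list above, but everything else is illustrative and not the actual modified paac code):

```python
import tensorflow as tf

# Illustrative sketch of the 128 -> 64 -> 16 ReLU network from item 2
# (TF 1.x layers API). CartPole has a 4-dimensional state and 2 actions;
# the head names here are hypothetical, not paac's actual variable names.
states = tf.placeholder(tf.float32, [None, 4])
h = tf.layers.dense(states, 128, activation=tf.nn.relu)
h = tf.layers.dense(h, 64, activation=tf.nn.relu)
h = tf.layers.dense(h, 16, activation=tf.nn.relu)
policy_logits = tf.layers.dense(h, 2)   # actor head
value = tf.layers.dense(h, 1)           # critic head
```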

The paac model is capable of solving much more complicated environments, and I am surprised that it is struggling with the classic, simple CartPole problem. I expected paac to solve CartPole much faster than CPU-based A3C.

Thanks in advance

Alfredvc commented 7 years ago

Hi,

I would recommend using the same network used in the implementation you have linked. They use a learning rate of 0.005, so I would try using a learning rate of 0.005*max_local_steps*environment_count/(8*32).
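
For concreteness, a minimal sketch of that scaling rule, assuming paac-style flag names (max_local_steps, environment_count) and example default values rather than a recommendation of specific settings:

```python
# Minimal sketch of the suggested learning-rate scaling; the flag names and
# example values are assumptions, plugged into the formula above.
base_lr = 0.005            # learning rate used by the linked CartPole A3C
max_local_steps = 5        # example t_max value
environment_count = 32     # example number of parallel environments

scaled_lr = base_lr * max_local_steps * environment_count / (8 * 32)
print(scaled_lr)           # 0.003125 with these example values
```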

When debugging the model I plot batch statistics: mean, max, min, and std for the value estimates, advantages, critic gradients, and actor gradients, plus the average policy distribution. This may help you figure out why the model is not converging. Does it diverge after a large gradient? Does the policy become deterministic too quickly? Does the value function never distinguish between states?
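
As a rough illustration (not code from this repo), statistics like these can be computed with a small helper, assuming the batch arrays are already available in the training loop under hypothetical names:

```python
import numpy as np

def batch_stats(name, values):
    # Print mean/max/min/std for one batch of values -- a debugging aid,
    # not part of the paac codebase.
    v = np.asarray(values, dtype=np.float32)
    print("%s: mean=%.4f max=%.4f min=%.4f std=%.4f"
          % (name, v.mean(), v.max(), v.min(), v.std()))

# Hypothetical usage inside the training loop:
# batch_stats("value", value_estimates)
# batch_stats("advantage", advantages)
# print("avg policy:", np.asarray(policy_probs).mean(axis=0))
```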

zencoding commented 7 years ago

I changed the network to the one from the linked CartPole A3C implementation and set the learning rate to the value you suggested (which with my defaults is 0.00325), but the behavior doesn't change much; the reward stays about the same.

Regarding your other questions, the clipped mean does tend to go to zero, but the raw gradients are very high; attached are some screenshots.

(Screenshots attached: 2017-07-13 3:00:00 PM and 2017-07-13 2:59:38 PM)

Alfredvc commented 7 years ago

You could try removing the gradient clipping and finding a new learning rate. Gradient clipping helps when your gradients are usually within some range and then suddenly spike, causing divergence.

This implementation uses global norm clipping, so you could also try plotting the global norm before clipping. Then you can set the clipping threshold slightly higher than the "normal/average" global norm seen during training.
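
A minimal TF 1.x sketch of logging the global norm before clipping (the loss, variable, and clip threshold below are toy placeholders, just to show where tf.global_norm and tf.clip_by_global_norm fit; this is not the repo's actual training code):

```python
import tensorflow as tf

# Toy TF 1.x example: compute the global norm before clipping so it can be
# logged/plotted, then clip with tf.clip_by_global_norm. The loss, variable,
# and clip threshold are placeholders, not paac's real ones.
w = tf.Variable([1.0, 2.0])
loss = tf.reduce_sum(tf.square(w))
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001)

grads, variables = zip(*optimizer.compute_gradients(loss))
global_norm = tf.global_norm(grads)                       # log this each step
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=3.0, use_norm=global_norm)
train_op = optimizer.apply_gradients(zip(clipped, variables))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    norm_value, _ = sess.run([global_norm, train_op])
    print("global norm before clipping:", norm_value)
```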

zencoding commented 7 years ago

I found the problem: paac was built to work with images, so the states are stored as uint8, while CartPole's state needs floats. I changed that and also removed gradient clipping and reward clipping to test. Unfortunately the rewards are still the same; the graph looks good for the raw gradients (almost zero), but the loss is not going to zero. I don't know what else to change; I will try adding randomness to the action selection (epsilon greedy). BTW, my code is at https://github.com/zencoding/paac/tree/cartpole if you want to check. It is a hacked version just to make it work with CartPole. (Screenshot attached: 2017-07-14 4:06:23 PM)

zencoding commented 7 years ago

I got it working. There was no issue with your code or the learning rate; it was just a type (uint8) conversion problem. I guess the lesson learned is to check the data types in the preprocessing. Thanks for your help, I will close this issue.

BTW, do you think this will also work with MountainCar? I tried MountainCar with an A3C implementation following the recommendations in this Reddit post, https://www.reddit.com/r/MachineLearning/comments/67fqv8/da3c_performs_badly_in_mountain_car/, and still could not get it working. The observation there was that MountainCar is a classic case for off-policy learning, since it needs extensive exploration, and on-policy methods such as A3C or paac will not work very well. Any tips for getting it working with paac, since I noticed that in your paper you indicated that paac can be made to work with off-policy methods?

Alfredvc commented 7 years ago

Good to hear that you got it working. I have experimented with experience replay, using Retrace and truncated importance sampling as in the ACER paper, so you could try that. However, it seems the issue with MountainCar is exploration; you could try increasing the entropy loss constant or adding action repeat.
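
For the action-repeat suggestion, one possible sketch is a gym wrapper (old gym step API returning obs, reward, done, info; this is an illustration, not code from this repo):

```python
import gym

class ActionRepeat(gym.Wrapper):
    # Repeat each chosen action for `repeat` environment steps and sum the
    # rewards -- one simple way to implement "action repeat".
    def __init__(self, env, repeat=4):
        super(ActionRepeat, self).__init__(env)
        self.repeat = repeat

    def step(self, action):
        total_reward = 0.0
        obs, done, info = None, False, {}
        for _ in range(self.repeat):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info

env = ActionRepeat(gym.make("MountainCar-v0"), repeat=4)
```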

captify-alazorenko commented 7 years ago

@zencoding were you finally successful in getting PAAC to learn on the CartPole env? I've checked your branch, and it still doesn't demonstrate learning on CartPole. I attach TensorBoard plots below; I'm still puzzled about what the issue is with PAAC not learning CartPole... (TensorBoard screenshot attached: cartpole_fail)

zencoding commented 7 years ago

Yes, I got it working. It was a simple fix: change the state type from uint8 to float32. Search for uint8 in paac.py and change every occurrence of it; my branch might not have all the changes since I did not update it.
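
To illustrate the kind of dtype issue being described (the shapes and names below are hypothetical, not the actual paac.py code):

```python
import numpy as np

# CartPole observations are small floats; storing them in a uint8 buffer
# (as paac does for Atari frames) truncates them to zero, so the network
# effectively sees all-zero states. Illustrative shapes only.
observation = np.array([0.03, -0.21, 0.017, 0.33])   # example CartPole state

states_uint8 = np.zeros((32, 4), dtype=np.uint8)
states_uint8[0] = observation
print(states_uint8[0])      # [0 0 0 0] -- the state information is lost

states_float32 = np.zeros((32, 4), dtype=np.float32)
states_float32[0] = observation
print(states_float32[0])    # approximately [0.03 -0.21 0.017 0.33]
```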