Hi,
I would recommend using the same network used in the implementation you have linked. They use a learning rate of 0.005, so I would try using a learning rate of 0.005*max_local_steps*environment_count/(8*32).
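For instance, something along these lines (just a sketch; the function name and the 8 * 32 baseline come straight from the formula above, not from paac's actual code):

```python
def scaled_learning_rate(max_local_steps, environment_count, base_lr=0.005):
    # Scale the reference A3C learning rate by the effective batch size,
    # relative to the 8 * 32 baseline in the formula above.
    return base_lr * max_local_steps * environment_count / (8 * 32)

# Illustrative values only, not necessarily paac's defaults:
print(scaled_learning_rate(max_local_steps=5, environment_count=32))
```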
When debugging the model I plot batch statistics: the mean, max, min and std of the value estimates, advantages, critic gradients and actor gradients, as well as the average policy distribution. This may help you figure out why the model is not converging. Does it diverge after a large gradient? Does the policy become deterministic too quickly? Does the value function fail to distinguish between states?
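Roughly what I mean by those batch statistics (a minimal sketch; the names are placeholders):

```python
import numpy as np

def batch_stats(name, values):
    # Summarize a batch quantity (value estimates, advantages, gradients, ...)
    # so it can be plotted once per update.
    values = np.asarray(values)
    return {
        name + '/mean': float(values.mean()),
        name + '/max': float(values.max()),
        name + '/min': float(values.min()),
        name + '/std': float(values.std()),
    }

# Example with dummy data standing in for one training batch:
advantages = np.random.randn(32)
print(batch_stats('advantages', advantages))
```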
I changed the network to match the linked CartPole A3C implementation and set the learning rate to the value you suggested (which with my defaults comes out to 0.00325), but the behavior doesn't change much and the reward stays roughly the same.
Regarding your other questions: the Clipped Mean does tend to go to zero, but the raw gradients are very high; attached are some screenshots.
You could try removing the gradient clipping and finding a new learning rate. Gradient clipping helps when your gradients are usually within some range and then suddenly spike, causing divergence.
This implementation uses global norm clipping, so you could also try plotting the global norm before clipping. Then you can set the clipping threshold slightly higher than the "normal/average" global norm observed during training.
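In TensorFlow 1.x terms the idea is roughly this (a sketch with a toy loss and optimizer, not paac's actual graph; the clip_norm value is just a placeholder):

```python
import tensorflow as tf

# Toy variable/loss/optimizer just to illustrate the logging and clipping.
w = tf.Variable([1.0, 2.0])
loss = tf.reduce_sum(tf.square(w))
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.005)

grads, variables = zip(*optimizer.compute_gradients(loss))

# Global norm *before* clipping -- plot this to pick a sensible threshold.
pre_clip_norm = tf.global_norm(grads)
tf.summary.scalar('gradients/global_norm_pre_clip', pre_clip_norm)

# Then clip slightly above the "normal" global norm seen during training.
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=3.0)
train_op = optimizer.apply_gradients(zip(clipped_grads, variables))
```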
I found the problem: paac was built to work with images, so states are stored as uint8, while CartPole's state needs floats. I changed that and also removed gradient clipping and reward clipping to test. Unfortunately the rewards are still the same; the raw gradients look good (almost zero) but the loss is not going to zero. I don't know what else to change, so I will try adding randomness to the action selection (epsilon-greedy). BTW, my code is at https://github.com/zencoding/paac/tree/cartpole if you want to check; it is a hacked version just to make it work with CartPole.
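Something like this is what I have in mind for the epsilon-greedy part (just a sketch, not in the branch yet):

```python
import numpy as np

def select_action(policy_probs, epsilon=0.1):
    # With probability epsilon take a uniformly random action,
    # otherwise sample from the policy distribution as usual.
    num_actions = len(policy_probs)
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    return np.random.choice(num_actions, p=policy_probs)

# e.g. for CartPole's two actions:
print(select_action(np.array([0.7, 0.3]), epsilon=0.05))
```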
I got it working; there was no issue with your code or the learning rate, it was just a type (uint8) conversion problem. Lesson learned: check the data types in the preprocessing. Thanks for your help, I will close this issue.
BTW, do you think this will also work with MountainCar? I tried MountainCar with an A3C implementation following the recommendations in the Reddit post https://www.reddit.com/r/MachineLearning/comments/67fqv8/da3c_performs_badly_in_mountain_car/, but still could not get it working. The takeaway there was that MountainCar is a classic case for off-policy learning because it needs extensive exploration, so on-policy methods such as A3C or paac will not work very well. Any tips on getting it to work with paac, since I noticed that in your paper you indicated paac can be made to work with off-policy methods?
Good to hear that you got it working. I have experimented with experience replay using Retrace and truncated importance sampling as in the ACER paper, so you could try that. However, the issue with MountainCar seems to be exploration; you could try increasing the entropy loss constant or adding action repeat.
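For the action repeat idea, a minimal gym wrapper sketch (assuming the classic gym API where step returns a 4-tuple; k=4 is just a placeholder value):

```python
import gym

class ActionRepeat(gym.Wrapper):
    # Repeat each chosen action k times, summing the rewards,
    # so a single decision pushes the car for longer.
    def __init__(self, env, k=4):
        super(ActionRepeat, self).__init__(env)
        self.k = k

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.k):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info

env = ActionRepeat(gym.make('MountainCar-v0'), k=4)
```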
@zencoding were you finally successful in training PAAC on the CartPole env? I've checked your branch, but it still doesn't demonstrate learning on CartPole. I attach TensorBoard screenshots below; I'm still puzzled about what the issue is with PAAC not learning CartPole...
Yes, I got it working. It was a simple fix: change the state type from uint8 to float32. Search for uint8 in paac.py and change every occurrence of it; my branch might not have all the changes since I did not update it.
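The kind of change I mean, illustratively (not the literal lines from paac.py):

```python
import numpy as np

batch_size, state_shape = 32, (4,)  # CartPole observation is 4 floats

# Before (Atari-style image states stored as bytes):
# states = np.zeros((batch_size,) + state_shape, dtype=np.uint8)

# After (CartPole's continuous observations need floats):
states = np.zeros((batch_size,) + state_shape, dtype=np.float32)
```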
Hi, thanks for the great implementation. I am currently learning RL and am trying to adapt paac for the simple use case of CartPole. I modified the paac code to include a new environment for CartPole and also changed the network from the NIPS architecture to a simple linear network (a rough sketch of the kind of network I mean is at the end of this post). In essence, I am trying to reproduce the A3C implementation of CartPole from https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py. Running paac on CartPole never seems to converge to higher rewards; the maximum reward I get is around 30. I understand that every environment needs hyperparameter tuning, but I don't know what else to try to make it work for such a simple case. The reference A3C implementation at https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py converges to successful rewards after a few thousand steps, but the paac implementation never moves beyond 30. Can you recommend anything else I can do to make it work, or am I missing any fundamental settings? The changes I have already tried are
The paac model is capable of solving much more complicated environments, so I am surprised that it is struggling with the classic and simplest CartPole problem. I expected paac to solve CartPole much faster than CPU-based A3C.
Thanks in advance
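For reference, roughly the kind of network I swapped in (a TensorFlow 1.x sketch; layer sizes are illustrative, in the spirit of the linked CartPole-A3C example rather than my exact code):

```python
import tensorflow as tf

# Illustrative stand-in for the "simple linear network" mentioned above:
# one small hidden layer with separate policy and value heads.
state = tf.placeholder(tf.float32, [None, 4], name='state')  # CartPole obs
hidden = tf.layers.dense(state, 16, activation=tf.nn.relu)
policy = tf.nn.softmax(tf.layers.dense(hidden, 2))           # 2 discrete actions
value = tf.layers.dense(hidden, 1)                           # state-value estimate
```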