astooke / rlpyt

Reinforcement Learning in PyTorch
MIT License

log_std exploding in GaussianPgAgent (MujocoFfAgent) #124

Open kzorina opened 4 years ago

kzorina commented 4 years ago

Hello

Thanks for a great library! I want to apply the PPO implementation to my own environment. I am using MujocoFfAgent and have run into an error that I cannot fix. Maybe you can help me understand where I should look?

The problem is that my actions go to infinity (if I clip them, they always sit at the min/max values). When I debugged, I found that in the GaussianPgAgent.step function the log_std values increase over time. Is there a way to limit the log_std values? Or does their growth mean I have an error somewhere else?

Some values from the GaussianPgAgent.step function, at the start:

observation[0] = {Tensor: 4} tensor([0.7083, 0.2566, 0.9929, 0.0000])
prev_action[0] = {Tensor: 9} tensor([ 0.7453, -0.6235,  1.8746, -0.6544, -1.5104,  0.2538,  0.4086,  0.1417, -1.7879])
prev_reward[0] = {Tensor} tensor(5.1418e-13)
mu[0] = {Tensor: 9} tensor([-0.0932,  0.1392,  0.0318, -0.0225,  0.3914,  0.2643,  0.0584,  0.0474, -0.2305])
log_std[0] = {Tensor: 9} tensor([0., 0., 0., 0., 0., 0., 0., 0., 0.])

At n_itr = 625000:

observation[0] = {Tensor: 4} tensor([1.2500, 0.2566, 0.9929, 0.0000])
prev_action[0] = {Tensor: 9} tensor([-2.0000, -1.0901,  2.0000, -2.0000,  1.1744, -2.0000,  2.0000, -2.0000, -2.0000])
prev_reward[0] = {Tensor} tensor(1.4636e-12)
log_std[0] = {Tensor: 9} tensor([2.3225, 2.3136, 2.3122, 2.3070, 2.3241, 2.3124, 2.3160, 2.3174, 2.3157])
mu[0] = {Tensor: 9} tensor([-0.9720,  0.9817,  0.9847, -0.8975,  0.9966,  0.9766,  0.9882, -0.9092,  0.9782])
astooke commented 4 years ago

Hmmm, I don't have a full answer for this because it's specific to one RL problem... but one thing that might help is to clip the actions inside the environment, not in the agent, so that gradients still flow for those actions; I think I've run into that issue before (sketch below). Or do you have a large entropy bonus on? (But probably you already have that turned off.)
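
A minimal sketch of the environment-side clipping, assuming a Gym-style env with a Box action space; ClipActionWrapper is an illustrative name, not something provided by rlpyt:

```python
import numpy as np
import gym

class ClipActionWrapper(gym.Wrapper):
    """Hypothetical wrapper: clip actions to the action-space bounds inside
    the environment, so the agent still sees its own unclipped actions and
    the policy gradient is not cut off by clipping in the agent."""

    def step(self, action):
        low = self.env.action_space.low
        high = self.env.action_space.high
        return self.env.step(np.clip(action, low, high))
```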

You could also try clipping the log_std; in the Gaussian distribution you can pass in a min and max for that. Although I'm not sure exactly what that would do to the learning when it's pushing up against the limit.
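
If you'd rather clamp it manually, a minimal sketch of doing so where the model outputs log_std (before the agent builds the distribution); the [-20, 2] bounds are a common heuristic, not rlpyt defaults, and if you use the distribution's own min/max arguments instead, check the exact keyword names on rlpyt.distributions.gaussian.Gaussian in your version:

```python
import torch

# Illustrative bounds only; tune them for your environment.
LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0

def bounded_log_std(log_std: torch.Tensor) -> torch.Tensor:
    # Keep log_std in a fixed range so the policy std cannot explode,
    # while leaving gradients intact inside the range.
    return torch.clamp(log_std, min=LOG_STD_MIN, max=LOG_STD_MAX)
```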

Good luck and let us know if anything works!

jordan-schneider commented 4 years ago

What optimizer/learning rate are you using? If your effective learning rate is too high, you might just be diverging.
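
For reference, a hedged sketch of turning those knobs down when constructing the algorithm; the keyword names (learning_rate, entropy_loss_coeff, clip_grad_norm) are taken from rlpyt's PPO but are worth double-checking against your installed version:

```python
from rlpyt.algos.pg.ppo import PPO

# Assumed keyword names; verify against rlpyt.algos.pg.ppo.PPO.
algo = PPO(
    learning_rate=1e-4,      # try well below the default if updates diverge
    entropy_loss_coeff=0.0,  # rules out the entropy bonus pushing std upward
    clip_grad_norm=1.0,      # also limits the size of destabilizing updates
)
```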