It seems that increasing the hidden layer sizes from (32, 32) to (400, 300) is what causes convergence. Also changed beta & alpha -> 0.1, batch_size -> 100, both lr -> 0.0001, but I suspect the disparity in network capacity is the main reason divergence occurred. Still need to fully track the stats to confirm.
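A minimal sketch of the network-size change described above, assuming a DDPG-style actor/critic setup in PyTorch. The state/action dimensions, the `mlp` helper, and the meaning of alpha/beta are all assumptions for illustration, not taken from the log:

```python
# Hypothetical sketch: swap hidden sizes (32, 32) -> (400, 300).
# STATE_DIM / ACTION_DIM are placeholder values, not from the log.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2

def mlp(in_dim, hidden, out_dim, out_act=None):
    """Simple feed-forward net; `hidden` was (32, 32) before, (400, 300) after."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ReLU()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

# Actor maps state -> bounded action; critic maps (state, action) -> Q-value.
actor = mlp(STATE_DIM, (400, 300), ACTION_DIM, nn.Tanh())
critic = mlp(STATE_DIM + ACTION_DIM, (400, 300), 1)

# Hyperparameters noted above; lr = 0.0001 for both networks.
# (What alpha/beta refer to is ambiguous in the log, so they are omitted here.)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-4)
BATCH_SIZE = 100
```

A quick forward pass on a batch of 100 states confirms the shapes line up before plugging the nets into the training loop.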
-> Policy diverges quickly. Since the gradients have been fixed (hopefully), the main suspects are probably one of these (or a combination):