Ok so two things with gSDE:
When training with SAC+gSDE in sb3, if there is no wrapper to avoid the motor dead zone (roughly -4V to +4V), the action during the first episodes is too small and doesn't make the motor move. Then it suddenly becomes too big, which is a problem because the motor spins too fast and we terminate the episode.
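For reference, here's the kind of wrapper I mean (a minimal sketch; the class name, the 0.33 default, and the Box(-1, 1) assumption are all made up, and it uses gymnasium; swap for gym if that's what the repo uses):

```python
import gymnasium as gym
import numpy as np

class DeadZoneWrapper(gym.ActionWrapper):
    """Remap actions so their magnitude always clears the motor dead zone.

    Assumes a Box(-1, 1) action space that the env converts to volts.
    `dead_zone` is the fraction of the action range that produces no
    motion (e.g. 4V / max_volts).
    """

    def __init__(self, env, dead_zone: float = 0.33):
        super().__init__(env)
        self.dead_zone = dead_zone

    def action(self, act):
        # |a| in (0, 1] -> [dead_zone, 1], sign preserved, 0 stays 0
        return np.sign(act) * (self.dead_zone + (1 - self.dead_zone) * np.abs(act))
```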
Dreamer uses entropy regularization to control exploration. However, looking at the code for gSDE in sb3, it seems like we can't compute the entropy in closed form if we bound the action + noise using tanh.
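For reference, this is roughly the tanh change-of-variables trick (as in the SAC paper and sb3's SquashedDiagGaussianDistribution): squashing adds a -log(1 - tanh(u)^2) term to the log-prob, so there's no closed-form entropy and the best we can do is estimate it with -log_prob. Minimal sketch, assuming a diagonal Gaussian:

```python
import torch
from torch.distributions import Normal

def squashed_log_prob(mean, log_std, eps=1e-6):
    """Sample a tanh-squashed Gaussian action and return its log-prob.

    The change of variables a = tanh(u), u ~ N(mean, std) adds a
    -log(1 - tanh(u)^2) correction, so the entropy has no closed form;
    it gets estimated as -log_prob(a).mean() over the batch instead.
    """
    dist = Normal(mean, log_std.exp())
    u = dist.rsample()  # reparameterized, keeps gradients
    a = torch.tanh(u)
    log_prob = dist.log_prob(u) - torch.log(1 - a.pow(2) + eps)
    return a, log_prob.sum(-1)  # sum over action dims
```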
Debug plan for gSDE:
Ok so even before gSDE, it seems I can train on Pendulum-v1 and get it to converge with continuous actions, but it doesn't work on FurutaSim-v0. Discretizing the action worked, so I'm thinking there is an issue with the gradient computation for the continuous action. Maybe it's an issue with the way we compute and sum the entropy?
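One thing worth ruling out on the gradient side (just a guess, not something I've confirmed in our code): if the actor loss backprops through the action, the action has to come from rsample(); .sample() silently detaches the graph and the policy mean never gets a gradient:

```python
import torch
from torch.distributions import Normal

mean = torch.zeros(1, requires_grad=True)
dist = Normal(mean, torch.ones(1))

a_bad = dist.sample()    # detached: no grad flows back to mean
a_good = dist.rsample()  # mean + std * eps: grad flows back to mean

print(a_bad.requires_grad, a_good.requires_grad)  # False True
```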
Looking at the distribution of the actions, the values range from about -2 to 3, which is too large.
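Worth logging this properly; something like this works if a wandb run is already initialized (where `actions` comes from, i.e. the batch of actions actually sent to the env, is an assumption about where we'd hook it):

```python
import wandb

# actions: np.ndarray of the actions sent to the env this epoch
wandb.log({"debug/actions": wandb.Histogram(actions)})
```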
Ok so I'm trying to get continuous actions to work on the FurutaSim-v0 env. It works perfectly if we discretize the action space (wandb run).
Continuous actions also work on the Pendulum-v1 env, with or without gSDE. With gSDE it looks like the action is either ~-1.0 or ~1.0, while the distribution is smoother without. With gSDE I haven't logged and can't remember if I squashed the actions with tanh and estimated the entropy with -log_prob.mean(), or if I clipped the action and used Normal(mean, std).entropy().
But I'm thinking it is an entropy regularization issue, because the distribution of actions looks pretty much binary, which means the agent always chooses actions that are too big; in the case of the pendulum this violates the speed limits pretty quickly. With discrete actions, entropy regularization makes the agent take noticeably different actions. With a normal distribution, the entropy depends only on the std, so it guarantees some exploration, but if the agent always pushes the mean to -1 or 1 that doesn't solve the problem.
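Quick check of the "entropy depends only on the std" point: for a Gaussian, H = 0.5 * log(2*pi*e*sigma^2), so pushing the mean to -1 or 1 costs nothing under Normal(mean, std).entropy(); only the squashed -log_prob estimate penalizes saturation:

```python
import torch
from torch.distributions import Normal

std = torch.tensor(0.5)
h0 = Normal(torch.tensor(0.0), std).entropy()
h1 = Normal(torch.tensor(5.0), std).entropy()
print(torch.allclose(h0, h1))  # True: the mean doesn't change the entropy
```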
I tried using gSDE without the entropy regularizer but ended up having the same issue of actions being 'too big'. I stopped terminating the episode when the speed was too high; the episodes started getting longer, but it still didn't converge to a working policy and again the actions were too big (wandb run). I also tried multiplying the action by 0.55 because I thought maybe there isn't enough friction in the sim or something, but that's dumb because it works with discrete actions.
Could it be another issue? Maybe the way we pass gradients? Should the std for the gSDE noise be bigger to explore more? Should we clip the mean of the policy lower? Currently we clip it at 2.0 for numerical stability, but tanh(2.0) ≈ 0.96, so the squashed mean is already almost saturated.
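Rough numbers for the clip_mean question (just math, not from a run):

```python
import numpy as np

for m in (0.5, 1.0, 2.0, 3.0):
    print(m, np.tanh(m))
# 0.5 -> 0.462, 1.0 -> 0.762, 2.0 -> 0.964, 3.0 -> 0.995
```

So clipping the pre-tanh mean at 2.0 barely constrains the squashed action; a lower clip like 1.0 would actually keep the mean out of the saturated region.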
- pendulum wandb run, entropy estimated with log prob mean (bc we use tanh): https://wandb.ai/armandpl/minidream_dev/runs/f6cdz43u?nw=nwuserarmandpl
- pendulum wandb run, entropy estimated w/ Normal dist, clipping the action, not using tanh:
- pendulum w/ higher init std
- pendulum lower clip_mean
- maybe try another continuous env, e.g. BipedalWalker; maybe it'll bug in a different way that's going to be informative
- register the env with gym (sketch below)
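Minimal registration sketch for that last point (the entry point path and episode length are placeholders, not the real values):

```python
from gymnasium.envs.registration import register

register(
    id="FurutaSim-v0",
    entry_point="furuta.envs:FurutaSim",  # hypothetical module path
    max_episode_steps=500,                # placeholder: use the real horizon
)

# then the env can be created with gym.make("FurutaSim-v0")
```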