Armandpl / dreamerv3

DreamerV3 + gSDE, using pytorch, on a real robot

prepare for training on real robot #11

Open Armandpl opened 5 months ago

Armandpl commented 5 months ago
Armandpl commented 4 months ago

Ok so two things with gSDE:

Armandpl commented 4 months ago

Debug plan for gSDE:

Armandpl commented 4 months ago

Ok so even before gSDE, it seems I can train on Pendulum-v1 and get it to converge with continuous actions, but it doesn't work on FurutaSim-v0. Discretizing the action worked, so I'm thinking there is an issue with the gradient computation for the continuous action. Maybe it is an issue with the way we compute and sum the entropy?

Looking at the distribution of the actions, the values are between -2 and 3, which is too much.
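For reference on the "compute and sum the entropy" question: with a diagonal Gaussian policy head, per-dimension entropies (and log-probs) are normally summed over the action dimension, not averaged. A minimal sketch with illustrative shapes, not taken from this repo:

```python
import torch
from torch.distributions import Normal, Independent

mean = torch.zeros(16, 2)              # (batch, action_dim), illustrative shapes
std = torch.full_like(mean, 0.3)

per_dim = Normal(mean, std)
joint = Independent(per_dim, 1)        # treat the last dim as the event dim

# summing per-dimension entropies over the action dim == the joint entropy
assert torch.allclose(per_dim.entropy().sum(-1), joint.entropy())
```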

Armandpl commented 3 months ago

Ok so I'm trying to get continuous actions to work on the FurutaSim-v0 env. It works perfectly if we discretize the action space (wandb run). Continuous actions also work on the Pendulum-v1 env, with or without gSDE. With gSDE it looks like the action is either ~-1.0 or ~1.0, while the distribution is smoother without. With gSDE I haven't logged it and can't remember whether I squashed the actions with tanh and estimated the entropy with -log_prob.mean(), or clipped the action and used Normal(mean, std).entropy().
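To make those two variants concrete, here is a hedged sketch of both entropy estimates for a 1-D action (standalone, not copied from the repo):

```python
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import TanhTransform

mean = torch.zeros(256, 1)
std = torch.full_like(mean, 0.5)

# Variant 1: squash with tanh; no closed-form entropy, so estimate it from samples
squashed = TransformedDistribution(Normal(mean, std), TanhTransform(cache_size=1))
action = squashed.rsample()                      # lands in (-1, 1)
entropy_est = -squashed.log_prob(action).mean()  # Monte Carlo estimate

# Variant 2: no squash; clip the sampled action and use the exact Normal entropy
dist = Normal(mean, std)
action = dist.rsample().clamp(-1.0, 1.0)
entropy_exact = dist.entropy().mean()
```

The two regularize different things: variant 1 penalizes the squashed distribution (the log-prob includes the tanh jacobian), while variant 2 ignores the clipping entirely, so they can behave differently once the mean saturates.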

But I'm thinking it is an entropy regularization issue, because the distribution of actions looks pretty much binary, which means the agent always chooses actions that are too big, and in the case of the pendulum this violates the speed limits pretty quickly. In the case of discrete actions, regularizing the entropy makes the agent take a wider variety of actions. In the case of a normal distribution the entropy depends only on the std? So it makes sure there is some exploration, but if the agent always chooses the mean to be -1 or 1 it doesn't solve the problem??
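Quick sanity check on that point (standalone snippet): the entropy of an unsquashed Normal really is independent of the mean, so it can't penalize the mean saturating at ±1.

```python
import torch
from torch.distributions import Normal

std = torch.tensor(0.5)
# entropy = 0.5 * log(2 * pi * e * std^2), independent of the mean
print(Normal(torch.tensor(0.0), std).entropy())  # ~0.7258
print(Normal(torch.tensor(3.0), std).entropy())  # same value
```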

I tried using gSDE without the entropy regularizer but ended up having the same issue of actions being 'too big'. I stopped terminating the episode when the speed was too high and the episodes started getting longer, but it still didn't converge to a working policy, and again the actions were too big (wandb run). I also tried multiplying the action by 0.55 because I thought maybe there isn't enough friction in the sim or something, but that's dumb because it works with discrete actions.

Could it be another issue? Maybe the way we pass gradients? Should the std for the gSDE noise be bigger to explore more? Should we clip the mean of the policy lower? Currently we clip it at 2.0 for numerical stability, but tanh(2.0) is already ~1.0.
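On the clip_mean question, a quick look at how fast tanh saturates (values other than the 2.0 mentioned above are just illustrative):

```python
import torch

mean = torch.tensor([0.5, 1.0, 2.0])
print(torch.tanh(mean))  # tensor([0.4621, 0.7616, 0.9640])
# clipping the mean at 2.0 still lets the squashed action sit at ~0.96,
# so a much lower clip would be needed to actually bound it away from ±1
```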

Pendulum wandb runs:

- entropy estimated with log prob mean because we use tanh: https://wandb.ai/armandpl/minidream_dev/runs/f6cdz43u?nw=nwuserarmandpl
- entropy estimated with the Normal dist, clipping the action, not using tanh
- higher init std
- lower clip_mean

Maybe try another continuous env, e.g. BipedalWalker; maybe it'll break in a different way that's going to be informative.
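If trying that, the swap should be a one-liner, assuming a gymnasium-style API (the actual env setup in this repo may differ):

```python
import gymnasium as gym

env = gym.make("BipedalWalker-v3")  # 4-D continuous actions in [-1, 1]
```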