Ok so two things with gSDE:
When training with SAC+gSDE in sb3, if there is no wrapper to avoid the motor dead zone (roughly -4V to +4V), the action during the first episodes is too small and doesn't make the motor move. Then it suddenly becomes too big, which is a problem because the motor spins too fast and we terminate the episode.
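For reference, here's the kind of wrapper I mean (a minimal sketch; the class name, the 0.33 default, and the Box(-1, 1) assumption are all made up, and it uses gymnasium; swap for gym if that's what the repo uses):

```python
import gymnasium as gym
import numpy as np

class DeadZoneWrapper(gym.ActionWrapper):
    """Remap actions so their magnitude always clears the motor dead zone.

    Assumes a Box(-1, 1) action space that the env converts to volts.
    `dead_zone` is the fraction of the action range that produces no
    motion (e.g. 4V / max_volts).
    """

    def __init__(self, env, dead_zone: float = 0.33):
        super().__init__(env)
        self.dead_zone = dead_zone

    def action(self, act):
        # |a| in (0, 1] -> [dead_zone, 1], sign preserved, 0 stays 0
        return np.sign(act) * (self.dead_zone + (1 - self.dead_zone) * np.abs(act))
```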
Dreamer uses entropy regularization to control exploration. However, looking at the code for gSDE in sb3, it seems like we can't compute the entropy in closed form if we bound the action + noise using tanh.
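For reference, this is roughly the tanh change-of-variables trick (as in the SAC paper and sb3's SquashedDiagGaussianDistribution): squashing adds a -log(1 - tanh(u)^2) term to the log-prob, so there's no closed-form entropy and the best we can do is estimate it with -log_prob. Minimal sketch, assuming a diagonal Gaussian:

```python
import torch
from torch.distributions import Normal

def squashed_log_prob(mean, log_std, eps=1e-6):
    """Sample a tanh-squashed Gaussian action and return its log-prob.

    The change of variables a = tanh(u), u ~ N(mean, std) adds a
    -log(1 - tanh(u)^2) correction, so the entropy has no closed form;
    it gets estimated as -log_prob(a).mean() over the batch instead.
    """
    dist = Normal(mean, log_std.exp())
    u = dist.rsample()  # reparameterized, keeps gradients
    a = torch.tanh(u)
    log_prob = dist.log_prob(u) - torch.log(1 - a.pow(2) + eps)
    return a, log_prob.sum(-1)  # sum over action dims
```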
Debug plan for gSDE:
Ok so even before gSDE, it seems I can train on Pendulum-v1 and get it to converge with continuous actions, but it doesn't work on FurutaSim-v0. Discretizing the action worked, so I'm thinking there is an issue with the gradient computation for the continuous action. Maybe it's an issue with the way we compute and sum the entropy?
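One thing worth ruling out on the gradient side (just a guess, not something I've confirmed in our code): if the actor loss backprops through the action, the action has to come from rsample(); .sample() silently detaches the graph and the policy mean never gets a gradient:

```python
import torch
from torch.distributions import Normal

mean = torch.zeros(1, requires_grad=True)
dist = Normal(mean, torch.ones(1))

a_bad = dist.sample()    # detached: no grad flows back to mean
a_good = dist.rsample()  # mean + std * eps: grad flows back to mean

print(a_bad.requires_grad, a_good.requires_grad)  # False True
```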
Looking at the distribution of the actions, the values range from about -2 to 3, which is too large.
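Worth logging this properly; something like this works if a wandb run is already initialized (where `actions` comes from, i.e. the batch of actions actually sent to the env, is an assumption about where we'd hook it):

```python
import wandb

# actions: np.ndarray of the actions sent to the env this epoch
wandb.log({"debug/actions": wandb.Histogram(actions)})
```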
Ok so I'm trying to get continuous actions to work on the FurutaSim-v0 env. It works perfectly if we discretize the action space (wandb run).
Continuous actions also work on the Pendulum-v1 env, with or without gSDE. With gSDE it looks like the action is either ~-1.0 or ~1.0, while the distribution is smoother without. With gSDE I haven't logged and can't remember if I squashed the actions with tanh and estimated the entropy with -log_prob.mean(), or if I clipped the action and used Normal(mean, std).entropy().
But I'm thinking it is an entropy regularization issue, because the distribution of actions looks pretty much binary, which means the agent always chooses actions that are too big; in the case of the pendulum this violates the speed limits pretty quickly. With discrete actions, entropy regularization makes the agent take noticeably different actions. With a normal distribution, the entropy depends only on the std, so it guarantees some exploration, but if the agent always pushes the mean to -1 or 1 that doesn't solve the problem.
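Quick check of the "entropy depends only on the std" point: for a Gaussian, H = 0.5 * log(2*pi*e*sigma^2), so pushing the mean to -1 or 1 costs nothing under Normal(mean, std).entropy(); only the squashed -log_prob estimate penalizes saturation:

```python
import torch
from torch.distributions import Normal

std = torch.tensor(0.5)
h0 = Normal(torch.tensor(0.0), std).entropy()
h1 = Normal(torch.tensor(5.0), std).entropy()
print(torch.allclose(h0, h1))  # True: the mean doesn't change the entropy
```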
I tried using gSDE without the entropy regularizer but ended up having the same issue of actions being 'too big'. I stopped terminating the episode when the speed was too high; the episodes started getting longer, but it still didn't converge to a working policy and again the actions were too big (wandb run). I also tried multiplying the action by 0.55 because I thought maybe there isn't enough friction in the sim or something, but that's dumb because it works with discrete actions.
Could it be another issue? Maybe the way we pass gradients? Should the std for the gSDE noise be bigger to explore more? Should we clip the mean of the policy lower? Currently we clip it at 2.0 for numerical stability, but tanh(2.0) ≈ 0.96, so the squashed mean is already almost saturated.
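Rough numbers for the clip_mean question (just math, not from a run):

```python
import numpy as np

for m in (0.5, 1.0, 2.0, 3.0):
    print(m, np.tanh(m))
# 0.5 -> 0.462, 1.0 -> 0.762, 2.0 -> 0.964, 3.0 -> 0.995
```

So clipping the pre-tanh mean at 2.0 barely constrains the squashed action; a lower clip like 1.0 would actually keep the mean out of the saturated region.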
- pendulum wandb run, entropy estimated with log prob mean (bc we use tanh): https://wandb.ai/armandpl/minidream_dev/runs/f6cdz43u?nw=nwuserarmandpl
- pendulum wandb run, entropy estimated w/ Normal dist, clipping the action, not using tanh:
- pendulum w/ higher init std
- pendulum lower clip_mean
- maybe try another continuous env, e.g. BipedalWalker; maybe it'll bug in a different way that's going to be informative
- register the env with gym (sketch below)
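Minimal registration sketch for that last point (the entry point path and episode length are placeholders, not the real values):

```python
from gymnasium.envs.registration import register

register(
    id="FurutaSim-v0",
    entry_point="furuta.envs:FurutaSim",  # hypothetical module path
    max_episode_steps=500,                # placeholder: use the real horizon
)

# then the env can be created with gym.make("FurutaSim-v0")
```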