DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] How easily can RL networks learn a constant or a 0-output? #909

Closed: user-1701 closed this issue 2 years ago

user-1701 commented 2 years ago

Question

I have been experimenting for quite some time with a torque-controlled robot that should simply stand, i.e. learn a zero output, but I am not sure whether this is even possible.

Recently I have been trying gSDE with SAC and PPO to learn walking cycles and standing, while explicitly penalizing the actions and the delta actions. (The two behave very differently: after 1 million steps SAC is still jittering, independent of sde_sample_freq; PPO, which according to the original paper is more sensitive to sde_sample_freq, sometimes jitters wildly, followed by extreme actions at the limits.) I tried different hyperparameters and activation functions (Tanh, ReLU, SELU).
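For reference, a minimal sketch of roughly how these runs are set up, assuming a recent SB3 version with Gymnasium. `Pendulum-v1` is only a runnable stand-in for my actual robot environment, `ActionPenaltyWrapper` is an illustrative helper of mine, and the penalty weights are guesses, not values from my project:

```python
import numpy as np
import gymnasium as gym
import torch.nn as nn
from stable_baselines3 import SAC


class ActionPenaltyWrapper(gym.Wrapper):
    """Penalize |action| and |action - previous action| on top of the env reward.
    The weights are illustrative placeholders."""

    def __init__(self, env, w_action=0.1, w_delta=0.1):
        super().__init__(env)
        self.w_action = w_action
        self.w_delta = w_delta
        self._prev_action = np.zeros(env.action_space.shape, dtype=np.float32)

    def reset(self, **kwargs):
        self._prev_action = np.zeros(self.env.action_space.shape, dtype=np.float32)
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Action penalty and delta-action penalty, as described above.
        reward -= self.w_action * np.sum(np.abs(action))
        reward -= self.w_delta * np.sum(np.abs(action - self._prev_action))
        self._prev_action = np.asarray(action, dtype=np.float32)
        return obs, reward, terminated, truncated, info


env = ActionPenaltyWrapper(gym.make("Pendulum-v1"))  # stand-in for the robot env
model = SAC(
    "MlpPolicy",
    env,
    use_sde=True,
    sde_sample_freq=64,  # resample the exploration matrix every 64 steps
    policy_kwargs=dict(activation_fn=nn.Tanh),
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
```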

I also tried hard-coding the locomotion and letting it be modulated by the NN, but in that case too the best the network can do is learn a constant output for controlling the movement cycle, instead of continuously and noisily counteracting a waveform (see the rough sketch below).
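A rough sketch of that residual idea; `gait_cycle()` is a hypothetical hard-coded waveform and everything here is illustrative, not code from my project:

```python
import numpy as np
import gymnasium as gym


def gait_cycle(t, freq=1.0):
    """Hypothetical hard-coded locomotion waveform (placeholder)."""
    return np.array([np.sin(2.0 * np.pi * freq * t)], dtype=np.float32)


class ResidualGaitWrapper(gym.Wrapper):
    """The policy output is added as a residual on top of the hard-coded cycle,
    so standing still again reduces to learning a constant (zero) residual."""

    def __init__(self, env, dt=0.05):
        super().__init__(env)
        self.dt = dt
        self.t = 0.0

    def reset(self, **kwargs):
        self.t = 0.0
        return self.env.reset(**kwargs)

    def step(self, residual):
        # Combine the fixed waveform with the learned correction.
        action = gait_cycle(self.t) + residual
        action = np.clip(action, self.action_space.low, self.action_space.high)
        self.t += self.dt
        return self.env.step(action)
```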

Maybe someone has some thoughts on this?! The alternative solutions also don't feel very robust: introducing a minimum-action threshold, averaging the outputs to make them smoother (sketched below), doing more training and experimentation, or searching for other algorithms on the contrib page or elsewhere that e.g. allow more gSDE refinement.
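For concreteness, the first two workarounds could look something like this wrapper (names and default values are assumptions, not anything from SB3):

```python
import numpy as np
import gymnasium as gym


class SmoothActionWrapper(gym.ActionWrapper):
    """Illustrative sketch of the workaround ideas above: zero out tiny
    actions and low-pass filter the rest before they reach the env."""

    def __init__(self, env, min_action=0.05, smoothing=0.8):
        super().__init__(env)
        self.min_action = min_action
        self.smoothing = smoothing
        self._prev_action = np.zeros(env.action_space.shape, dtype=np.float32)

    def action(self, action):
        # Minimum-action threshold: treat very small commands as zero torque.
        action = np.where(np.abs(action) < self.min_action, 0.0, action)
        # Exponential moving average to smooth consecutive commands.
        action = self.smoothing * self._prev_action + (1.0 - self.smoothing) * action
        self._prev_action = action.astype(np.float32)
        return action

    def reset(self, **kwargs):
        self._prev_action = np.zeros(self.env.action_space.shape, dtype=np.float32)
        return self.env.reset(**kwargs)
```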

user-1701 commented 2 years ago

Maybe the warm-up time was too long for gSDE to kick in, so I will try adjusting this now.
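Concretely, something along these lines, using SAC's learning_starts and use_sde_at_warmup arguments (the values are guesses, and the env is just a placeholder):

```python
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    "Pendulum-v1",           # placeholder for the robot env
    use_sde=True,
    use_sde_at_warmup=True,  # use gSDE instead of uniform random actions during warm-up
    learning_starts=1_000,   # shorter warm-up before gradient updates begin
    verbose=1,
)
model.learn(total_timesteps=100_000)
```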

Also, I understand that noise is necessary for exploration, and that the deterministic=True flag should in principle skip the exploration noise when evaluating.
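For example, during evaluation (a minimal sketch; Pendulum-v1 is just a placeholder env):

```python
import gymnasium as gym
from stable_baselines3 import SAC
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("Pendulum-v1")  # placeholder for the robot env
model = SAC("MlpPolicy", env, use_sde=True, verbose=0)
model.learn(total_timesteps=10_000)

# deterministic=True makes predict() return the mean action, skipping exploration noise
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5, deterministic=True)

obs, _ = env.reset()
for _ in range(200):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```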