DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] How easily can RL networks learn a constant or a 0-output? #909

Closed: user-1701 closed this issue 2 years ago

user-1701 commented 2 years ago

Question

I have been experimenting for quite some time with a torque-controlled robot that should simply stand, i.e. learn a zero output, but I am not sure whether this is even possible.

Recently I have been trying gSDE with SAC and PPO to learn walking cycles and standing, while explicitly penalizing the actions and the delta actions. (The two behave very differently: after 1 million steps SAC is still jittering, independent of sde_sample_freq; PPO, which according to the original paper is more sensitive to sde_sample_freq, sometimes jitters wildly, followed by extreme actions at the limits.) I tried different hyperparameters and activation functions (Tanh, ReLU, SELU).
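For reference, a minimal sketch of roughly how these runs are set up, assuming a recent SB3 version with Gymnasium. `Pendulum-v1` is only a runnable stand-in for my actual robot environment, `ActionPenaltyWrapper` is an illustrative helper of mine, and the penalty weights are guesses, not values from my project:

```python
import numpy as np
import gymnasium as gym
import torch.nn as nn
from stable_baselines3 import SAC


class ActionPenaltyWrapper(gym.Wrapper):
    """Penalize |action| and |action - previous action| on top of the env reward.
    The weights are illustrative placeholders."""

    def __init__(self, env, w_action=0.1, w_delta=0.1):
        super().__init__(env)
        self.w_action = w_action
        self.w_delta = w_delta
        self._prev_action = np.zeros(env.action_space.shape, dtype=np.float32)

    def reset(self, **kwargs):
        self._prev_action = np.zeros(self.env.action_space.shape, dtype=np.float32)
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Action penalty and delta-action penalty, as described above.
        reward -= self.w_action * np.sum(np.abs(action))
        reward -= self.w_delta * np.sum(np.abs(action - self._prev_action))
        self._prev_action = np.asarray(action, dtype=np.float32)
        return obs, reward, terminated, truncated, info


env = ActionPenaltyWrapper(gym.make("Pendulum-v1"))  # stand-in for the robot env
model = SAC(
    "MlpPolicy",
    env,
    use_sde=True,
    sde_sample_freq=64,  # resample the exploration matrix every 64 steps
    policy_kwargs=dict(activation_fn=nn.Tanh),
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
```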

I also tried hard-coding the locomotion and letting it be modulated by the NN, but in that case too the best the network can do is learn a constant output for controlling the movement cycle, instead of continuously and noisily counteracting a waveform (see the rough sketch below).
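A rough sketch of that residual idea; `gait_cycle()` is a hypothetical hard-coded waveform and everything here is illustrative, not code from my project:

```python
import numpy as np
import gymnasium as gym


def gait_cycle(t, freq=1.0):
    """Hypothetical hard-coded locomotion waveform (placeholder)."""
    return np.array([np.sin(2.0 * np.pi * freq * t)], dtype=np.float32)


class ResidualGaitWrapper(gym.Wrapper):
    """The policy output is added as a residual on top of the hard-coded cycle,
    so standing still again reduces to learning a constant (zero) residual."""

    def __init__(self, env, dt=0.05):
        super().__init__(env)
        self.dt = dt
        self.t = 0.0

    def reset(self, **kwargs):
        self.t = 0.0
        return self.env.reset(**kwargs)

    def step(self, residual):
        # Combine the fixed waveform with the learned correction.
        action = gait_cycle(self.t) + residual
        action = np.clip(action, self.action_space.low, self.action_space.high)
        self.t += self.dt
        return self.env.step(action)
```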

Maybe someone has some thoughts on this?! The alternative solutions also don't feel very robust: introducing a minimum-action threshold, averaging the outputs to make them smoother (sketched below), doing more training and experimentation, or searching for other algorithms on the contrib page or elsewhere that e.g. allow more gSDE refinement.
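For concreteness, the first two workarounds could look something like this wrapper (names and default values are assumptions, not anything from SB3):

```python
import numpy as np
import gymnasium as gym


class SmoothActionWrapper(gym.ActionWrapper):
    """Illustrative sketch of the workaround ideas above: zero out tiny
    actions and low-pass filter the rest before they reach the env."""

    def __init__(self, env, min_action=0.05, smoothing=0.8):
        super().__init__(env)
        self.min_action = min_action
        self.smoothing = smoothing
        self._prev_action = np.zeros(env.action_space.shape, dtype=np.float32)

    def action(self, action):
        # Minimum-action threshold: treat very small commands as zero torque.
        action = np.where(np.abs(action) < self.min_action, 0.0, action)
        # Exponential moving average to smooth consecutive commands.
        action = self.smoothing * self._prev_action + (1.0 - self.smoothing) * action
        self._prev_action = action.astype(np.float32)
        return action

    def reset(self, **kwargs):
        self._prev_action = np.zeros(self.env.action_space.shape, dtype=np.float32)
        return self.env.reset(**kwargs)
```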

user-1701 commented 2 years ago

Maybe the warm-up time was too long for gSDE to kick in, so I will try adjusting this now.
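Concretely, something along these lines, using SAC's learning_starts and use_sde_at_warmup arguments (the values are guesses, and the env is just a placeholder):

```python
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    "Pendulum-v1",           # placeholder for the robot env
    use_sde=True,
    use_sde_at_warmup=True,  # use gSDE instead of uniform random actions during warm-up
    learning_starts=1_000,   # shorter warm-up before gradient updates begin
    verbose=1,
)
model.learn(total_timesteps=100_000)
```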

Also, I understand that noise is necessary for exploration, and that the deterministic=True flag should in principle skip the exploration noise when evaluating.
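For example, during evaluation (a minimal sketch; Pendulum-v1 is just a placeholder env):

```python
import gymnasium as gym
from stable_baselines3 import SAC
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("Pendulum-v1")  # placeholder for the robot env
model = SAC("MlpPolicy", env, use_sde=True, verbose=0)
model.learn(total_timesteps=10_000)

# deterministic=True makes predict() return the mean action, skipping exploration noise
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5, deterministic=True)

obs, _ = env.reset()
for _ in range(200):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```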