Closed johannespitz closed 12 months ago
Hi @johannespitz There are 2 general ideas here: 1) a negative sum of squared actions is often added to the reward to make the robot move more smoothly and reduce energy consumption. In many cases something differentiable is better than something non-differentiable. 2) Depending on the env/robot, some actions push towards -1 or 1 only. And if I don't do squashing, the actual values returned as mu can diverge a little bit, which is bad in general. You can see this behavior in my IsaacGym fork, where I report the action distribution to TensorBoard.
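The action penalty from point 1 is typically just a differentiable term subtracted from the task reward; a minimal sketch (function name and coefficient are illustrative, not from rl_games):

```python
import torch

def action_penalty_reward(task_reward, actions, coef=0.01):
    # Subtract a penalty proportional to the sum of squared actions.
    # Encourages smoother, lower-energy motion and is fully differentiable,
    # unlike e.g. a hard clip-count penalty.
    return task_reward - coef * (actions ** 2).sum(dim=-1)
```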
I think I tested all possible distributions for IsaacGym, and this one (return mu as-is, with logstd independent of the obs vector) was the best. I tested the squashed normal, the truncated normal, and the beta distribution (it was unstable in the long run, whatever I tried). But if you want to try something new, it should be relatively easy: you can create your own ModelA2CContinuous class and test your ideas.
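The distribution described above (unsquashed mu, state-independent logstd) can be sketched as a small policy head; this is an illustration of the idea, not the actual rl_games ModelA2CContinuous code:

```python
import torch
import torch.nn as nn

class GaussianPolicyHead(nn.Module):
    # mu comes straight from a linear layer with no tanh squashing;
    # log-std is a learned parameter that does not depend on the observation.
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, features):
        mu = self.mu(features)                  # returned as-is
        std = self.log_std.exp().expand_as(mu)  # state-independent std
        return torch.distributions.Normal(mu, std)
```

Sampled actions are then clipped to the valid range by the environment, which is why the regularization on mu discussed below matters.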
Thank you for the detailed answer! If you ever publish something or come across other work that studies action clipping in detail, I'd be very interested. But for now I guess I'll close the issue. Thanks again.
Hi, in the continuous PPO implementation you have two types of regularization that, as far as I understand, prevent weird effects caused by the action clipping that is often necessary when sampling from a normal distribution: https://github.com/Denys88/rl_games/blob/fe95913f5b42dc39869da1924188c7601d3cf133/rl_games/algos_torch/a2c_continuous.py#L164
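For context, one common form of such regularization is a "bound loss" that penalizes the pre-clip mean only when it drifts outside a soft bound around the valid action range; a sketch (the soft-bound constant and exact form are illustrative, see the linked source for the real implementation):

```python
import torch

def bounds_loss(mu, soft_bound=1.1):
    # Quadratic penalty on the policy mean outside [-soft_bound, soft_bound];
    # zero inside the bound, so it only activates when mu drifts
    # past the clipped action range.
    loss_high = torch.clamp_min(mu - soft_bound, 0.0) ** 2
    loss_low = torch.clamp_max(mu + soft_bound, 0.0) ** 2
    return (loss_high + loss_low).sum(dim=-1)
```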
Are there any publications that introduce them, or do you have empirical data, or at least an intuition for when each type of regularization works best?
I see (#153, #89) that you are not a fan of tanh squashing. Have you experimented with truncated normal distributions (https://en.wikipedia.org/wiki/Truncated_normal_distribution)?
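For anyone who wants to experiment with this: a normal distribution truncated to [-1, 1] can be sampled without rejection via the inverse CDF. A sketch under that assumption (not code from rl_games):

```python
import torch

def sample_truncated_normal(mu, std, low=-1.0, high=1.0):
    # Inverse-CDF sampling: draw a uniform sample restricted to the
    # normal CDF mass between [low, high], then map it back through icdf.
    normal = torch.distributions.Normal(mu, std)
    cdf_low = normal.cdf(torch.as_tensor(low))
    cdf_high = normal.cdf(torch.as_tensor(high))
    u = torch.rand_like(mu) * (cdf_high - cdf_low) + cdf_low
    return normal.icdf(u)
```

Every sample lands inside [low, high], so no post-hoc clipping of actions is needed.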