DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Bug]: Action Space 'clipped' at 1 in basic cases clips many actions at the beginning of training. #1127

Closed arminvburren closed 1 year ago

arminvburren commented 2 years ago

🐛 Bug

I'm filing this as a bug, though it seems to be partly intentional behaviour. That said, I think it can have a significant (even dire) impact on some trainings.

When a basic env is created with an action space such as:

import gym
import numpy as np

class MyGYM(gym.Env):
    def __init__(self):
        self.action_space = gym.spaces.Box(low=-np.ones(8), high=np.ones(8))

and a basic PPO training setup such as:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

envs = make_vec_env(MyGYM, n_envs=int(args.nb_env))
model = PPO("MlpPolicy",
            envs,
            batch_size=128,
            n_steps=1024,
            )

Then many (roughly 25-30%) of the initial actions are at the bounds, in this example at -1.0 or 1.0. This is because the MLP policy's outputs are roughly Gaussian with a standard deviation of 1, so many of them land outside the limits, and Stable-Baselines3 seems to clip them.
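
(A rough sanity check of that percentage, not part of the original report: with the default initial log standard deviation of 0, each action dimension is roughly a standard normal around 0, and about 31.7% of its mass lies outside [-1, 1].)

import numpy as np

# Rough check: fraction of samples from a mean-0, std-1 Gaussian that fall
# outside the [-1, 1] bounds of the Box space (about 31.7%).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
print(np.mean(np.abs(samples) > 1.0))  # ~0.317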

This seems quite bad to me, since all these clipped values, possibly as many as 30%, mean that the corresponding inputs basically have no gradient, and it's a big loss of information at the beginning of training.

It is also not possible to manually rescale before the clipping, since the clipping seems to be applied directly, before the action even reaches the step function.
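
(For context, a paraphrased sketch of the clipping step as it happens during on-policy rollout collection; this is not a quote of the library's source.)

import numpy as np
import gym

# Sketch (paraphrased, not the actual stable-baselines3 code): for Box spaces,
# the sampled actions are clipped to the space bounds right before env.step()
# is called, so the environment only ever receives the already-clipped values.
space = gym.spaces.Box(low=-np.ones(8, dtype=np.float32), high=np.ones(8, dtype=np.float32))
raw_action = np.random.normal(size=8)                        # unbounded Gaussian sample
clipped_action = np.clip(raw_action, space.low, space.high)  # what the env would receive
print(np.sum(clipped_action != raw_action), "of 8 dimensions were clipped")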

I also noted that this was not the case in stable-baselines3 == 1.5.0. In particular, it used to be possible to use an infinite action space such as

class MyGYM(gym.Env):
    def __init__(self):
        self.action_space = gym.spaces.Box(low=-np.ones(8) * np.inf,
                                           high=np.ones(8) * np.inf)

which is no longer possible, and I can't really find a valid reason why that should be so, since I've worked with infinite action spaces in many successful RL settings.

What is the reasoning behind ending up with up to 25-30% of the initial network outputs clipped? Is this a bug? If we want to bound the actions, it would be far better to use functions such as atan or tanh, which actually bound the values while keeping at least a small difference between actions that depends on the inputs of the underlying network.

Regards

To Reproduce

Relevant log output / Error message

No response

System Info

No response

Checklist

araffin commented 2 years ago

Then many (roughly 25-30%) of the initial actions are at the bounds

What do you call initial values? The values before the first update (here, 1024 steps)?

I have read the documentation

Have you read the "Why should I normalize the action space?" part? https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html#tips-and-tricks-when-creating-a-custom-environment

If we want to bound the actions, it would be far better to use functions such as atan or tanh, which actually bound the values while keeping...

You can do that when using gSDE (use_sde=True and squash_output=True in the policy keyword arguments).

You can also change the initial standard deviation by specifying log_std_init=-2 for instance to the policy kwargs.
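
For reference, a minimal sketch combining both suggestions ("Pendulum-v1" is only a placeholder environment with a Box action space):

from stable_baselines3 import PPO

# Sketch: gSDE with a tanh-squashed output so actions stay in bounds,
# plus a smaller initial standard deviation.
model = PPO(
    "MlpPolicy",
    "Pendulum-v1",  # placeholder env, not from the thread
    use_sde=True,
    policy_kwargs=dict(squash_output=True, log_std_init=-2),
)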

In particular, it used to be possible to use an infinite action space

this was a bug, see https://github.com/DLR-RM/stable-baselines3/issues/897

and it's a big loss of information at the beginning of training.

Did you check how much it is actually impacting performance? (my guess would be not much as the initial policy is random)

arminvburren commented 2 years ago

What do you call initial values? The values before the first update (here, 1024 steps)?

Yes.

Have you read the "Why should I normalize the action space?" part?

Yes, and I'm not that clueless in ML. My point here actually falls under these more advanced questions:

  • "Why should I take care that all my inputs have at least a small impact on the loss / reward?" The answer would be "because it enables a gradient to be properly defined for every input and helps the optimization", or even
  • "Why should my episode statistics stay within healthy values?" Once again, because it helps the optimization to have at least as many good examples as bad ones and to update the weights properly. In particular, in my case the env just draws 2-point lines on an image, and most of my points end up at the bounds of the image! This makes the job very hard for gradient descent, and the data is dominated by actions / states that I would not want to see and would rather penalize. But I can't even penalize them, since they make up 30% of the data and result from many different types of inputs.

this was a bug, see https://github.com/DLR-RM/stable-baselines3/issues/897

Reading the discussion, I'm not convinced this was a "bug" at all, as he claims. His tensors go to NaN because his networks have no regularisation. That is the case in many ML applications, and regularisation is one thing you can't really do without on most tasks/datasets. In this case it would also be a lot cleaner to do that than to clip the action space.

Did you check how much it is actually impacting performance? (my guess would be not much as the initial policy is random)

I really don't have too much time to do a benchmark, but my guess is that it's important. The policy is random, but if up to 30% of the initial actions are out of bounds you are missing that many gradients, and the initial actions are just badly distributed. You have many actions out of bounds, and even if you put regularisation on the actions, gradient descent won't even know which way to move the weights to get them IN THE BOUNDS.

araffin commented 2 years ago

I really don't have too much time to do a benchmark, but my guess is that it's important

You can quickly try the different solutions I pointed out.

You can also use SAC/TD3/TQC instead, which are usually better suited for continuous control problems, are more sample efficient (which seems to matter for your problem), and handle the action space boundaries.
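
For reference, a minimal sketch of that alternative ("Pendulum-v1" is only a placeholder environment; TQC is provided by the separate sb3-contrib package):

from stable_baselines3 import SAC

# Sketch: SAC squashes its actions with tanh by construction, so the
# Box bounds are respected without any external clipping.
model = SAC("MlpPolicy", "Pendulum-v1")  # placeholder env with a Box action space
model.learn(total_timesteps=10_000)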

won't even know which way to move the weights to get them IN THE BOUNDS.

You probably noticed it already, but the gradient will make the standard deviation smaller (making the policy more deterministic), which in turn will make the actions more likely to be within the bounds.

By the way, the actions stored in the rollout buffer are not the clipped ones but the unbounded ones, so there should not be an issue with the gradient.

ReHoss commented 2 years ago


Hi, I identified this problem too. Even with activation functions such as sigmoid or tanh, this behaviour is likely to occur. I managed to get more stable trainings by normalising both the action space and the state space to [-1, 1] with a min-max normalisation, but I agree with you that it is not that easy to handle. For instance, I must clip/bound the action space, otherwise my environment (a controlled Kuramoto-Sivashinsky PDE) would return NaNs.
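
A small sketch of the kind of min-max rescaling described above (the bounds and function names are illustrative, not from the thread):

import numpy as np

def to_unit_range(x, low, high):
    # Min-max rescaling from [low, high] to [-1, 1], element-wise.
    return 2.0 * (x - low) / (high - low) - 1.0

def from_unit_range(x, low, high):
    # Inverse mapping, from [-1, 1] back to the physical range [low, high].
    return low + 0.5 * (x + 1.0) * (high - low)

# Illustrative bounds, chosen arbitrarily for the example:
low, high = np.array([0.0, -5.0]), np.array([10.0, 5.0])
print(to_unit_range(np.array([5.0, 0.0]), low, high))    # [0. 0.]
print(from_unit_range(np.array([0.0, 0.0]), low, high))  # [5. 0.]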

I agree with your two refined questions, especially on the statistical distribution of the inputs/outputs. Even without any clipping, tanh/sigmoid and other squashing activation functions would trigger vanishing gradients, because at some point the precision is too low to discriminate between samples (input data) beyond some threshold.

Finally, I think the way stable_baselines3 implements these algorithms is very reliable. However, I recommend that you perform normalisation and analyse how it affects the distribution of the predictions just after the actor network has been initialised.
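
A rough sketch of that kind of check (the environment name is a placeholder, not from the thread):

import numpy as np
from stable_baselines3 import PPO

# Inspect the prediction distribution right after the policy has been
# initialised, before any training ("Pendulum-v1" is just a placeholder
# env with a Box(-2, 2) action space).
model = PPO("MlpPolicy", "Pendulum-v1")
obs = model.env.reset()
actions = np.array([model.predict(obs, deterministic=False)[0] for _ in range(1000)])
at_bounds = np.isclose(np.abs(actions), model.env.action_space.high)
# With bounds at +/- 2 and an initial std around 1 this fraction is small;
# for a [-1, 1] space like the one in this issue it is closer to 30%.
print("fraction of predictions clipped to the bounds:", at_bounds.mean())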

Reward shape could also affect learning.