DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Why does unscaling action behaves differently in training and eval #1592

Closed · akane0314 closed this issue 1 year ago

akane0314 commented 1 year ago

❓ Question

I'm using PPO with the squash_output=True option. It looks like statistics (e.g., average reward, episode length) differ between collecting rollouts and evaluating the policy. After digging in, I found this is caused by the different handling of action unscaling in collect_rollouts() and in the predict() function, which are used during training and evaluation, respectively. Why is unscale_action not applied in collect_rollouts()?

https://github.com/DLR-RM/stable-baselines3/blob/d68ff2e17f2f823e6f48d9eb9cee28ca563a2554/stable_baselines3/common/policies.py#L352-L361

https://github.com/DLR-RM/stable-baselines3/blob/d68ff2e17f2f823e6f48d9eb9cee28ca563a2554/stable_baselines3/common/on_policy_algorithm.py#L174-L177
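For reference, here is roughly how the discrepancy can be reproduced. The env, bounds, and hyperparameters below are made up for illustration (not my actual setup), and it assumes SB3 2.x with Gymnasium:

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


class ToyEnv(gym.Env):
    """Toy continuous env whose action bounds are not [-1, 1]."""

    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.action_space = spaces.Box(low=-2.0, high=2.0, shape=(1,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._t = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self._t += 1
        # The reward depends on the raw action, so scaled vs. unscaled
        # actions lead to different episode statistics.
        reward = -float(np.abs(action).sum())
        return self.observation_space.sample(), reward, False, self._t >= 20, {}


env = ToyEnv()
# use_sde=True since squash_output is meant to be combined with gSDE for PPO
model = PPO("MlpPolicy", env, use_sde=True, policy_kwargs=dict(squash_output=True), n_steps=64)
model.learn(total_timesteps=256)  # rollout statistics here come from the scaled actions

# evaluate_policy() goes through predict(), which calls unscale_action()
mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=5)
print(mean_reward)
```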


araffin commented 1 year ago

Hello, I assume you are using a custom environment whose action boundaries are not [-1, 1]? (The env checker should have warned you about that.) But yes, it looks like a bug.
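For reference, running the env checker on a custom env will surface that warning; a minimal sketch (reusing the hypothetical ToyEnv from the question above):

```python
from stable_baselines3.common.env_checker import check_env

# Warns (among other checks) when a Box action space is not a
# symmetric, normalized [-1, 1] range.
check_env(ToyEnv(), warn=True)
```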

akane0314 commented 1 year ago

Yes, I'm using a custom env and the boundaries are not [-1, 1].

MikhailGerasimov commented 1 year ago

There is a related bug (or possibly a feature?) in SB3: it attempts to unsquash actions even when I use a distribution that does not support squashing. For example:

model = PPO("MlpPolicy", env, policy_kwargs=dict(squash_output=True))

Although my action space is bounded between -1 and 1, because of this unsquashing step I receive values in env.step() that go beyond the [-1, 1] boundaries.
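For what it's worth, the mechanism seems to be the final Box post-processing in predict(): with squash_output=True the action is linearly "unscaled" instead of clipped, so samples from an unbounded Gaussian pass straight through. A rough sketch of that logic (my own simplification, not a verbatim copy of the SB3 source):

```python
import numpy as np


def postprocess_box_action(action, low, high, squash_output):
    """Rough sketch of the Box action post-processing at the end of predict()."""
    if squash_output:
        # Assumes the network output was already squashed to [-1, 1] by a tanh,
        # which is not true for a plain (non-squashed) Gaussian policy.
        return low + 0.5 * (action + 1.0) * (high - low)
    # Without squashing, out-of-bound samples are simply clipped.
    return np.clip(action, low, high)


# With low = -1 and high = 1, the "unscaling" branch is the identity,
# so a Gaussian sample like 1.7 reaches env.step() out of bounds:
print(postprocess_box_action(np.array([1.7]), -1.0, 1.0, squash_output=True))   # [1.7]
print(postprocess_box_action(np.array([1.7]), -1.0, 1.0, squash_output=False))  # [1.]
```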

araffin commented 1 year ago

Hello, I would welcome a PR that solves both issues ;) (For the second one, we should not allow squash_output=True when not using gSDE.)

ReHoss commented 1 year ago

Hi,

Correct me if I am wrong, but the documentation's "squash_output: Whether to squash the output using a tanh function" is misleading: when squash_output=True, the $\tanh$ non-linearity is only applied through create_mlp, which is only called by the ContinuousCritic and Actor objects in the source code, so it excludes PPO for instance. Instead, I only observe a simpler rescaling via unscale_action in predict when squash_output=True.

If I understand correctly, the first issue is the following:

It looks like the squash_output argument is almost useless for on-policy algorithms in the code, and was developed for experience-replay methods (see create_mlp in the off-policy methods)?

Would replacing predict with the implicit low-level forward call in the evaluation be a solution? I am not sure, since off-policy algorithms are truly squashed.

I don't think modifying collect_rollouts to call unscale_action is relevant, since in the on-policy case there is nothing to truly unsquash (no squashing non-linearity is applied to the policy). It would make more sense in the squashed-Gaussian case, as used by SAC. Otherwise, the risk would be to unscale a distribution with unbounded support. A solution would be to unscale after the action clipping in the collect_rollouts method, as sketched below, but it is not very elegant and is inconsistent with the off-policy implementation: we would be using the clipping in collect_rollouts to simulate a distribution with bounded support, while activation functions are used in all the other cases.
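To make that idea concrete, a hypothetical patch to collect_rollouts() could look roughly like this (untested sketch; np and spaces are the module-level imports of the linked file):

```python
# Inside OnPolicyAlgorithm.collect_rollouts(), before env.step() (sketch only):
clipped_actions = actions

if isinstance(self.action_space, spaces.Box):
    if self.policy.squash_output:
        # Hypothetical: map the [-1, 1] network output back to the env
        # bounds, mirroring what predict() does at evaluation time.
        clipped_actions = self.policy.unscale_action(np.clip(actions, -1.0, 1.0))
    else:
        # Current behaviour: clip the raw Gaussian samples to the env bounds.
        clipped_actions = np.clip(actions, self.action_space.low, self.action_space.high)

new_obs, rewards, dones, infos = env.step(clipped_actions)
```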

After thinking it through, it seems the true bug is that on-policy methods do not really squash the outputs? Another solution would be to remove squash_output in the on-policy case, which would somewhat simplify the code.

Best,

araffin commented 1 year ago

> It looks like the squash_output argument is almost useless for on-policy algorithms in the code, and was developed for experience-replay methods (see create_mlp in the off-policy methods)?

Actually, the tanh comes from the distributions (squashed Gaussian and gSDE):

https://github.com/DLR-RM/stable-baselines3/blob/f4ec0f6afa1a23b0e0b746174cd0074471cc0b89/stable_baselines3/common/distributions.py#L250

https://github.com/DLR-RM/stable-baselines3/blob/f4ec0f6afa1a23b0e0b746174cd0074471cc0b89/stable_baselines3/common/distributions.py#L465-L468
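A small check of that (assuming a recent SB3 version): samples of the squashed Gaussian are already passed through tanh, so they land in [-1, 1] without any clipping:

```python
import torch as th

from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution

# The squashing lives in the distribution itself: samples go through tanh.
dist = SquashedDiagGaussianDistribution(action_dim=2)
actions = dist.proba_distribution(mean_actions=th.zeros(1, 2), log_std=th.zeros(2)).sample()
print(actions.abs().max() <= 1.0)  # tensor(True)
```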

The tanh in create_mlp is for the algorithms that don't rely on distributions (DDPG, TD3).
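On the create_mlp side, squash_output=True simply appends a final Tanh module, which is how the TD3/DDPG actor gets a bounded output (again a small sketch, assuming a recent SB3 version):

```python
import torch.nn as nn

from stable_baselines3.common.torch_layers import create_mlp

# create_mlp returns a list of layers; with squash_output=True the last one is a Tanh.
layers = create_mlp(input_dim=4, output_dim=2, net_arch=[32], squash_output=True)
actor = nn.Sequential(*layers)
print(actor[-1])  # Tanh()
```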

To sum up: everything internal uses scaled actions; only the interaction with the env does not, so the predict() method unscales the action to the correct bounds when needed.
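For example, with the [-2, 2] toy bounds used earlier in this thread (illustrative names, reusing the hypothetical model from the first sketch):

```python
import numpy as np

env_action = np.array([1.5], dtype=np.float32)   # in the env bounds [-2, 2]
scaled = model.policy.scale_action(env_action)   # in [-1, 1], used internally
restored = model.policy.unscale_action(scaled)   # back to the env bounds
print(scaled, restored)                          # [0.75] [1.5]
```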