DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] PPO Action Scaling - custom Env #1293

Closed: danielstankw closed this issue 1 year ago

danielstankw commented 1 year ago

❓ Question

Hi, there are a few issues related to my problem, but none of them answers my question, so I am opening a new one.

I am using PPO to learn the parameters of my controller, so the action = the controller parameters. I saw that it is recommended to re-scale the action space to [-1, 1] and then map the actions back to the desired range inside the environment. Questions:

  1. If I wish for my parameters/action space to only have positive values, is it still recommended to scale the values to the [-1, 1] range and then unscale them in my env?
  2. How is it recommended to scale the action space if I am trying to learn parameters and do not want to limit the algorithm's exploration? My intuition tells me to set the upper and lower bounds to some values, but if my intuition is wrong this could prevent finding the optimal parameters.
  3. In policies.py it seems that when I use self.action_space = spaces.Box(low=low, high=high), the action space isn't re-scaled to [-1, 1] but clipped. How can I use self.squash_output with PPO? I saw that it is one of the parameters of policies.ActorCriticPolicy, but how can I set it from the PPO class?
            if self.squash_output:
                # Rescale to proper domain when using squashing
                actions = self.unscale_action(actions)
            else:
                # Actions could be on arbitrary scale, so clip the actions to avoid
                # out of bound error (e.g. if sampling from a Gaussian distribution)
                actions = np.clip(actions, self.action_space.low, self.action_space.high)

Thank you for your help :)


araffin commented 1 year ago

If I wish for my parameters/ action space to only have positive values its still recommended to scale the values in the [-1,1] range and then unscale them in my env?

yes, see https://youtu.be/Ikngt0_DXJg?t=738 for a longer explanation
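For concreteness, a minimal sketch of that pattern (the class name, parameter bounds, and dimensions are made up for illustration, and it uses the gymnasium API): the agent always acts in [-1, 1], and the env linearly maps the action back to the positive controller range before using it.

    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces

    class ControllerEnv(gym.Env):
        """Toy env: the agent acts in [-1, 1]; the env maps back to positive bounds."""

        # real (positive) controller parameter bounds, purely illustrative
        PARAM_LOW = np.array([0.0, 0.0], dtype=np.float32)
        PARAM_HIGH = np.array([10.0, 100.0], dtype=np.float32)

        def __init__(self):
            super().__init__()
            self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
            self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32)

        def _unscale(self, action):
            # linear map from [-1, 1] to [PARAM_LOW, PARAM_HIGH]
            return self.PARAM_LOW + 0.5 * (action + 1.0) * (self.PARAM_HIGH - self.PARAM_LOW)

        def step(self, action):
            params = self._unscale(np.asarray(action, dtype=np.float32))
            # ... apply `params` to the controller and compute obs/reward here ...
            obs = np.zeros(4, dtype=np.float32)
            return obs, 0.0, True, False, {}

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            return np.zeros(4, dtype=np.float32), {}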

How is it recommended to scale action space if I try to learn parameters and I do not want to limit the algorithm from learning and exploring?

I actually answer those questions in https://youtu.be/Ikngt0_DXJg (how to define a custom env, how to choose the action space)
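As an aside, the env checker warns about common pitfalls in custom envs, including Box action spaces that are not symmetric and normalized. A quick sanity check (reusing the ControllerEnv sketch above) looks like this:

    from stable_baselines3.common.env_checker import check_env

    # check_env validates the spaces and API of a custom env and warns,
    # among other things, when a Box action space is not normalized to [-1, 1].
    check_env(ControllerEnv(), warn=True)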

How can I utilize the self.squash_output when using PPO,

you need to use gSDE (use_sde=True, you can find some examples in the RL Zoo: https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/ppo.yml#L28) and then pass squash_output=True via the policy_kwargs. But from experience, it doesn't change much compared to clipping (the default for PPO).
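A minimal sketch of what that looks like (assuming env is the custom environment defined above; the hyperparameters are placeholders, see the RL Zoo link for tuned ones):

    from stable_baselines3 import PPO

    # gSDE + squashing: squash_output is forwarded to ActorCriticPolicy
    # through policy_kwargs; it requires use_sde=True.
    model = PPO(
        "MlpPolicy",
        env,
        use_sde=True,
        policy_kwargs=dict(squash_output=True),
    )
    model.learn(total_timesteps=100_000)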

Anyway, for continuous actions, I would recommend using SAC/TD3/TQC instead of PPO (I also talk about that in the video); they all handle limits properly (with squashing).
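For example, SAC squashes its Gaussian policy with a tanh and rescales the sampled actions to the env's Box bounds for you, so no manual clipping is needed (sketch, same env assumption as above):

    from stable_baselines3 import SAC

    # SAC's policy uses a squashed (tanh) Gaussian, so sampled actions always
    # stay inside the declared action space without clipping.
    model = SAC("MlpPolicy", env)
    model.learn(total_timesteps=100_000)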

danielstankw commented 1 year ago

Thank you very much for the explanation!