DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] squash_output option not works for PPO #1540

Closed: DDDOH closed this issue 11 months ago

DDDOH commented 1 year ago

❓ Question

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

n_env = 4
vec_env = make_vec_env('Ant-v4', n_envs=n_env)
model = PPO("MlpPolicy", vec_env, verbose=1, policy_kwargs={'squash_output': True})

When setting squash_output to True and not using gSDE (generalized State-Dependent Exploration), the action distribution is still DiagGaussianDistribution rather than SquashedDiagGaussianDistribution.

The documentation says:

> squash_output (bool) – Whether to squash the output using a tanh function, this allows to ensure boundaries when using gSDE.

It is not clear from this that squash_output will NOT work when gSDE is not used.

I suggest either making the documentation clearer, or changing the code so that squash_output also works without gSDE. I think the second option makes more sense, since squash_output is still useful without gSDE.
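A minimal sketch to check the behavior described above (assuming that model.policy.action_dist reflects the distribution actually used for sampling, in the SB3 version this issue was opened against):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

vec_env = make_vec_env('Ant-v4', n_envs=4)
model = PPO("MlpPolicy", vec_env, policy_kwargs={'squash_output': True})

# Prints "DiagGaussianDistribution": squash_output alone does not switch the
# policy to a squashed distribution when gSDE is disabled.
print(type(model.policy.action_dist).__name__)
```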


araffin commented 1 year ago

> I suggest either making the documentation clearer, or changing the code so that squash_output also works without gSDE.

I would be happy to receive a PR =) (for both cases, but at least for the doc)

> since squash_output is still useful without gSDE.

There are some discussions in the SB2 and SB3 repos about handling action-space limits, but so far (and according to a survey paper), clipping doesn't impact performance much (although it is not pretty).
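For context, a minimal sketch of the clipping being referred to, assuming a Box action space (SB3's on-policy algorithms apply an equivalent np.clip to the unbounded Gaussian sample before stepping the environment):

```python
import numpy as np
import gymnasium as gym

env = gym.make("Ant-v4")
low, high = env.action_space.low, env.action_space.high

# An unbounded Gaussian policy can sample actions outside [low, high];
# the sample is simply clipped to the limits before env.step().
raw_action = np.random.normal(loc=0.0, scale=2.0, size=env.action_space.shape)
clipped_action = np.clip(raw_action, low, high)
```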

tobirohrer commented 1 year ago

I would like to work on this one and can prepare a PR in the upcoming days.

tobirohrer commented 1 year ago

Sorry, but I do not completely understand the problem and the expected behavior.

When I look for usages of SquashedDiagGaussianDistribution, I can only find one use in sac/policies.py, meaning that no matter which parameters I pass to PPO, I would never get SquashedDiagGaussianDistribution as the action distribution.
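For reference, a quick way to confirm that single usage (sketch; assumes model.policy.actor.action_dist reflects the distribution used by SAC's actor):

```python
from stable_baselines3 import SAC

model = SAC("MlpPolicy", "Pendulum-v1")
# SAC always squashes its Gaussian with tanh, independent of any policy kwarg.
print(type(model.policy.actor.action_dist).__name__)  # SquashedDiagGaussianDistribution
```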

araffin commented 1 year ago

> Meaning that no matter which parameters I pass to PPO, I would never get SquashedDiagGaussianDistribution as the action distribution.

PPO can also use gSDE (use_sde=True), and the gSDE distribution has an option for a squashed (tanh) Gaussian.
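A small sketch of that selection, assuming make_proba_distribution is what the on-policy policies use to pick their distribution: without gSDE a Box action space gets DiagGaussianDistribution, while the gSDE path uses StateDependentNoiseDistribution, which is the variant that accepts squash_output=True:

```python
from gymnasium import spaces
from stable_baselines3.common.distributions import make_proba_distribution

box = spaces.Box(low=-1.0, high=1.0, shape=(8,))

# Without gSDE: plain DiagGaussianDistribution (no squashing available).
print(type(make_proba_distribution(box, use_sde=False)).__name__)
# With gSDE: StateDependentNoiseDistribution, which can squash via tanh.
print(type(make_proba_distribution(box, use_sde=True)).__name__)
```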

araffin commented 11 months ago

Closed by https://github.com/DLR-RM/stable-baselines3/pull/1652

namheegordonkim commented 1 month ago

> clipping doesn't impact performance much (although it is not pretty).

Are you sure about this? More complex unsupervised RL or imitation learning applications like GAIL seem to suffer badly from distribution drift and undefined out-of-distribution (OOD) predictions that saturate the action means extremely quickly without squashing. In my experience, clipping has to be coupled with some kind of explicit action-magnitude regularization, defined either as a reward term or as a penalty term within the PPO loss.
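For illustration, a minimal sketch of the reward-term variant mentioned above (a hypothetical wrapper, not part of SB3; the coefficient is an assumption and task-dependent):

```python
import numpy as np
import gymnasium as gym


class ActionMagnitudePenalty(gym.Wrapper):
    """Hypothetical wrapper: subtract a small penalty proportional to the
    squared action norm, to discourage actions saturating at the bounds."""

    def __init__(self, env: gym.Env, coef: float = 1e-3):
        super().__init__(env)
        self.coef = coef

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        reward -= self.coef * float(np.sum(np.square(action)))
        return obs, reward, terminated, truncated, info
```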