DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Does the action clipping method affect model's performance? #859

Closed BlueBlueGrey closed 2 years ago

BlueBlueGrey commented 2 years ago

Important Note: We do not do technical support or consulting, and we do not answer personal questions by email. Please post your question on the RL Discord, Reddit or Stack Overflow in that case.

Question

I see that on_policy_algorithm uses np.clip to avoid out-of-bounds errors, while off_policy_algorithm uses a Tanh activation to squash actions into (-1, 1). Does the choice of action clipping method affect the model's performance?
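For readers comparing the two, here is a minimal sketch (not SB3 internals) of how an unbounded network output can be mapped into a Box action space; the bounds and the raw action value are made up for illustration:

```python
import numpy as np

low, high = np.array([-2.0]), np.array([2.0])  # hypothetical action bounds
raw_action = np.array([3.5])                   # unbounded network output

# On-policy style: sample from an unbounded Gaussian, then clip before stepping the env.
clipped = np.clip(raw_action, low, high)

# Off-policy style: squash with tanh into (-1, 1), then rescale to the env bounds.
squashed = low + 0.5 * (np.tanh(raw_action) + 1.0) * (high - low)

print(clipped, squashed)  # [2.] [~1.996]
```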


Miffyli commented 2 years ago

@araffin can give a more detailed response (this is his area of expertise), but as far as I know these are the default approaches taken by other researchers in continuous action spaces, so I feel they are a pretty robust choice.

araffin commented 2 years ago

Does the action clipping method affect the effect of the model?

What do you mean exactly? Does it affect its performance? As long as your action space is normalized, I haven't found evidence that it has an impact yet (you can actually also activate proper bound handling with on-policy algorithms when using gSDE by passing squash_output, see https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/policies.py#L400)
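For illustration, a minimal sketch of enabling that option, assuming a standard continuous-control env such as Pendulum-v1; squash_output is passed through policy_kwargs and is only supported together with gSDE (use_sde=True):

```python
import gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")

model = PPO(
    "MlpPolicy",
    env,
    use_sde=True,  # squash_output requires gSDE for on-policy algorithms
    policy_kwargs=dict(squash_output=True),
    verbose=1,
)
model.learn(total_timesteps=10_000)
```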

ghost commented 2 years ago

@BlueBlueGrey The answer is no in my experience. The more elegant tanh (squash_output=True) can sometimes perform empirically worse than clipping, due to the difficulty of reaching the action boundaries or due to its sensitivity around zero. I have not yet found a situation where one works and the other doesn't, so I would personally treat this as an implementation detail.
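To make the two caveats concrete, here is a small numerical check of standard tanh behaviour (not tied to any SB3 code): the boundary is only reached asymptotically, while the slope is largest around zero.

```python
import numpy as np

for x in [0.0, 0.1, 2.0, 5.0]:
    y = np.tanh(x)
    slope = 1.0 - y ** 2  # derivative of tanh
    print(f"x={x:>4}: tanh(x)={y:.4f}, slope={slope:.4f}")

# x= 0.0: tanh(x)=0.0000, slope=1.0000   <- most sensitive around zero
# x= 5.0: tanh(x)=0.9999, slope=0.0002   <- boundary only approached asymptotically
```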