DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Learning action space with stability constraints #617

Closed · danielstankw closed 3 years ago

danielstankw commented 3 years ago

Important Note: We do not do technical support, nor consulting and don't answer personal questions per email. Please post your question on the RL Discord, Reddit or Stack Overflow in that case.

Question

I am using Stable Baselines to learn the parameters of my controller. My action space is an array of shape (1, 36), which gets reshaped into a 6x6 matrix used for controlling the robot. I update the action only once per episode due to the nature of the problem I am trying to solve. As a stability constraint, the matrix has to be positive definite so that it does not cause any instability issues in my custom environment. I am wondering if anyone has experience with this kind of problem and could suggest some possible approaches to tackling it.

I have tried / thought about:

I haven't tried the last one, but I would love to hear some feedback.

P.S. I am actually trying to learn 108 parameters, i.e. three 6x6 matrices, but the simplification above illustrates my issue.


araffin commented 3 years ago

Hello,

> I update the action only once per episode due to the nature of the problem I am trying to solve.

Your problem sounds better suited for black-box optimization and evolution strategies (for instance, CMA-ES).
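A minimal sketch of that direction, using the third-party `cma` package (not part of stable-baselines3); the episode-return function, parameter count, and initial step size below are placeholders for illustration:

```python
import numpy as np
import cma  # third-party package: pip install cma


def episode_return(params: np.ndarray) -> float:
    """Placeholder: run one episode with the 36 parameters reshaped
    into a 6x6 gain matrix and return the total episode reward."""
    gain_matrix = params.reshape(6, 6)
    # ... roll out the custom robot environment with this matrix ...
    return 0.0  # dummy value


# CMA-ES minimizes, so negate the episode return to maximize it.
x0 = np.zeros(36)    # initial mean of the search distribution
sigma0 = 0.5         # initial step size
es = cma.CMAEvolutionStrategy(x0, sigma0)
while not es.stop():
    candidates = es.ask()  # sample a population of parameter vectors
    es.tell(candidates, [-episode_return(np.asarray(c)) for c in candidates])
best_matrix = es.result.xbest.reshape(6, 6)
```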

> As a stability constraint, the matrix has to be positive definite so that it does not cause any instability issues in my custom environment. I am wondering if anyone has experience with this kind of problem and could suggest some possible approaches to tackling it.

Maybe you can search for a factorization that ensures such a constraint?

You should probably take a look at those two threads:

It also seems that you could reduce your search space by constraining your matrix to be symmetric (if that makes sense for your problem).
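One possible factorization, sketched below purely as an illustration: treat the 36 raw action values as a square matrix A and build M = A Aᵀ + εI, which is positive definite (and symmetric) by construction; alternatively, learning only the 21 lower-triangular entries as a Cholesky factor shrinks the search space further:

```python
import numpy as np


def action_to_pd_matrix(action: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """Map a flat (36,) action vector to a 6x6 positive-definite matrix.

    A @ A.T is positive semi-definite for any square A; adding eps * I
    makes it strictly positive definite (and symmetric by construction).
    """
    A = action.reshape(6, 6)
    return A @ A.T + eps * np.eye(6)


def cholesky_action_to_pd_matrix(action: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """Smaller search space: interpret a (21,) action as the lower-triangular
    Cholesky factor L and return L @ L.T + eps * I."""
    L = np.zeros((6, 6))
    L[np.tril_indices(6)] = action
    return L @ L.T + eps * np.eye(6)
```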

danielstankw commented 3 years ago

> Your problem sounds better suited for black-box optimization and evolution strategies (for instance, CMA-ES).

Thank you for your quick reply. I am not familiar with the mentioned approaches, so I will definitely take a closer look at your suggestions.

Thank you for the links. Constraining the search space by requiring the matrix to be symmetric will help with learning, but it will also limit which parameters could work, so for now I will look for a more general approach. One of the links you sent discusses a matrix decomposition that could be useful, so I will try to go in this direction. Thanks a lot :)

danielstankw commented 3 years ago

@araffin With regard to my previous issue: let's say I want the policy to output only positive actions. Is there a way to constrain the policy's action space? Where/how can I define the limits of the action space?

araffin commented 3 years ago

> Is there a way to constrain the policy's action space?

As the output of the policy will be in [-1, 1] (if you follow best practices, see the docs), you can easily rescale it to [0, max] afterward: https://github.com/DLR-RM/stable-baselines3/blob/201fbffa8c40a628ecb2b30fd0973f3b171e6c4c/stable_baselines3/common/policies.py#L366-L375
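For reference, the linked helper performs a linear map between the normalized range and the action-space bounds; a standalone version of the [-1, 1] → [low, high] direction could look like the sketch below (the name and the example bounds are illustrative, not the SB3 API):

```python
import numpy as np


def rescale_action(scaled_action: np.ndarray, low: float, high: float) -> np.ndarray:
    """Linearly map a policy output from [-1, 1] to [low, high].

    Example: low=0.0, high=1000.0 maps a policy output of 0.5 to 750.0.
    """
    return low + 0.5 * (scaled_action + 1.0) * (high - low)
```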

> Where/how can I define the limits of the action space?

The best approach is to fix them to [-1, 1] (via the low and high parameters of the action_space object) for the agent, and then rescale inside your env.
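A sketch of that setup, assuming a Gym-style custom environment; the observation shape, MAX_GAIN bound, and controller call are placeholders:

```python
import gym
import numpy as np


class ControllerEnv(gym.Env):
    """The agent sees a normalized [-1, 1] action space; the physical
    controller gains live in [0, MAX_GAIN] and are rescaled in step()."""

    MAX_GAIN = 1000.0  # placeholder upper bound for the controller parameters

    def __init__(self):
        super().__init__()
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(36,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(12,), dtype=np.float32)

    def reset(self):
        return np.zeros(12, dtype=np.float32)

    def step(self, action):
        # Rescale from [-1, 1] to [0, MAX_GAIN] before using the action.
        gains = 0.5 * (action + 1.0) * self.MAX_GAIN
        gain_matrix = gains.reshape(6, 6)
        # ... apply gain_matrix to the robot controller and compute obs/reward ...
        obs = np.zeros(12, dtype=np.float32)
        reward, done, info = 0.0, True, {}
        return obs, reward, done, info
```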

If you are using SAC/TD3, actions will be squashed to fit the limits using a tanh() transform.

danielstankw commented 3 years ago

@araffin Following up on this issue: as the output of my policy I get a set of actions that are used as my controller parameters. Those action values are in the range [-1, 1], but I then clip them to [0, inf) as my values need to be positive. The issue I have is that the outputs of the policy are always small, i.e. never bigger than 1, while from my experiments I can see that I need actions with a magnitude of about 1000.

How can I force the policy to output actions with bigger values? If I rescale the actions, then in the case of PPO, on_policy_algorithm.py contains the code responsible for obtaining the observation from the environment, sampling the action, and clipping it (linked below).

If I want to rescale the actions, would I have to do it somewhere here, before they get saved in the rollout_buffer, since backpropagation should be done on the actual values used in the problem? Or maybe it doesn't matter and I can do the action rescaling purely in my env? But in that case the policy will never output the correct magnitude of actions; instead, it will output action values that need to be scaled to the desired magnitude to "solve" the problem.

https://github.com/DLR-RM/stable-baselines3/blob/201fbffa8c40a628ecb2b30fd0973f3b171e6c4c/stable_baselines3/common/on_policy_algorithm.py#L161-L178