DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Feature Request] Modular Policies #544

Closed (09tangriro closed this issue 2 years ago)

09tangriro commented 2 years ago

🚀 Feature

Hi again, I think the policies could be overhauled somewhat to increase modularity and reduce repeated code. In particular, separate base Actor and Critic modules could be defined and then combined in different ways to create the various ActorCritic policies.

Motivation

This would be really useful for fast prototyping of new algorithms, since it would reduce the time spent writing out each algorithm's policy from scratch.

Pitch

Add base Actor and Critic classes with derived classes (e.g. ContinuousCritic) that are flexible and modular enough to be easily combined in new algorithm policies.
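To make this concrete, here is a rough sketch of what such reusable building blocks could look like (all class names below are hypothetical, not existing SB3 classes):

```python
import torch as th
from torch import nn


class Actor(nn.Module):
    """Hypothetical generic actor: maps extracted features to actions."""

    def __init__(self, feature_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, features: th.Tensor) -> th.Tensor:
        return self.net(features)


class Critic(nn.Module):
    """Hypothetical generic critic: maps extracted features to a value estimate."""

    def __init__(self, feature_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features: th.Tensor) -> th.Tensor:
        return self.net(features)


class MyAlgorithmPolicy(nn.Module):
    """A new algorithm's policy assembled from the reusable pieces above."""

    def __init__(self, feature_dim: int, action_dim: int):
        super().__init__()
        self.actor = Actor(feature_dim, action_dim)
        self.critic = Critic(feature_dim)

    def forward(self, features: th.Tensor):
        return self.actor(features), self.critic(features)
```

A new algorithm's policy would then mainly wire these modules together instead of re-implementing them.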

Alternatives

N/A


araffin commented 2 years ago

Hello, to understand the scope of the request a bit more: would that be for any algorithm, or only for off-policy/on-policy algorithms?

09tangriro commented 2 years ago

In my head I was thinking about any algorithm that uses an actor or critic neural net, or some combination of both. I'm not aware of many deep RL algorithms that aren't off-policy or on-policy, but if they also use the concept of an actor and/or a critic, then I would argue these can also be encapsulated.

araffin commented 2 years ago

> In my head I was thinking about any algorithm that uses an actor or critic neural net, or some combination of both.

So for off-policy algorithms, I think that is already the case: you can override make_actor() and make_critic() to return any object that fits.
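For example, something along these lines (an untested sketch based on the current td3/policies.py; exact signatures may differ between versions, and CustomActor is just a placeholder):

```python
from typing import Optional

from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
from stable_baselines3.td3.policies import Actor, TD3Policy


class CustomActor(Actor):
    """Hypothetical actor variant; keeps the stock Actor interface."""


class CustomTD3Policy(TD3Policy):
    def make_actor(self, features_extractor: Optional[BaseFeaturesExtractor] = None) -> Actor:
        # Same pattern as the stock implementation, but returning the custom actor.
        actor_kwargs = self._update_features_extractor(self.actor_kwargs, features_extractor)
        return CustomActor(**actor_kwargs).to(self.device)
```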

For on-policy algorithms, it is trickier, as we allow a shared architecture to save computation. I recently did an example for an RL course though. You can find it here: https://github.com/osigaud/stable-baselines3/blob/master/stable_baselines3/reinforce/policies.py

09tangriro commented 2 years ago

Maybe I've missed it, but I can only see the make_actor() and make_critic() methods in the policy implementations in td3/policies.py and sac/policies.py, not in common/policies.py. I'd argue that, for making new algorithms, the tools should be in the common module rather than under the individual algorithm implementations.

Currently, common/policies.py contains the BaseModel, BasePolicy, ActorCriticPolicy and ContinuousCritic classes. It would be useful if there were also Actor and Critic base modules which could be inherited when making new policies. For example, SAC and TD3 both have an Actor class which is called by the main policy class; this is a repeated concept that has to be coded over and over again in any new algorithm, unless you search through all the individual algorithm implementations to see if you can inherit one of them.

In your example, I would argue it would be ideal not to need to code the Actor and Critic classes so specifically, and instead just have a REINFORCEPolicy which calls generalized Actor and Critic base classes located under common/policies.py.

I think it should still be possible to have a shared architecture for the on-policy algorithms, since each policy is also split into modules (feature extractor, main network and heads), but newer on-policy algorithms (e.g. PPG: https://arxiv.org/abs/2009.04416v1) don't have a shared architecture at all and in fact perform better. With the current implementation, it takes longer than it should to implement such algorithms because of the way the policies jump from BasePolicy to ActorCriticPolicy without the middle base Actor and Critic also being implemented.

Do have a look at the following paper: https://arxiv.org/abs/2011.07537, which introduces Tonic, a library that aims to solve this problem. I would argue the design philosophy of that library is (at least right now) much more conducive to fast prototyping of new algorithms because of how well modularized it is. However, it is very new, and I think it would take longer to add all the features of SB3 to Tonic than to take the modular policies of Tonic and add them to SB3.

Of course, the caveat to all this is that I haven't actually implemented this myself yet; I just wanted some confirmation that, if I can make it work, it would be a useful feature that could be added to the main library via a PR.

araffin commented 2 years ago

> Maybe I've missed it, but I can only see the make_actor() and make_critic() methods in the policy implementations in td3/policies.py and sac/policies.py, not in common/policies.py.

Yes, that's what I meant by "for off-policy algorithms". The common/policies.py is mainly for on-policy algorithms.

> For example, SAC and TD3 both have an Actor class which is called by the main policy class; this is a repeated concept that has to be coded over and over again in any new algorithm, unless you search through all the individual algorithm implementations to see if you can inherit one of them.

That's normal. Each algorithm has its own way of implementing the Actor. For instance, although the SAC and TD3 actors look alike, one relies on a probability distribution while the other does not, and this is a huge difference!

At that level of customization, a good knowledge of the RL algorithm used is required.
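As a simplified illustration (not SB3's actual classes), a TD3-style actor outputs the action directly, while a SAC-style actor parameterizes a distribution and samples from it:

```python
import torch as th
from torch import nn


class DeterministicActor(nn.Module):
    """TD3-style: the network output *is* the action."""

    def __init__(self, feature_dim: int, action_dim: int):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(feature_dim, action_dim), nn.Tanh())

    def forward(self, features: th.Tensor) -> th.Tensor:
        return self.mu(features)


class StochasticActor(nn.Module):
    """SAC-style: the network parameterizes a Gaussian and actions are sampled."""

    def __init__(self, feature_dim: int, action_dim: int):
        super().__init__()
        self.mean = nn.Linear(feature_dim, action_dim)
        self.log_std = nn.Linear(feature_dim, action_dim)

    def forward(self, features: th.Tensor) -> th.Tensor:
        dist = th.distributions.Normal(self.mean(features), self.log_std(features).exp())
        # SAC additionally squashes the sample with tanh and corrects the log-prob.
        return th.tanh(dist.rsample())
```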

> Do have a look at the following paper: https://arxiv.org/abs/2011.07537, which introduces Tonic, a library that aims to solve this problem.

I do know Tonic ;) And in fact, one of our design choices is not to be too modular, but still avoid some code duplication with object-oriented programming.

> but newer on-policy algorithms (e.g. PPG: https://arxiv.org/abs/2009.04416v1) don't have a shared architecture at all and in fact perform better.

PPG is quite special and would in fact not fit any current definition of Actor, as its actor has an additional output (the value).

> With the current implementation, it takes longer than it should to implement such algorithms because of the way the policies jump from BasePolicy to ActorCriticPolicy without the middle base Actor and Critic also being implemented.

I do agree that the current implementation is not the cleanest for on-policy algorithms, but as we show in the documentation, you can create custom architectures for both with only a few edits: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html#advanced-example
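Condensed from that documentation example (attribute and method names follow the docs, e.g. latent_dim_pi and _build_mlp_extractor(), and may differ slightly between versions): the policy and value networks are kept separate inside a custom extractor, and only _build_mlp_extractor() needs to be overridden:

```python
import torch as th
from torch import nn

from stable_baselines3.common.policies import ActorCriticPolicy


class CustomNetwork(nn.Module):
    """Separate (non-shared) policy and value networks after the features extractor."""

    def __init__(self, feature_dim: int, last_layer_dim_pi: int = 64, last_layer_dim_vf: int = 64):
        super().__init__()
        # Required by ActorCriticPolicy to size the action and value heads.
        self.latent_dim_pi = last_layer_dim_pi
        self.latent_dim_vf = last_layer_dim_vf
        self.policy_net = nn.Sequential(nn.Linear(feature_dim, last_layer_dim_pi), nn.ReLU())
        self.value_net = nn.Sequential(nn.Linear(feature_dim, last_layer_dim_vf), nn.ReLU())

    def forward(self, features: th.Tensor):
        return self.forward_actor(features), self.forward_critic(features)

    def forward_actor(self, features: th.Tensor) -> th.Tensor:
        return self.policy_net(features)

    def forward_critic(self, features: th.Tensor) -> th.Tensor:
        return self.value_net(features)


class CustomActorCriticPolicy(ActorCriticPolicy):
    def _build_mlp_extractor(self) -> None:
        self.mlp_extractor = CustomNetwork(self.features_dim)
```

The custom class can then be passed directly as the policy, e.g. PPO(CustomActorCriticPolicy, "CartPole-v1").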

09tangriro commented 2 years ago

Yes, good points, there are always compromises that must be made. I'll close the issue :)