DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Feature Request] independently configurable learning rates for actor and critic #338

Open stheid opened 3 years ago

stheid commented 3 years ago

🚀 Feature

independently configurable learning rates for actor and critic in AC-style algorithms

Motivation

In the literature the actor is often configured to learn more slowly, so that the critic's responses are more reliable. At the very least, it would be nice if I could let my hyperparameter optimizer decide which learning rates to use for the actor and the critic.

Pitch

https://github.com/DLR-RM/stable-baselines3/blob/65100a4b040201035487363a396b84ea721eb027/stable_baselines3/ddpg/ddpg.py#L12-L26

Additional context

https://spinningup.openai.com/en/latest/algorithms/ddpg.html#documentation-pytorch-version

Webbah commented 3 years ago

Seems like in the original Silver paper they use different learning rates in the experiments, too:


https://arxiv.org/abs/1509.02971 p.11

Miffyli commented 3 years ago

Since the original DDPG paper used this approach, I think this should be included here as well down the line. However, I am not sure about including it outside DDPG: TD3 did not use this as far as I am aware (@araffin ?), and to be honest I have never seen this type of trickery elsewhere (do share references if other algorithms use this).

Shared parameters also become a problem here, and I am not sure what the behaviour would be like. Quickly reading through the DDPG paper, I did not see a note on whether they shared parameters between the actor and the critic, so I assume they used separate networks.

araffin commented 3 years ago

Hello,

independently configurable learning rates for actor and critic in AC-style algorithms

so you are proposing that for DDPG, TD3 and SAC? (it does not apply to PPO/A2C as the value and policy share the same optimizer)

In the literature the actor is often configured to learn more slowly, so that the critic's responses are more reliable. [...] it would be nice if I could let my hyperparameter optimizer decide which learning rates to use for the actor and the critic.

I have mixed feelings about that. If your remark is only about DDPG, then I can only highly recommend using TD3 (DDPG + some tricks to stabilize it), which integrates "delayed" policy updates exactly for that purpose. You can also take a look at SAC or, even better, TQC, which is the current state-of-the-art algorithm for continuous control and is implemented in our contrib repo: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib
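
For reference, TD3's delayed policy updates are controlled by the policy_delay argument; a minimal sketch (values are illustrative, not a recommendation):

from stable_baselines3 import TD3

# The actor (and target networks) are updated only every `policy_delay` critic updates,
# which slows the actor relative to the critic without a separate learning rate.
model = TD3("MlpPolicy", "Pendulum-v1", learning_rate=1e-3, policy_delay=2, verbose=1)
model.learn(total_timesteps=10_000)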

On one hand, the request seems reasonable. On the other hand, this would add complexity (we would need to support a schedule for each of them) and won't necessarily help your hyperparameter optimization, as it would make the search space wider.

stheid commented 3 years ago

The search space does not get that much more complex. I would have a "base learning rate" and an "actor learning rate scale" that rescales the base learning rate for one of the networks by a factor.

Modern black-box optimizers like Optuna are quite efficient at learning that.
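
As an illustration (this is not part of SB3, and the parameter names below are hypothetical), such a search space could look like this with Optuna:

import optuna

def sample_learning_rates(trial: optuna.Trial) -> dict:
    # Single base learning rate, sampled on a log scale
    base_lr = trial.suggest_float("base_lr", 1e-5, 1e-3, log=True)
    # Factor (<= 1) that slows the actor down relative to the critic
    actor_lr_scale = trial.suggest_float("actor_lr_scale", 0.1, 1.0, log=True)
    return {"critic_lr": base_lr, "actor_lr": base_lr * actor_lr_scale}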

For research it is a bit of a problem to use stable-baselines, as I cannot reproduce the work of other publications, since most of the literature does exactly that. TD3 is nice, but using a different algorithm is less comparable. To be totally honest, I have not looked that much into TD3 though.

The pointer to the contrib repo is definitely interesting, and I totally understand if you don't deem this useful for the purpose of this library. However, I might need to switch to Acme then...

Thanks for your fast reply in any case.

araffin commented 3 years ago

I cannot reproduce the work of other publications, since most of the literature does exactly that. TD3 is nice, but using a different algorithm is less comparable.

TD3 is literally DDPG + tricks, so it will give you results at least as good as DDPG, and usually better (or much better). So if you use TD3 for your baselines (when reproducing previous work), I would say it is still a fair comparison.

The pointer to the contrib repo is definitely interesting, and I totally understand if you don't deem this useful for the purpose of this library. However, I might need to switch to Acme then...

To sum up the discussion: you would like to have two different learning rates for performance and comparison reasons. My main concern is that such decoupling of learning rates is usually not needed, especially with the most recent algorithms (DDPG was published in 2015), and it may therefore add unnecessary complexity to our codebase.

But if you really need two different learning rates, you can simply define a custom DDPG class that overrides that method:

https://github.com/DLR-RM/stable-baselines3/blob/65100a4b040201035487363a396b84ea721eb027/stable_baselines3/common/base_class.py#L245-L259

called here: https://github.com/DLR-RM/stable-baselines3/blob/65100a4b040201035487363a396b84ea721eb027/stable_baselines3/td3/td3.py#L129-L130

that would look like:

from stable_baselines3 import DDPG
from stable_baselines3.common.utils import update_learning_rate

class CustomDDPG(DDPG):
    def __init__(self, policy, env, *args, actor_lr=1e-5, critic_lr=1e-4, **kwargs):
        super().__init__(policy, env, *args, **kwargs)
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr

    def _update_learning_rate(self, optimizers):
        # optimizers is [actor_optimizer, critic_optimizer] when called from train()
        actor_optimizer, critic_optimizer = optimizers
        # Apply a separate (here: fixed) learning rate to each network; use schedules here if needed
        update_learning_rate(actor_optimizer, self.actor_lr)
        update_learning_rate(critic_optimizer, self.critic_lr)
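
For illustration, a usage sketch (hypothetical learning rate values):

model = CustomDDPG("MlpPolicy", "Pendulum-v1", actor_lr=1e-4, critic_lr=1e-3)
model.learn(total_timesteps=10_000)
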
stheid commented 3 years ago

Thanks for the help.

Just one more general note: you call it DDPG, but it's actually a variant of the original implementation. This is a little misleading in my opinion. Of course it's perfectly valid to do so, but it would be generally useful if the documentation clarified exactly which design changes were made with respect to the source publication.

I am still not convinced, but no one prevents me from forking, so I don't think further discussion would be a valuable use of time, as I fully understand your standpoint :)

araffin commented 1 year ago

Re-opening as it might be helpful to be able to tune qf_learning_rate as done in https://github.com/araffin/sbx. We need to see how much complexity it adds though.
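
In SBX that looks roughly like the sketch below (the qf_learning_rate argument name is taken from this comment; the exact signature may differ):

from sbx import SAC

# Separate learning rate for the Q-function (critic); the actor uses learning_rate
model = SAC("MlpPolicy", "Pendulum-v1", learning_rate=3e-4, qf_learning_rate=1e-3)
model.learn(total_timesteps=10_000)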

jake321southall commented 9 months ago

Hello,

independently configurable learning rates for actor and critic in AC-style algorithms

so you are proposing that for DDPG, TD3 and SAC? (it does not apply to PPO/A2C as the value and policy share the same optimizer)

Hi, I want to use different learning rates for the policy and value networks in PPO. Can I not do this because they share the optimizer? If so, can I change the optimizer so that they don't, or would you not recommend this? I'm using SB3 1.6.2, fyi.

araffin commented 9 months ago

can I not do this because they share the optimizer?

yes

If so, can I change the optimizer so that they don't, or would you not recommend this?

you can give it a try (you will need to fork SB3 for that), please report any result here ;)

EDIT: it will probably not work if the actor and critic share weights
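
As a rough sketch of the underlying mechanism (plain PyTorch, not the SB3 API), a single optimizer can still apply different learning rates through per-parameter groups, provided the actor and critic parameters can be separated (i.e. no shared weights):

import torch
import torch.nn as nn

# Toy actor and critic heads with no shared weights (an assumption here)
actor = nn.Linear(4, 2)
critic = nn.Linear(4, 1)

# One optimizer, two parameter groups with different learning rates
optimizer = torch.optim.Adam([
    {"params": actor.parameters(), "lr": 1e-4},   # slower actor
    {"params": critic.parameters(), "lr": 1e-3},  # faster critic
])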

cgr71ii commented 7 months ago

Seems like in the original Silver paper they use different learning rates in the experiments, too:

Just one more general note: you call it DDPG, but it's actually a variant of the original implementation. This is a little misleading in my opinion. Of course it's perfectly valid to do so, but it would be generally useful if the documentation clarified exactly which design changes were made with respect to the source publication.

Hello! I came to this issue exactly for this reason, and I totally agree that a note in the docs about this deviation from the original DDPG would prevent people from thinking that the same paper hyperparameters are being used (as indicated in https://github.com/DLR-RM/stable-baselines3/issues/1562#issuecomment-1596809692). Is this the only difference from the paper hyperparameters?

Also, I'd like to point out why it would be useful in my case to include this change in DDPG. I need to use DDPG in a paper, and I'd like to use the original hyperparameters, but I need DDPG and not TD3 because I want to use the Wolpertinger architecture, which is integrated using DDPG in the original paper (https://arxiv.org/pdf/1512.07679.pdf).

Is this feature request still under consideration (even though some DDPG hyperparameters have recently been changed https://github.com/DLR-RM/stable-baselines3/pull/1785)?

araffin commented 7 months ago

Is this feature request still under consideration

yes, but help from contributors is needed. Otherwise, you can have a look at SBX: https://github.com/araffin/sbx

Is this the only difference from the paper hyperparameters?

I think so. The rest is defined in the RL Zoo (action noise and observation normalization).