DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Feature Request] give users a way to provide their own exploration noise function? #1102

Closed: jerabaul29 closed this issue 1 year ago

jerabaul29 commented 1 year ago

🚀 Feature

Let users provide, if they want, a function that is used to sample the exploration noise. The API could be something like the following (it may be very naive), so that the function can collect data from the agent and its policies, or take any arguments the user wants:

def user_noise(agent, *args, **kwargs):
    """This user_noise function will be called at each step to generate the exploration noise."""
    return exploration_noise  # noise with the same shape as the action

Motivation

The gSDE paper is extremely interesting; however, in some cases, using the second-to-last policy layer may be a suboptimal choice. Offering an API to set other, more specific exploration noise functions that depend on the agent internals exactly in the way the user wants would be great.

qgallouedec commented 1 year ago

I understand your need. Have you considered the following workaround:

import numpy as np

from stable_baselines3 import DDPG
from stable_baselines3.common.noise import ActionNoise

class MyNoise(ActionNoise):
    def set_model(self, model):
        self.model = model

    def __call__(self) -> np.ndarray:
        # Compute the noise here, using any internals of self.model you need.
        ...

action_noise = MyNoise()
model = DDPG("MlpPolicy", "Pendulum-v1", action_noise=action_noise)
action_noise.set_model(model)

jerabaul29 commented 1 year ago

Aaah, this is nice, seems to be exactly what I was asking for, right? :) Or do you see any caveats / limitations?

I was initially looking at the PPO algorithm, and there I could not find a similar feature, which is why I asked. I am used to PPO, which is why I looked at it first, but I guess I could use SAC instead... It looks like A2C does not have the feature either. I have not looked at the other agents.

I think the conclusion is that, at the moment, this is a tiny bit "inconsistent" between agents: some have this argument and some do not, according to the documentation at https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html and the pages for the other agent kinds. Is this correct? Could this indicate that some standardization (making this argument available consistently across agents) would be useful? :)

jerabaul29 commented 1 year ago

@lguas this may be relevant if there is more interest in looking into SDE :) .

qgallouedec commented 1 year ago

About noise for on-policy algorithms: https://github.com/DLR-RM/stable-baselines3/issues/368#issuecomment-808370595

qgallouedec commented 1 year ago

Could this indicate that some standardization (making this argument available consistently across agents) would be useful?

Off-policy: has an action_noise parameter.
On-policy: does not.
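
To make this concrete, here is a minimal sketch of how the existing action_noise hook is used with an off-policy algorithm (the environment and noise scale are just illustrative):

import numpy as np

from stable_baselines3 import TD3
from stable_baselines3.common.noise import NormalActionNoise

# Off-policy algorithms (DDPG, TD3, SAC, ...) accept an action_noise argument.
n_actions = 1  # Pendulum-v1 has a single continuous action
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = TD3("MlpPolicy", "Pendulum-v1", action_noise=action_noise)
model.learn(total_timesteps=10_000)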

jerabaul29 commented 1 year ago

Thanks for the link. My 2 cents:

qgallouedec commented 1 year ago

The intuition behind not using action noise for on-policy agents is that an on-policy algorithm learns from the actions it has taken. However, if you add noise to the action, it is no longer "taken by the agent" but by the agent augmented with noise. So you introduce a bias.
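
A rough way to see this bias (my own sketch of the standard argument, not something stated above): the on-policy policy gradient estimate

\nabla_\theta J(\theta) \approx \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}_t \right]

assumes the logged actions a_t were drawn from the current policy \pi_\theta. If the executed action is instead a_t = \tilde{a}_t + \epsilon_t with external noise \epsilon_t, the data comes from a different behavioral distribution, so the expectation is taken under the wrong sampling distribution and the gradient estimate is biased.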

qgallouedec commented 1 year ago

Or do you see any caveats / limitations?

It works, although it's not very satisfying because it feels a bit circular (the noise object needs the model, which is itself created with the noise object). Let me know if you notice any caveat.

jerabaul29 commented 1 year ago

The intuition behind not using action noise for on-policy agents is that an on-policy algorithm learns from the actions it has taken. However, if you add noise to the action, it is no longer "taken by the agent" but by the agent augmented with noise. So you introduce a bias.

Yes, but in the end, for example for PPO, we still end up adding exploration noise, and this works better than without, am I correct? :) So the theory that "action distribution is exploration" for on-policy agents is not really how things work in practice, and some on-policy algorithms (like PPO) actually work well "slightly off-policy", right? :)

Thanks for confirming that it should work :) . I will let you know.

araffin commented 1 year ago

The gSDE paper is extremely interesting; however, in some cases, using the second-to-last policy layer may be a suboptimal choice. What if I want to use some SVD, or an autoencoder, to generate a compressed state and use that instead as the input to the gSDE noise generation? :)

I did some experiments (unpublished) where I replaced the last layer with other intermediate layers, but as shown in the paper, what matters for both performance and smoothness is not the state dependence but rather the "noise repeat" interval (sde_sample_freq).

The current gSDE implementation can be seen as doing parameter space exploration but only for the last layer of the actor.
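
For reference, a minimal sketch of how gSDE and the noise-repeat interval are enabled in SB3 (the environment and values are just illustrative):

from stable_baselines3 import PPO

# use_sde=True enables gSDE; sde_sample_freq controls how often the exploration
# noise matrix is resampled (-1 means: sample it only once at the start of the rollout).
model = PPO("MlpPolicy", "Pendulum-v1", use_sde=True, sde_sample_freq=4)
model.learn(total_timesteps=10_000)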

Is there a compelling reason for having a different API for off-policy vs on-policy exploration noise (in the form of having or not having an action_noise parameter)?

On-policy algorithms require the data to be collected with the current policy (hence the "on"), so you cannot add external noise to the actions (the behavioral policy that collects data would then be different from the one you use to do the policy gradient updates).

araffin commented 1 year ago

Yes, but in the end, for example for PPO, we still end up adding exploration noise, and this works better than without, am I correct?

PPO samples actions from its policy; it does not add external exploration noise. But it is true that the PPO update looks like an off-policy corrected update.

jerabaul29 commented 1 year ago

PPO samples actions from its policy; it does not add external exploration noise. But it is true that the PPO update looks like an off-policy corrected update.

Mmmh, but for PPO you do implement both SDE noise and Gaussian noise, right? These are indeed additional noise on top of the policy noise, or am I missing something? (Or is that what you mean by an off-policy corrected update? :) )

jerabaul29 commented 1 year ago

The current gSDE implementation can be seen as doing parameter space exploration but only for the last layer of the actor.

Yes, I understand that :) . I think that, depending on the case, this may be a bit restrictive.

I did some experiments (unpublished) where I replaced the last layer with other intermediate layers, but as shown in the paper, what matters for both performance and smoothness is not the state dependence but rather the "noise repeat" interval (sde_sample_freq).

Interesting, thank you for this information. Still, I think that in some cases it could make sense to use another input to the gSDE noise generation. For example, in fluid dynamics control, where we have clear structures in boundary layers, we may want to relate the exploration to these structures, and they are known to be captured quite well by either an SVD or autoencoders. I think that for our applications this would make more sense than using a somewhat arbitrary "second-to-last layer of the policy network". Of course that may be wrong, we would need to test, but getting a bit more flexibility there would be useful :) . If I understand well, in cases where action_noise is available, this is easily doable already though :) .

jerabaul29 commented 1 year ago

PPO samples actions from its policy; it does not add external exploration noise. But it is true that the PPO update looks like an off-policy corrected update.

Mmmh, but for PPO you do implement both SDE noise and Gaussian noise, right? These are indeed additional noise on top of the policy noise, or am I missing something?

A side note on this (sorry, a few ideas are starting to interweave here): I come from Tensorforce, and there I am quite sure (though @AlexKuhnle may correct me) that the random Gaussian exploration is added on top of the policy-spread exploration; I assumed / interpreted the stable-baselines3 API in the same way, apologies if I got this wrong. I think (but again, @AlexKuhnle should correct me if I am wrong) that in some discussions @AlexKuhnle and I had, it was explained to me that these on- vs off-policy distinctions are maybe not as clear cut as some say - in a sense, if you use a replay buffer and do not flush it at each update, then you automatically use some off-policy data at every update anyway, right? :)

araffin commented 1 year ago

Mmmh, but for PPO you do implement both SDE noise and Gaussian noise, right? These are indeed additional noise on top of the policy noise, or am I missing something? (Or is that what you mean by an off-policy corrected update? :) )

PPO samples, in both cases, from a Gaussian distribution (see the gSDE paper for which Gaussian; both are centered around the deterministic action).

jerabaul29 commented 1 year ago

Mmmh, but for PPO you do implement both SDE noise and Gaussian noise, right? These are indeed additional noise on top of the policy noise, or am I missing something? (Or is that what you mean by an off-policy corrected update? :) )

PPO samples, in both cases, from a Gaussian distribution (see the gSDE paper for which Gaussian; both are centered around the deterministic action).

I don't really think I see the point / conceptual difference. On- vs off-policy is about following the policy or not following it, right? (Though I agree that this distinction may be a bit artificial; I suppose it is possible to "mostly follow" the policy.) I wonder if I am confused, or if the terminology is used in a not fully consistent way.

If I understand well, the PPO policy typically outputs a multivariate Gaussian or Beta distribution, i.e. a set of (mu, sigma) [Gaussian] or (alpha, beta) [Beta]. Then, during training / exploration, these distributions are sampled according to their probability density function rather than using their mode, i.e. the exploration is intrinsic to the policy uncertainty. This is "on-policy" exploration that comes from following "on-policy" trajectories rather than "greedy" / "noisy" trajectories, if I understand the terminology well.
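
To illustrate the distinction I am describing (a minimal sketch with made-up numbers, not SB3 internals):

import torch
from torch.distributions import Normal

# Hypothetical Gaussian policy head output for a 2-dimensional action space.
mu = torch.tensor([0.3, -0.1])    # mean action (the "deterministic" action)
sigma = torch.tensor([0.2, 0.2])  # learned standard deviation

dist = Normal(mu, sigma)
exploratory_action = dist.sample()  # stochastic draw, used while collecting rollouts
deterministic_action = mu           # mode of the Gaussian, used at evaluation time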

However, in addition to this on-policy exploration, a random exploration noise N(0, sigma) can typically be added (at least, it is so in Tensorforce (TF); it is true though that I cannot find the corresponding parameter in the SB3 API, so maybe this is a peculiarity of TF and not possible in SB3? ...). This would, I suppose, actually make PPO with such random exploration effectively off-policy?

From reading https://arxiv.org/pdf/2005.05719v2.pdf and looking at eqn. (4) there, it looks like gSDE is "simply" a smart way to generate some random exploration noise that is "randomly dependent on the state" rather than "purely random", and hence has better time variation / autocorrelation properties than the random exploration noise that can be switched on in TF. Is that correct? I re-read the paper and double-checked that this understanding seems to be correct, but maybe I am misunderstanding things?

If this is indeed correct, then it is maybe one more indication that the whole on-policy vs off-policy distinction is not as clear cut / relevant as it seems, and that it could make sense to provide an action_noise hook on all agents?

araffin commented 1 year ago

I don't really think I see the point / conceptual difference. On- vs off-policy is about following the policy or not following it, right?

It's also about the theory behind it: the derivation of the update is different. It is true that there is a grey zone with PPO and its importance sampling, but the derivation is still quite different compared to DQN/SAC, for instance.

This is "on policy" exploration that comes from following "on policy" trajectories rather than "greedy" / "noisy" trajectories, if I understand well the terminology.

I would not call it "on-policy exploration" as SAC explores in the same way. However, if you take DDPG/TD3, there is no underlying distribution that you can sample from, so you need to add external noise.

it looks like gSDE is "simply" a smart way to generate some random exploration noise that is "randomly dependent on the state" rather than "purely random"

The main result of gSDE is Figure 2: gSDE achieves the same performance as unstructured noise while being "hardware friendly" (reducing wear and tear, shown as continuity cost there). gSDE generates exploration noise that changes smoothly over time (instead of white noise) while remaining compatible with policy gradient methods (because the distribution of the noise is known and the log-probability has an explicit form).

the whole on-policy vs off-policy distinction is not as clear cut / relevant as it seems, and that it could make sense to provide an action_noise hook on all agents?

What you could do is add a wrapper that adds noise to the action (noise the agent is not aware of), making the stochasticity part of the environment (if you do so, you would need to feed the previous action as input, so as not to break the Markov assumption). Note that in that case the additional noise will not be stored in the rollout buffer.
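
A minimal sketch of such a wrapper (my own illustration, assuming a Box observation/action space and the Gymnasium step API; the class name and noise scale are hypothetical):

import numpy as np
import gymnasium as gym

class NoisyActionWrapper(gym.Wrapper):
    """Add Gaussian noise to the action inside the environment and append the
    previously executed (noisy) action to the observation, so the extra
    stochasticity stays Markovian from the agent's point of view."""

    def __init__(self, env, noise_sigma=0.1):
        super().__init__(env)
        self.noise_sigma = noise_sigma
        # Extend the observation space with the previous action.
        low = np.concatenate([env.observation_space.low, env.action_space.low])
        high = np.concatenate([env.observation_space.high, env.action_space.high])
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)
        self._last_action = np.zeros(env.action_space.shape, dtype=np.float32)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_action = np.zeros(self.env.action_space.shape, dtype=np.float32)
        return self._augment(obs), info

    def step(self, action):
        # Add noise the agent never sees, then clip to the valid action range.
        noisy = action + np.random.normal(0.0, self.noise_sigma, size=np.shape(action))
        noisy = np.clip(noisy, self.env.action_space.low, self.env.action_space.high)
        self._last_action = noisy.astype(np.float32)
        obs, reward, terminated, truncated, info = self.env.step(self._last_action)
        return self._augment(obs), reward, terminated, truncated, info

    def _augment(self, obs):
        # Feed the previous (noisy) action back as part of the observation.
        return np.concatenate([np.asarray(obs, dtype=np.float32), self._last_action])

# Usage (hypothetical): the agent only ever sees the augmented observations.
# model = PPO("MlpPolicy", NoisyActionWrapper(gym.make("Pendulum-v1")))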

jerabaul29 commented 1 year ago

Thanks for the detailed answer. I would prefer not to add noise that the agent is not aware of through a wrapper - I believe this would likely increase the noise from the agent's point of view and make the learning unnecessarily harder.