DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Feature Request] API for using custom distributions with existing policies #703

Open kachayev opened 2 years ago

kachayev commented 2 years ago

🚀 Feature

The Distribution abstraction is defined here. However, the way to introduce a new distribution for use with an existing policy implementation is not obvious. In a few places, SB3 handles distributions in a closed manner, dispatching over the enumerated built-in implementations, e.g. here.
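
For reference, the dispatch looks roughly like this (a paraphrase, not the exact source, and details differ between SB3 versions; the point is that it hard-codes the enumerated built-in distributions):

# Paraphrase of the dispatch in stable_baselines3.common.distributions.make_proba_distribution;
# not the exact source.
from gym import spaces

from stable_baselines3.common.distributions import (
    BernoulliDistribution,
    CategoricalDistribution,
    DiagGaussianDistribution,
    Distribution,
    MultiCategoricalDistribution,
    StateDependentNoiseDistribution,
)
from stable_baselines3.common.preprocessing import get_action_dim


def make_proba_distribution(action_space, use_sde=False, dist_kwargs=None) -> Distribution:
    if dist_kwargs is None:
        dist_kwargs = {}
    if isinstance(action_space, spaces.Box):
        cls = StateDependentNoiseDistribution if use_sde else DiagGaussianDistribution
        return cls(get_action_dim(action_space), **dist_kwargs)
    elif isinstance(action_space, spaces.Discrete):
        return CategoricalDistribution(action_space.n, **dist_kwargs)
    elif isinstance(action_space, spaces.MultiDiscrete):
        return MultiCategoricalDistribution(action_space.nvec, **dist_kwargs)
    elif isinstance(action_space, spaces.MultiBinary):
        return BernoulliDistribution(action_space.n, **dist_kwargs)
    raise NotImplementedError(f"No default distribution for action space {type(action_space)}")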

It would be nice to have a hook for injecting a new distribution (or for providing a different implementation of an existing one).

Motivation

Here's my example: I have PPO with a spaces.MultiDiscrete action space, but I want (need, probably) an implementation of MultiCategoricalDistribution tailored to a very specific use case, ideally with as few changes as possible to PPO and/or the existing ActorCriticPolicy implementation. Here is how I did it:
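
A minimal sketch of the idea (the names below are placeholders rather than the exact code; CustomMultiCategoricalDistribution stands in for the tailored implementation and is assumed to subclass MultiCategoricalDistribution so the policy's isinstance dispatch still matches):

from stable_baselines3 import PPO

# `env` is assumed to be an existing gym environment with a spaces.MultiDiscrete action space.
model = PPO("MlpPolicy", env)

# make_proba_distribution already built a MultiCategoricalDistribution with the right
# action dimensions, so the existing action_net can be reused and only the distribution
# object itself is swapped out afterwards.
model.policy.action_dist = CustomMultiCategoricalDistribution(list(env.action_space.nvec))

model.learn(total_timesteps=10_000)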

Note that I also got lucky: the action space definition is known to the make_proba_distribution helper. Otherwise, replacing self.action_dist afterwards wouldn't work.

Pitch

Honestly, I'm not sure what the best solution would be. With the current implementation of ActorCriticPolicy, the entire notion of the "distribution of actions" is split between the policy itself and mlp_extractor: the policy decides which distribution to use, mlp_extractor returns latent values, and the policy then uses those latents to parametrize the previously instantiated distribution. This makes all of them intertwined in a not-very-obvious way. It seems like we could simplify the flow by allowing mlp_extractor to return an already-parametrized distribution, so the actor-critic policy only needs to use the existing API of the distribution object (sample, log_prob, etc.). In this case, mlp_extractor would carry both sides of the coin: the distribution and its parametrization.
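
Very roughly, the evaluation path inside the policy could then look like this (purely hypothetical, nothing like this exists in SB3 today):

# Hypothetical flow: mlp_extractor returns a ready-to-use, parametrized Distribution
# for the actor, plus the latent used for value prediction.
dist, latent_vf = self.mlp_extractor(features)
values = self.value_net(latent_vf)
actions = dist.get_actions(deterministic=deterministic)
log_prob = dist.log_prob(actions)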

I would love to hear thoughts and feedback on this. Also, I'd be happy to work on the PR when/if the direction is clear.

Alternatives

TODO: fill this part in as the discussion goes.

Checklist

Miffyli commented 2 years ago

I agree this should be improved, as testing distributions other than the vanilla ones is rather tricky right now. However, I do not feel that the proposal of folding the distribution logic into mlp_extractor would be good. The idea is that mlp_extractor transforms the state information into some useful latent, and then the distribution object transforms that latent into the values the distribution needs to create a probability distribution over actions. The mlp_extractor latent is also used for value prediction.
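
In other words, the current flow in ActorCriticPolicy is roughly the following (paraphrased for the categorical case; the real code dispatches on the distribution type):

# Paraphrased sketch of ActorCriticPolicy.forward(); not the exact SB3 source.
features = self.extract_features(obs)
latent_pi, latent_vf = self.mlp_extractor(features)   # state -> useful latents
values = self.value_net(latent_vf)                     # the latent is also used for values
action_logits = self.action_net(latent_pi)             # layer built by proba_distribution_net()
distribution = self.action_dist.proba_distribution(action_logits)
actions = distribution.get_actions(deterministic=deterministic)
log_prob = distribution.log_prob(actions)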

How about, instead, allowing you to override the distribution by providing it as part of agent creation, which is then carried over to the make_proba_distribution function?

Something like the following (note: a rushed pre-vacation example, not very well hashed out. I hope I did not miss parts where this class could be needed ^^):

from typing import Tuple

import torch as th
from torch import nn
from torch.distributions import Categorical

from stable_baselines3 import PPO
from stable_baselines3.common.distributions import Distribution


class CustomCategoricalDistribution(Distribution):
    def __init__(self, action_dim: int):
        super().__init__()
        # All initialization here should be done appropriately for the action space
        # (further parameters could be added here as needed)
        self.action_dim = action_dim

    def proba_distribution_net(self, latent_dim: int) -> nn.Module:
        # Return a layer (or layers) that transforms a vector of `latent_dim` into
        # whatever this distribution needs to parametrize the probability distribution
        action_logits = nn.Linear(latent_dim, self.action_dim)
        return action_logits

    def proba_distribution(self, action_logits: th.Tensor) -> "CustomCategoricalDistribution":
        self.distribution = Categorical(logits=action_logits)
        return self

    def log_prob(self, actions: th.Tensor) -> th.Tensor:
        return self.distribution.log_prob(actions)

    def entropy(self) -> th.Tensor:
        return self.distribution.entropy()

    def sample(self) -> th.Tensor:
        return self.distribution.sample()

    def mode(self) -> th.Tensor:
        return th.argmax(self.distribution.probs, dim=1)

    def actions_from_params(self, action_logits: th.Tensor, deterministic: bool = False) -> th.Tensor:
        # Update the proba distribution
        self.proba_distribution(action_logits)
        return self.get_actions(deterministic=deterministic)

    def log_prob_from_params(self, action_logits: th.Tensor) -> Tuple[th.Tensor, th.Tensor]:
        actions = self.actions_from_params(action_logits)
        log_prob = self.log_prob(actions)
        return actions, log_prob


# `custom_action_distribution_cls` is the proposed (not yet existing) parameter; it could be
# under `policy_kwargs`, too. `env` is assumed to be an existing Discrete-action gym env.
agent = PPO("MlpPolicy", env, custom_action_distribution_cls=CustomCategoricalDistribution)

kachayev commented 2 years ago

The idea is that mlp_extractor transforms the state information into some useful latent and then the distribution object transforms that latent into values the distribution needs to create a probability distribution over actions.

Oh, okay, now it makes more sense. I was under the impression that Distribution merely conveys how the distribution itself works (sampling, log_prob, entropy). Since it is also responsible for converting the latent into parameters, having that flexibility seems even more crucial.

I like the idea of new parameters action_distribution_cls (and maybe action_distribution_kwargs). It's consistent with how configuration for other parts of the flow works now, and it seems to be flexible enough. Let me draft some code in my repo to see it in action.
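
For example (hypothetical parameters, nothing here exists in SB3 yet):

# Hypothetical API: pass the distribution class and its kwargs via policy_kwargs.
agent = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(
        action_distribution_cls=CustomCategoricalDistribution,
        action_distribution_kwargs=dict(temperature=1.0),  # purely illustrative kwargs
    ),
)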

kachayev commented 2 years ago

So I drafted the changes that are necessary, using ActorCriticPolicy as an example (I didn't look into other policies yet); see this commit. The code definitely looks much simpler from the user's perspective, but it took more changes than I expected.

I left quite a few xxx commits describing questions about the more generic cases I'm working on right now, as well as some thoughts around the flow. I definitely need more time to play around with it (most likely after the holidays).

On a tangential note: to replace MlpExtractor with my own implementation, I need to override the "internal" _build_mlp_extractor function and assign the member variable inside it. I'm not sure I'm supposed to do that; it doesn't seem like the intended way of changing the behavior of the model. I guess this question deserves its own issue, but I'll do one thing at a time 😀
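
For reference, the override looks roughly like this (MyMlpExtractor is a placeholder for a custom extractor that exposes latent_dim_pi / latent_dim_vf and returns (latent_pi, latent_vf)):

from stable_baselines3.common.policies import ActorCriticPolicy


class MyPolicy(ActorCriticPolicy):
    def _build_mlp_extractor(self) -> None:
        # Assign the member variable that ActorCriticPolicy._build() expects.
        self.mlp_extractor = MyMlpExtractor(
            self.features_dim,
            net_arch=self.net_arch,
            activation_fn=self.activation_fn,
            device=self.device,
        )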

araffin commented 2 years ago

Hello, sorry for the late reply (I'm on holidays, I will try to write a longer answer next week). I know that our current architecture is not really flexible (which notably allows us to ensure everything runs as expected), but I'm not sure we need to introduce a mechanism for new distributions, as this seems to be a very specific use case (I would encourage forking SB3 in that case). I'm open to discussion though, and a draft PR would definitely help to see how much complexity has to be added ;)

kachayev commented 2 years ago

Hey @araffin,

I see, and I absolutely agree that the additional flexibility might not outweigh the complexity. The sketched implementation I linked earlier is a good showcase: it doesn't add much complexity, but it does raise quite a few design questions/considerations that are not trivial to address while targeting both consistency and backward compatibility.

I'm glad to stash this conversation (and work done), it still might be valuable for future design conversations.

araffin commented 2 years ago

I'm glad to stash this conversation (and work done), it still might be valuable for future design conversations.

I think I was not so clear, please open a draft PR, I will take a closer look when I have the time ;)

(besides that, I'm also afraid we would need to allow users to pass custom pre-processing)

Yangxiaojun1230 commented 2 years ago

I'm glad to stash this conversation (and work done), it still might be valuable for future design conversations.

I think I was not so clear, please open a draft PR, I will take a closer look when I have the time ;)

(besides that, I'm also afraid we would need to allow users to pass custom pre-processing)

Hi araffin, is there any update? From my side, I want to change the logits when calling the distribution. Could you give me some advice? In a CustomCategoricalDistribution, I would only make a small change, like the following:

def proba_distribution_net(self, latent_dim: int) -> nn.Module:
    # Return a layer (or layers) that transform a vector of `latent_dim` to
    # whatever this distribution needs to parametrize the probability distribution
    action_logits = nn.Linear(latent_dim, self.action_dim)
    action_logits = th.mul(action_logits, self.mask_matrix)
    return action_logits

araffin commented 2 years ago

I think @kachayev did not have time to do the PR.

action_logits = th.mul(action_logits, self.mask_matrix)

you should probably take a look at MaskablePPO in our contrib repo: https://sb3-contrib.readthedocs.io/en/master/modules/ppo_mask.html
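
Basic usage looks roughly like this (a sketch based on the sb3-contrib docs; how the env computes its valid-action mask is up to you, and valid_action_mask() below is just an assumed method on your env):

import numpy as np
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker


def mask_fn(env) -> np.ndarray:
    # Return a boolean array of shape (n_actions,) marking which actions
    # are valid in the current state.
    return env.valid_action_mask()


env = ActionMasker(env, mask_fn)  # wrap an existing discrete-action env
model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)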

Yangxiaojun1230 commented 2 years ago

@araffin Thanks for your reply, that's what I am looking for.

araffin commented 1 year ago

Alternative solution from https://github.com/martius-lab/pink-noise-rl

from pink import PinkNoiseDist  # pip install pink-noise-rl
from stable_baselines3 import SAC

# `env` is assumed to be an existing continuous-action gym environment;
# `action_dim` and `seq_len` are set by the user (see the pink-noise-rl README).

# Initialize agent
model = SAC("MlpPolicy", env)

# Set action noise
model.actor.action_dist = PinkNoiseDist(action_dim, seq_len)

# Train agent
model.learn(total_timesteps=10_000)