Open · kachayev opened this issue 2 years ago
I agree this should be improved, as testing any distributions other than the vanilla ones is rather tricky right now. However, I do not feel the proposal of combining the distribution machinery into mlp_extractor would be good. The idea is that mlp_extractor transforms the state information into some useful latent, and then the distribution object transforms that latent into the values the distribution needs to create a probability distribution over actions. The mlp_extractor latent will also be used in value prediction.
How about, instead, you override the distribution by providing it as part of agent creation, which is then carried over to the make_proba_distribution function?
Something like the following (note: a rushed pre-vacation example, not very well hashed out; I hope I did not miss parts where this class would be needed ^^):
from typing import Tuple

import torch as th
from torch import nn
from torch.distributions import Categorical

from stable_baselines3.common.distributions import Distribution


class CustomCategoricalDistribution(Distribution):
    def __init__(self, action_dim: int):
        super(CustomCategoricalDistribution, self).__init__()
        # All initialization here should be done appropriately for the action space
        # (or there could be some parameters here); `action_dim` is one such example
        self.action_dim = action_dim

    def proba_distribution_net(self, latent_dim: int) -> nn.Module:
        # Return a layer (or layers) that transforms a vector of `latent_dim` to
        # whatever this distribution needs to parametrize the probability distribution
        action_logits = nn.Linear(latent_dim, self.action_dim)
        return action_logits

    def proba_distribution(self, action_logits: th.Tensor) -> "CustomCategoricalDistribution":
        self.distribution = Categorical(logits=action_logits)
        return self

    def log_prob(self, actions: th.Tensor) -> th.Tensor:
        return self.distribution.log_prob(actions)

    def entropy(self) -> th.Tensor:
        return self.distribution.entropy()

    def sample(self) -> th.Tensor:
        return self.distribution.sample()

    def mode(self) -> th.Tensor:
        return th.argmax(self.distribution.probs, dim=1)

    def actions_from_params(self, action_logits: th.Tensor, deterministic: bool = False) -> th.Tensor:
        # Update the proba distribution
        self.proba_distribution(action_logits)
        return self.get_actions(deterministic=deterministic)

    def log_prob_from_params(self, action_logits: th.Tensor) -> Tuple[th.Tensor, th.Tensor]:
        actions = self.actions_from_params(action_logits)
        log_prob = self.log_prob(actions)
        return actions, log_prob
# Param could be under `policy_kwargs`, too.
agent = PPO("MlpPolicy", env, custom_action_distribution_cls=CustomCategoricalDistribution)
The idea is that mlp_extractor transforms the state information into some useful latent and then the distribution object transforms that latent into values the distribution needs to create a probability distribution over actions.
Oh, okay, now it makes more sense. I was under the impression that Distribution merely conveys how the distribution itself works (sampling, log_prob, entropy). But since it is also responsible for converting the latent into parameters, having flexibility seems even more crucial.
I like the idea of new parameters action_distribution_cls (and maybe action_distribution_kwargs). It's consistent with how configuration for other parts of the flow works now, and it seems flexible enough. Let me draft some code in my repo to see it in action.
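Hypothetically, usage could then look like this (note: action_distribution_cls / action_distribution_kwargs do not exist in SB3 today; this is only a sketch of the proposed API):

import gymnasium as gym  # `import gym` on older SB3 versions
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
agent = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(
        action_distribution_cls=CustomCategoricalDistribution,
        action_distribution_kwargs=dict(action_dim=env.action_space.n),
    ),
)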
So I drafted the changes that are necessary, using ActorCriticPolicy as an example (I didn't look into other policies yet); see this commit. The code definitely looks much easier from the user's perspective, but it took more changes than I expected:
- self.action_dist creation
- _build has to deal with if/else branching over known spaces when creating self.action_net
- _get_action_dist_from_latent has similar if/else branching to decide on the set of parameters passed to proba_distribution (see the condensed sketch below)
- reset_noise should be extended as well, I assume
I left quite a few xxx comments describing questions about the more generic cases I'm working on right now, as well as some thoughts around the flow. I definitely need more time to play around with it (most likely after the holidays).
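For context, the branching in question looks roughly like this (a condensed paraphrase of the body of ActorCriticPolicy._get_action_dist_from_latent; the exact set of branches varies across SB3 versions):

import torch as th

from stable_baselines3.common.distributions import (
    CategoricalDistribution,
    DiagGaussianDistribution,
    Distribution,
    StateDependentNoiseDistribution,
)


def _get_action_dist_from_latent(self, latent_pi: th.Tensor) -> Distribution:
    mean_actions = self.action_net(latent_pi)
    if isinstance(self.action_dist, DiagGaussianDistribution):
        # Gaussian needs the learned log_std alongside the mean
        return self.action_dist.proba_distribution(mean_actions, self.log_std)
    elif isinstance(self.action_dist, CategoricalDistribution):
        # Categorical-style distributions take logits only
        return self.action_dist.proba_distribution(action_logits=mean_actions)
    elif isinstance(self.action_dist, StateDependentNoiseDistribution):
        # gSDE additionally needs the latent itself
        return self.action_dist.proba_distribution(mean_actions, self.log_std, latent_pi)
    else:
        raise ValueError("Invalid action distribution")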
On a tangential note: to redefine MlpExtractor with my own implementation, I need to override the "internal" _build_mlp_extractor function and assign a class member variable in it. I'm not sure I'm supposed to do so; it doesn't seem like the intended way of changing the model's behavior. I guess this question deserves its own issue, but I will do one thing at a time 😀
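For reference, the pattern in question looks like this (TinyExtractor is a hypothetical stand-in; a real extractor must expose latent_dim_pi/latent_dim_vf and return both latents from forward()):

from torch import nn

from stable_baselines3.common.policies import ActorCriticPolicy


class TinyExtractor(nn.Module):
    def __init__(self, feature_dim: int):
        super().__init__()
        self.latent_dim_pi = 64
        self.latent_dim_vf = 64
        self.policy_net = nn.Sequential(nn.Linear(feature_dim, 64), nn.Tanh())
        self.value_net = nn.Sequential(nn.Linear(feature_dim, 64), nn.Tanh())

    def forward(self, features):
        return self.policy_net(features), self.value_net(features)


class CustomActorCriticPolicy(ActorCriticPolicy):
    def _build_mlp_extractor(self) -> None:
        # Overriding the "internal" hook and assigning the member variable
        self.mlp_extractor = TinyExtractor(self.features_dim)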
Hello, sorry for the late reply (I'm on holidays, I will try to write a longer answer next week). I know that our current architecture is not really flexible (that notably allows us to ensure everything runs as expected), but I'm not sure we need to introduce a mechanism for new distributions, as it seems to be a very specific use case (I would encourage forking SB3 in that case). I'm open for discussion though, and a draft PR would definitely help to see how much complexity has to be added ;)
Hey @araffin,
I see, and I absolutely agree that the additional flexibility might not outweigh the complexity. The sketched implementation I linked earlier is a good showcase: it doesn't carry much complexity, but it does raise quite a few design questions/considerations that are not trivial to address while targeting both consistency and backward compatibility.
I'm glad to stash this conversation (and the work done); it still might be valuable for future design conversations.
I'm glad to stash this conversation (and work done), it still might be valuable for future design conversations.
I think I was not so clear, please open a draft PR, I will take a closer look when I have the time ;)
(besides that, I'm also afraid we would need to allow users to pass custom pre-processing)
Hi @araffin, is there any update? On my side, I want to change the logits when the distribution is called. Could you give me some advice? In a CustomCategoricalDistribution, I only make small changes like the following:

def proba_distribution_net(self, latent_dim: int) -> nn.Module:
    # Return a layer (or layers) that transforms a vector of `latent_dim` to
    # whatever this distribution needs to parametrize the probability distribution
    action_logits = nn.Linear(latent_dim, self.action_dim)
    action_logits = th.mul(action_logits, self.mask_matrix)
    return action_logits
I think @kachayev did not have time to do the PR.
action_logits = th.mul(action_logits, self.mask_matrix)
You should probably take a look at MaskablePPO in our contrib repo: https://sb3-contrib.readthedocs.io/en/master/modules/ppo_mask.html
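A minimal usage sketch along the lines of the sb3-contrib docs (the mask function is application-specific; valid_action_mask is a hypothetical helper on your env):

import numpy as np

from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker


def mask_fn(env) -> np.ndarray:
    # Return a boolean mask marking which actions are valid in the current state
    return env.valid_action_mask()  # hypothetical helper on your env


env = ActionMasker(env, mask_fn)  # wrap the underlying env
model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=5_000)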
@araffin Thanks for your reply, that's what I am looking for.
An alternative solution, from https://github.com/martius-lab/pink-noise-rl:

from pink import PinkNoiseDist  # import path as given in the linked repo
from stable_baselines3 import SAC

# Initialize agent
model = SAC("MlpPolicy", env)

# Set action noise (action_dim and seq_len come from the env / episode length)
model.actor.action_dist = PinkNoiseDist(action_dim, seq_len)

# Train agent
model.learn(total_timesteps=10_000)
🚀 Feature
The Distribution abstraction is defined here. But the way to introduce a new distribution to be used in the context of an existing policy implementation is not obvious. In a few places, SB3 processes distributions in a closed-form manner with respect to the enumerated implementations, e.g. here. It would be nice to have a hook for injecting a new distribution (or providing a different implementation of an existing one).
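For reference, a condensed paraphrase of that closed-form dispatch (the real make_proba_distribution; exact branches and error messages vary across SB3 versions):

from gymnasium import spaces

from stable_baselines3.common.distributions import (
    BernoulliDistribution,
    CategoricalDistribution,
    DiagGaussianDistribution,
    MultiCategoricalDistribution,
)
from stable_baselines3.common.preprocessing import get_action_dim


def make_proba_distribution(action_space, use_sde=False, dist_kwargs=None):
    dist_kwargs = dist_kwargs or {}
    if isinstance(action_space, spaces.Box):
        return DiagGaussianDistribution(get_action_dim(action_space), **dist_kwargs)
    elif isinstance(action_space, spaces.Discrete):
        return CategoricalDistribution(action_space.n, **dist_kwargs)
    elif isinstance(action_space, spaces.MultiDiscrete):
        return MultiCategoricalDistribution(list(action_space.nvec), **dist_kwargs)
    elif isinstance(action_space, spaces.MultiBinary):
        return BernoulliDistribution(action_space.n, **dist_kwargs)
    raise NotImplementedError(f"Probability distribution not implemented for {action_space}")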
Motivation
Here's my example. I have PPO with a spaces.MultiDiscrete action space, but I want (need, probably) an implementation of MultiCategoricalDistribution tailored to a very specific use case, ideally with as few changes to PPO and/or the existing ActorCriticPolicy implementation as possible. How I did it:
- Implement the new Distribution (this step is obvious).
- Replace self.action_dist after instantiation of the ActorCriticPolicy (subclass). This step feels ... weird, as __init__ already made an initialization, and reimplementing the entire __init__ seems too intrusive from an API perspective.
- Override _get_action_dist_from_latent(Tensor) -> Distribution (here), as the method itself raises ValueError on any unknown distribution (here). This step feels weird for two reasons: 1) methods that start with _ should ideally be treated as implementation details of the class, not as public API intended for changing behavior; 2) raising ValueError seems unnecessary if the probability distribution was already instantiated.
Note that I also got lucky, as the action space definition is known to the make_proba_distribution helper. Otherwise replacing self.action_dist afterwards wouldn't work.
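A condensed sketch of the workaround above (class names are hypothetical):

import torch as th

from stable_baselines3.common.distributions import MultiCategoricalDistribution
from stable_baselines3.common.policies import ActorCriticPolicy


class TailoredMultiCategorical(MultiCategoricalDistribution):
    pass  # step 1: the tailored distribution logic would live here


class PatchedPolicy(ActorCriticPolicy):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # step 2: replace the distribution that __init__ already created
        self.action_dist = TailoredMultiCategorical(list(self.action_space.nvec))

    def _get_action_dist_from_latent(self, latent_pi: th.Tensor):
        # step 3: override to avoid the ValueError on "unknown" distribution types
        action_logits = self.action_net(latent_pi)
        return self.action_dist.proba_distribution(action_logits=action_logits)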
Pitch
Honestly, I'm not sure what the best solution would be. It seems like, with the current implementation of ActorCriticPolicy, the entire notion of the "distribution of actions" is split between the policy itself and mlp_extractor: the policy decides which distribution to use, and mlp_extractor returns latent values that the policy then uses to parametrize the previously instantiated distribution. This makes them intertwined in a not very obvious way. It seems like we could simplify the flow by allowing mlp_extractor to return an already parametrized distribution, so the actor-critic policy only needs to use the already defined API of the distribution object (sample, log_prob, etc.). In this case, mlp_extractor carries both sides of the coin: the distribution plus its parametrization.
Would love to hear thoughts and feedback on this. Also, I'd be happy to work on the PR when/if the direction is clear.
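A hypothetical sketch of that flow (all names are made up; a real version would also have to fit SB3's policy plumbing):

import torch as th
from torch import nn
from torch.distributions import Categorical


class DistributionMlpExtractor(nn.Module):
    # The extractor returns a ready-made distribution plus the value latent,
    # so the policy only consumes the distribution API (sample/log_prob/entropy)
    def __init__(self, feature_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.policy_net = nn.Sequential(nn.Linear(feature_dim, hidden), nn.Tanh())
        self.value_net = nn.Sequential(nn.Linear(feature_dim, hidden), nn.Tanh())
        self.action_net = nn.Linear(hidden, action_dim)

    def forward(self, features: th.Tensor):
        latent_vf = self.value_net(features)
        dist = Categorical(logits=self.action_net(self.policy_net(features)))
        return dist, latent_vf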
Alternatives
TODO: fill this part in as the discussion goes.