I managed to make it run without errors! :tada:
But since I haven't found a guide/demo nor a similar issue here, I'll briefly explain how I did it:
I wrote a custom policy (subclassing `ActorCriticPolicy`) and overrode its `forward`, `extract_features`, `evaluate_actions` and `predict_values` methods:

```python
from functools import partial
from typing import Callable, Dict, List, Optional, Tuple, Type, Union

import gym
import numpy as np
import torch as th
from torch import nn

from stable_baselines3.common.distributions import Distribution
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.preprocessing import preprocess_obs
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
from stable_baselines3.common.type_aliases import Schedule


class MyFeaturesExtractor(BaseFeaturesExtractor):
    ...


class MyMLPExtractor(nn.Module):
    # Not mandatory: in my case I also had a custom MLP extractor
    # because I had to use 2 different activation functions for the actor and the critic
    ...


class MyPolicy(ActorCriticPolicy):
    def __init__(
        self,
        observation_space: gym.spaces.Space,
        action_space: gym.spaces.Space,
        lr_schedule: Callable[[float], float],
        net_arch: Optional[List[Union[int, Dict[str, List[int]]]]] = None,
        activation_fn: Type[nn.Module] = nn.Tanh,
        *args,
        **kwargs,
    ):
        super(MyPolicy, self).__init__(
            observation_space,
            action_space,
            lr_schedule,
            net_arch,
            activation_fn,
            # Pass remaining arguments to base class
            *args,
            **kwargs,
        )

        # Non-shared features extractors for the actor and the critic
        self.policy_features_extractor = MyFeaturesExtractor(...)
        self.value_features_extractor = MyFeaturesExtractor(...)

        # If the features_dim of both features extractors are the same:
        self.features_dim = self.policy_features_extractor.features_dim
        # Otherwise:
        self.features_dim = {'pi': self.policy_features_extractor.features_dim,
                             'vf': self.value_features_extractor.features_dim}
        # NOTE: if the 2 features dims are different, your mlp_extractor must be able
        # to accept such a dict AND ALSO an int, because the mlp_extractor will first be
        # created with the wrong features_dim (coming from the wrong, default features extractor), which is an int.
        # Furthermore, note that with 2 different features dims the mlp_extractor cannot have shared layers.

        delattr(self, "features_extractor")  # remove the shared features extractor

        # Disable orthogonal initialization (if you want, otherwise comment this out)
        self.ortho_init = False

        # The super-constructor calls a '_build' method that creates the network and the optimizer.
        # The problem is that it does so using a default features extractor, and not the ones just created,
        # therefore we need to re-create the mlp_extractor and the optimizer
        # (that otherwise would have used obsolete features_dims and parameters).
        self._rebuild(lr_schedule)

    def _rebuild(self, lr_schedule: Schedule) -> None:
        """
        Re-creates the mlp_extractor and the optimizer for the model.

        :param lr_schedule: Learning rate schedule
            lr_schedule(1) is the initial learning rate
        """
        self._build_mlp_extractor()
        # action_net and value_net as created in the '_build' method are OK,
        # no need to recreate them.

        # Init weights: use orthogonal initialization
        # with small initial weight for the output
        if self.ortho_init:
            # TODO: check for features_extractor
            # Values from stable-baselines.
            # features_extractor/mlp values are
            # originally from openai/baselines (default gains/init_scales).
            module_gains = {
                self.policy_features_extractor: np.sqrt(2),
                self.value_features_extractor: np.sqrt(2),
                self.mlp_extractor: np.sqrt(2),
                self.action_net: 0.01,
                self.value_net: 1,
            }
            for module, gain in module_gains.items():
                module.apply(partial(self.init_weights, gain=gain))

        # Setup optimizer with initial learning rate
        self.optimizer = self.optimizer_class(self.parameters(), lr=lr_schedule(1), **self.optimizer_kwargs)

    def _build_mlp_extractor(self) -> None:
        self.mlp_extractor = MyMLPExtractor(...)

    def extract_features(self, obs: th.Tensor) -> Tuple[th.Tensor, th.Tensor]:
        """
        Preprocess the observation if needed and extract features.

        :param obs: Observation
        :return: the output of the features extractors
        """
        assert self.policy_features_extractor is not None and self.value_features_extractor is not None
        preprocessed_obs = preprocess_obs(obs, self.observation_space, normalize_images=self.normalize_images)
        policy_features = self.policy_features_extractor(preprocessed_obs)
        value_features = self.value_features_extractor(preprocessed_obs)
        return policy_features, value_features

    def forward(self, obs: th.Tensor, deterministic: bool = False) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:
        """
        Forward pass in all the networks (actor and critic).

        :param obs: Observation
        :param deterministic: Whether to sample or use deterministic actions
        :return: action, value and log probability of the action
        """
        # Preprocess the observation if needed
        policy_features, value_features = self.extract_features(obs)
        latent_pi = self.mlp_extractor.forward_actor(policy_features)
        latent_vf = self.mlp_extractor.forward_critic(value_features)
        # Evaluate the values for the given observations
        values = self.value_net(latent_vf)
        distribution = self._get_action_dist_from_latent(latent_pi)
        actions = distribution.get_actions(deterministic=deterministic)
        log_prob = distribution.log_prob(actions)
        return actions, values, log_prob

    def evaluate_actions(self, obs: th.Tensor, actions: th.Tensor) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:
        """
        Evaluate actions according to the current policy, given the observations.

        :param obs: Observation
        :param actions: Actions
        :return: estimated value, log likelihood of taking those actions
            and entropy of the action distribution.
        """
        # Preprocess the observation if needed
        policy_features, value_features = self.extract_features(obs)
        latent_pi = self.mlp_extractor.forward_actor(policy_features)
        latent_vf = self.mlp_extractor.forward_critic(value_features)
        distribution = self._get_action_dist_from_latent(latent_pi)
        log_prob = distribution.log_prob(actions)
        values = self.value_net(latent_vf)
        return values, log_prob, distribution.entropy()

    def get_distribution(self, obs: th.Tensor) -> Distribution:
        """
        Get the current policy distribution given the observations.

        :param obs: Observation
        :return: the action distribution
        """
        policy_features, _ = self.extract_features(obs)
        latent_pi = self.mlp_extractor.forward_actor(policy_features)
        return self._get_action_dist_from_latent(latent_pi)

    def predict_values(self, obs: th.Tensor) -> th.Tensor:
        """
        Get the estimated values according to the current policy given the observations.

        :param obs: Observation
        :return: the estimated values
        """
        _, value_features = self.extract_features(obs)
        latent_vf = self.mlp_extractor.forward_critic(value_features)
        return self.value_net(latent_vf)
```
Hope it can help someone!
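For completeness, here is roughly how the custom policy can then be plugged into the algorithm (just a minimal sketch: the env name is a placeholder and the `...` in the snippet above still need to be filled in):

```python
from stable_baselines3 import A2C

# SB3 accepts a policy class directly in place of the usual "MlpPolicy" string
model = A2C(MyPolicy, "CartPole-v1", verbose=1)
model.learn(total_timesteps=10_000)
```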
Hello, thanks for the contribution, I would be happy to receive a PR that adds a link in the doc to this issue ;)
(and for the future, allowing separate feature extractor for on-policy algorithms is probably a reasonable feature)
> I would be happy to receive a PR that adds a link in the doc to this issue ;)
Done in PR #1067 💪
Hello! Recently I also wanted to implement two independent feature-extraction networks and then connect each one to a different MLP extractor. Your code helped me a lot!
But I have a question: when the feature dimensions extracted by the two networks are different, how can I elegantly pass the two corresponding dimensions into `_build_mlp_extractor()` (to facilitate the subsequent network construction)?
At present, I have tried adding these two variables (e.g. "policy_feature_dim" and "value_feature_dim") directly in the initialization method of the MyPolicy class, thinking they could then easily be used in `_build_mlp_extractor()`, e.g. `self.mlp_extractor = MyMLPExtractor(self.policy_feature_dim, self.value_feature_dim)`. However, since `_build_mlp_extractor()` is called in advance inside `super(MyPolicy, self).__init__()`, an error is raised saying that the MyPolicy class does not have these two attributes. The only solution I can think of at the moment is to manually add the corresponding feature-dimension information to the kwargs, just like in issue #906. Is there a better way?
Looking forward to your reply, thank you very much!
@wlxer I think you could pass the dimensions as parameters to your policy network (not necessarily within `kwargs`, but explicitly). Then you "save" them in some attributes of the net and only then call the superclass' constructor. It is something that I actually do in my code, but I didn't report it previously because it was just a personal need.
You can do something a bit like this:
```python
class MyPolicy(ActorCriticPolicy):
    def __init__(
        self,
        observation_space: gym.spaces.Space,
        action_space: gym.spaces.Space,
        lr_schedule: Callable[[float], float],
        dim1: int,
        dim2: int,
        net_arch: Optional[List[Union[int, Dict[str, List[int]]]]] = None,
        activation_fn: Type[nn.Module] = nn.Tanh,
        *args,
        **kwargs,
    ):
        # Save the dims *before* calling the super-constructor, so that
        # _build_mlp_extractor (called inside it) can already use them
        self.dim1 = dim1
        self.dim2 = dim2
        super().__init__(
            observation_space,
            action_space,
            lr_schedule,
            net_arch,
            activation_fn,
            *args,
            **kwargs,
        )
        ...
```

(Note that the required `dim1`/`dim2` must come before the parameters that have default values, otherwise the signature is not valid Python.)
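For illustration, `_build_mlp_extractor` (still inside `MyPolicy`) could then simply read the saved dimensions. Just a sketch, assuming your `MyMLPExtractor` takes the two dims as constructor arguments:

```python
    def _build_mlp_extractor(self) -> None:
        # self.dim1 / self.dim2 were stored before calling the super-constructor,
        # so they are already available when '_build' calls this method
        self.mlp_extractor = MyMLPExtractor(self.dim1, self.dim2)
```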
Thank you very much for your prompt answer! This is indeed a good solution, because I actually need to pre-set different feature dimensions as part of the "MyPolicy" network's input. I will try to do that. Thanks for your help!
@wlxer it's actually similar to what is shown in #906, but the parameters are not in `kwargs`. This way you don't have to delete the corresponding entries from `kwargs` before passing it to the super-constructor, so I think it's a bit cleaner, if those parameters are actually required.
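For contrast, the `kwargs`-based route would look roughly like this (a hypothetical sketch, with made-up key names, just to show the extra bookkeeping it requires):

```python
class KwargsPolicy(ActorCriticPolicy):
    def __init__(self, *args, **kwargs):
        # The extra entries must be popped before calling the super-constructor,
        # otherwise ActorCriticPolicy.__init__ would receive unexpected keyword arguments
        self.dim1 = kwargs.pop("policy_feature_dim")
        self.dim2 = kwargs.pop("value_feature_dim")
        super().__init__(*args, **kwargs)
```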
> I will try to do that. Thanks for your help!
Let me know if it works! 😃
@araffin @wlxer actually I found a mistake in the snippet I wrote, so I corrected it (even more so because the documentation now links to that snippet).

Basically, the super-constructor of the custom policy (i.e., `ActorCriticPolicy`'s constructor) calls a `_build` method, which creates a features extractor (either a default one, or the `features_extractor_class` you pass; but passing one wouldn't make sense here, because it would be shared, and we're discussing how to have a non-shared features extractor).

This is not a problem per se, because we simply overwrite the features extractor after the call to the super-constructor, but the `features_dim` of the features extractor created in the super-constructor (actually in the `_build` method it calls) is used to create the `mlp_extractor`, and that `features_dim` is wrong. Furthermore, the `_build` method called by the super-constructor also creates the optimizer, which takes `self.parameters()` as argument, but those parameters include the ones of the "wrong" features extractor and of the "wrong" `mlp_extractor` ("wrong" because of the wrong `features_dim`), and do not include the parameters of the custom features extractors that are still to be created.

I added a `_rebuild` method to the custom policy, which does most of what `_build` does, but after the call to the super-constructor, therefore with the correct features extractors and the correct `features_dim`, overwriting the wrong `mlp_extractor` and the wrong optimizer.

@araffin it would be great if you could have a look at the updated snippet as a confirmation.

> (and for the future, allowing separate feature extractor for on-policy algorithms is probably a reasonable feature)

Furthermore, I'll work on a PR to make it possible to have non-shared features extractors in a simple way.
Hi, I've also been experimenting with non-shared features extractors and policies lately, and I noticed that your code has been merged into the latest branch, which is really awesome. I also have a small question about `self.ortho_init = False`: would False or True make any difference here? @AlexPasqua
Hello, I don't think it makes any difference in that sense (apart from the initialization). In fact, I think the current version of the code doesn't contain `self.ortho_init = False` anymore. That line came from the (old) guide on how to build a custom policy that was in the docs, which was a starting point for me to write that script.
Also, that script might be outdated now (or in any case it's no longer necessary to write all that code): everything is integrated in the library, so I'd look at what the docs say now.
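For example, something like this should be enough nowadays (a minimal sketch, assuming a recent SB3 version, ≥ 1.7, where the `share_features_extractor` policy kwarg exists):

```python
from stable_baselines3 import A2C

# Give the actor and the critic their own (non-shared) features extractor
model = A2C("MlpPolicy", "CartPole-v1", policy_kwargs=dict(share_features_extractor=False))
```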
> In fact, I think the current version of the code doesn't contain `self.ortho_init = False` anymore. That line came from the (old) guide on how to build a custom policy that was in the docs, which was a starting point for me to write https://github.com/DLR-RM/stable-baselines3/issues/1066#issuecomment-1246866844.

You are right, I found that the new version of SB3 already has `pi_features_extractor` and `vf_features_extractor`, and sets `ortho_init=True`, but I still want to know why it was set to False in this script at the time. And actually, even on top of the new SB3 library, if I want to process graph-type data (e.g. `Data` in PyG), I still need to inherit and rewrite the whole `ActorCriticPolicy`, because the features extractor, though separate, can only be generated by `make_features_extractor()`. I wonder if you have any good suggestions?
> You are right, I found that the new version of SB3 already has `pi_features_extractor` and `vf_features_extractor`, and sets `ortho_init=True`, but I still want to know why it was set to False in this script at the time.

As mentioned in my previous comment, I think that line was present in an old version of the SB3 documentation, in a guide on how to create a custom policy. I started from that guide and then made changes to have separate features extractors, but that line I simply copied over.

> And actually, even on top of the new SB3 library, if I want to process graph-type data (e.g. `Data` in PyG), I still need to inherit and rewrite the whole `ActorCriticPolicy`, because the features extractor, though separate, can only be generated by `make_features_extractor()`. I wonder if you have any good suggestions?
When the policy creates the features extractor(s) through `make_features_extractor`, it does the following:

```python
return self.features_extractor_class(self.observation_space, **self.features_extractor_kwargs)
```

So I guess you could create a class representing your features extractor working with graphs, and then pass that class as the `features_extractor_class` argument when creating the policy.
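A minimal sketch of that wiring (the class name is hypothetical, and the small MLP is only a stand-in for real graph layers; true graph observations such as PyG `Data` objects would still need an observation space that can hold them):

```python
import gym
import torch as th
from torch import nn

from stable_baselines3 import A2C
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class MyGraphFeaturesExtractor(BaseFeaturesExtractor):
    """Stand-in extractor: replace the MLP below with your graph layers."""

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 64):
        super().__init__(observation_space, features_dim)
        n_input = int(th.prod(th.as_tensor(observation_space.shape)))
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(n_input, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.net(observations)


model = A2C(
    "MlpPolicy",
    "CartPole-v1",
    policy_kwargs=dict(
        features_extractor_class=MyGraphFeaturesExtractor,
        features_extractor_kwargs=dict(features_dim=64),
    ),
)
```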
EDIT: fixed in https://github.com/DLR-RM/stable-baselines3/pull/1148
### Question
I've checked the docs (custom policy -> advanced example), but it is not clear to me how to create a custom policy without sharing the features extractor between the actor and the critic networks in on-policy algorithms.

If I pass a `features_extractor_class` in the `policy_kwargs`, it is shared by default, I think.

I can have a non-shared `mlp_extractor` by implementing my own `_build_mlp_extractor` method in my custom policy and creating a network with 2 distinct sub-networks (`self.policy_net` and `self.value_net`, sketched below), but I didn't understand how to do the same with the features extractor.

On the docs (custom policy -> custom features extractor), it says:

Therefore, since I'm using A2C, I think it should be possible to have a non-shared features extractor by implementing my own policy; I just didn't understand how to do it.

Thanks in advance for any clarification!
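For reference, the `_build_mlp_extractor` approach mentioned above could use a module like the following (a hypothetical sketch with made-up sizes, following the interface `ActorCriticPolicy` expects from its `mlp_extractor`):

```python
from typing import Tuple

import torch as th
from torch import nn


class MyMLPExtractor(nn.Module):
    """Two distinct sub-networks for the actor and the critic, e.g. with different activations."""

    def __init__(self, feature_dim: int, pi_dim: int = 64, vf_dim: int = 64):
        super().__init__()
        # ActorCriticPolicy reads these to size action_net / value_net
        self.latent_dim_pi = pi_dim
        self.latent_dim_vf = vf_dim
        self.policy_net = nn.Sequential(nn.Linear(feature_dim, pi_dim), nn.Tanh())
        self.value_net = nn.Sequential(nn.Linear(feature_dim, vf_dim), nn.ReLU())

    def forward(self, features: th.Tensor) -> Tuple[th.Tensor, th.Tensor]:
        return self.forward_actor(features), self.forward_critic(features)

    def forward_actor(self, features: th.Tensor) -> th.Tensor:
        return self.policy_net(features)

    def forward_critic(self, features: th.Tensor) -> th.Tensor:
        return self.value_net(features)
```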