DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Non-shared features extractor in on-policy algorithm #1066

Closed · AlexPasqua closed this issue 2 years ago

AlexPasqua commented 2 years ago

EDIT: fixed in https://github.com/DLR-RM/stable-baselines3/pull/1148

Question

I've checked the docs (custom policy -> advanced example), but it is not clear to me how to create a custom policy without sharing the features extractor between the actor and the critic networks in on-policy algorithms.

If I pass a features_extractor_class in policy_kwargs, I believe it is shared by default.

I can get a non-shared mlp_extractor by implementing my own _build_mlp_extractor method in my custom policy and creating a network with 2 distinct sub-networks (self.policy_net and self.value_net), but I didn't understand how to do the same for the features extractor.
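
To illustrate what I mean, here is a condensed sketch of that kind of non-shared mlp_extractor, roughly following the advanced example in the docs (the class name, sizes and activations are just placeholders):

from typing import Tuple

import torch as th
from torch import nn

class NonSharedMlpExtractor(nn.Module):
    # illustrative mlp_extractor with two distinct sub-networks and no shared layers
    def __init__(self, feature_dim: int, latent_dim_pi: int = 64, latent_dim_vf: int = 64):
        super().__init__()
        # ActorCriticPolicy reads these attributes to size action_net / value_net
        self.latent_dim_pi = latent_dim_pi
        self.latent_dim_vf = latent_dim_vf
        self.policy_net = nn.Sequential(nn.Linear(feature_dim, latent_dim_pi), nn.Tanh())
        self.value_net = nn.Sequential(nn.Linear(feature_dim, latent_dim_vf), nn.Tanh())

    def forward(self, features: th.Tensor) -> Tuple[th.Tensor, th.Tensor]:
        return self.forward_actor(features), self.forward_critic(features)

    def forward_actor(self, features: th.Tensor) -> th.Tensor:
        return self.policy_net(features)

    def forward_critic(self, features: th.Tensor) -> th.Tensor:
        return self.value_net(features)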

The docs (custom policy -> custom features extractor) say: [screenshot of the relevant note in the docs]. Therefore, since I'm using A2C, I think it should be possible to have a non-shared features extractor by implementing my own policy; I just didn't understand how to do it.

Thanks in advance for any clarification!


AlexPasqua commented 2 years ago

I managed to make it run without errors! 🎉

Since I haven't found a guide/demo or a similar issue here, I'll briefly explain how I did it:

  1. Create a custom policy (in my case, subclassing ActorCriticPolicy).
  2. In this custom policy, create 2 attributes: one for the actor's features extractor and one for the critic's.
  3. Override the super-policy's methods that use the features extractor so that they use yours. In my case these were forward, extract_features, evaluate_actions and predict_values.

Quick demo

# imports needed by the snippet below (SB3 1.x, gym)
from functools import partial
from typing import Callable, Dict, List, Optional, Tuple, Type, Union

import gym
import numpy as np
import torch as th
from torch import nn

from stable_baselines3.common.distributions import Distribution
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.preprocessing import preprocess_obs
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
from stable_baselines3.common.type_aliases import Schedule


class MyFeaturesExtractor(BaseFeaturesExtractor):
    ...

class MyMLPExtractor(nn.Module):
    # not mandatory: in my case I also had a custom MLP extractor,
    # because I had to use 2 different activation functions for the actor and the critic
    ...

class MyPolicy(ActorCriticPolicy):
    def __init__(
            self,
            observation_space: gym.spaces.Space,
            action_space: gym.spaces.Space,
            lr_schedule: Callable[[float], float],
            net_arch: Optional[List[Union[int, Dict[str, List[int]]]]] = None,
            activation_fn: Type[nn.Module] = nn.Tanh,
            *args,
            **kwargs,
    ):
        super(MyPolicy, self).__init__(
            observation_space,
            action_space,
            lr_schedule,
            net_arch,
            activation_fn,
            # Pass remaining arguments to base class
            *args,
            **kwargs,
        )

        # non-shared features extractors for the actor and the critic
        self.policy_features_extractor = MyFeaturesExtractor(...)
        self.value_features_extractor = MyFeaturesExtractor(...)

        # if the features_dim of both features extractors are the same:
        self.features_dim = self.policy_features_extractor.features_dim

        # otherwise:
        self.features_dim = {'pi': self.policy_features_extractor.features_dim,
                             'vf': self.value_features_extractor.features_dim}
        # NOTE: if the 2 features dims are different, your mlp_extractor must be able
        # to accept such a dict AND ALSO an int, because the mlp_extractor is first
        # created with a wrong features_dim (coming from the wrong, default features extractor), which is an int.
        # Furthermore, note that with 2 different features dims the mlp_extractor cannot have shared layers.

        delattr(self, "features_extractor")  # remove the shared features extractor

        # Disable orthogonal initialization (if you want, otherwise comment it)
        self.ortho_init = False

        # The super-constructor calls a '_build' method that creates the network and the optimizer.
        # The problem is that it does so using a default features extractor, and not the ones just created,
        # therefore we need to re-create the mlp_extractor and the optimizer
        # (that otherwise would have used obsolete features_dims and parameters).
        self._rebuild(lr_schedule)

    def _rebuild(self, lr_schedule: Schedule) -> None:
        """ Re-creates the mlp_extractor and the optimizer for the model.

        :param lr_schedule: Learning rate schedule
            lr_schedule(1) is the initial learning rate
        """
        self._build_mlp_extractor()

        # action_net and value_net as created in the '_build' method are OK,
        # no need to recreate them.

        # Init weights: use orthogonal initialization
        # with small initial weight for the output
        if self.ortho_init:
            # TODO: check for features_extractor
            # Values from stable-baselines.
            # features_extractor/mlp values are
            # originally from openai/baselines (default gains/init_scales).
            module_gains = {
                self.policy_features_extractor: np.sqrt(2),
                self.value_features_extractor: np.sqrt(2),
                self.mlp_extractor: np.sqrt(2),
                self.action_net: 0.01,
                self.value_net: 1,
            }
            for module, gain in module_gains.items():
                module.apply(partial(self.init_weights, gain=gain))

        # Setup optimizer with initial learning rate
        self.optimizer = self.optimizer_class(self.parameters(), lr=lr_schedule(1), **self.optimizer_kwargs)

    def _build_mlp_extractor(self) -> None:
        self.mlp_extractor = MyMLPExtractor(...)

    def extract_features(self, obs: th.Tensor) -> Tuple[th.Tensor, th.Tensor]:
        """
        Preprocess the observation if needed and extract features.

        :param obs: Observation
        :return: the output of the feature extractor(s)
        """
        assert self.policy_features_extractor is not None and self.value_features_extractor is not None
        preprocessed_obs = preprocess_obs(obs, self.observation_space, normalize_images=self.normalize_images)
        policy_features = self.policy_features_extractor(preprocessed_obs)
        value_features = self.value_features_extractor(preprocessed_obs)
        return policy_features, value_features

    def forward(self, obs: th.Tensor, deterministic: bool = False) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:
        """
        Forward pass in all the networks (actor and critic)

        :param obs: Observation
        :param deterministic: Whether to sample or use deterministic actions
        :return: action, value and log probability of the action
        """
        # Preprocess the observation if needed
        policy_features, value_features = self.extract_features(obs)
        latent_pi = self.mlp_extractor.forward_actor(policy_features)
        latent_vf = self.mlp_extractor.forward_critic(value_features)

        # Evaluate the values for the given observations
        values = self.value_net(latent_vf)
        distribution = self._get_action_dist_from_latent(latent_pi)
        actions = distribution.get_actions(deterministic=deterministic)
        log_prob = distribution.log_prob(actions)
        return actions, values, log_prob

    def evaluate_actions(self, obs: th.Tensor, actions: th.Tensor) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:
        """
        Evaluate actions according to the current policy,
        given the observations.

        :param obs: Observation
        :param actions: Actions
        :return: estimated value, log likelihood of taking those actions
            and entropy of the action distribution.
        """
        # Preprocess the observation if needed
        policy_features, value_features = self.extract_features(obs)
        latent_pi = self.mlp_extractor.forward_actor(policy_features)
        latent_vf = self.mlp_extractor.forward_critic(value_features)
        distribution = self._get_action_dist_from_latent(latent_pi)
        log_prob = distribution.log_prob(actions)
        values = self.value_net(latent_vf)
        return values, log_prob, distribution.entropy()

    def get_distribution(self, obs: th.Tensor) -> Distribution:
        """
        Get the current policy distribution given the observations.

        :param obs: Observation
        :return: the action distribution.
        """
        policy_features, _ = self.extract_features(obs)
        latent_pi = self.mlp_extractor.forward_actor(policy_features)
        return self._get_action_dist_from_latent(latent_pi)

    def predict_values(self, obs: th.Tensor) -> th.Tensor:
        """
        Get the estimated values according to the current policy given the observations.

        :param obs: Observation
        :return: the estimated values.
        """
        _, value_features = self.extract_features(obs)
        latent_vf = self.mlp_extractor.forward_critic(value_features)
        return self.value_net(latent_vf)
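
For completeness, a minimal usage sketch of the policy above (it assumes MyFeaturesExtractor and MyMLPExtractor are fully implemented; the environment is just a placeholder):

from stable_baselines3 import A2C

# MyPolicy is the custom policy defined above
model = A2C(MyPolicy, "CartPole-v1", verbose=1)
model.learn(total_timesteps=10_000)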

Hope it can help someone!

araffin commented 2 years ago

Hello, thanks for the contribution! I would be happy to receive a PR that adds a link in the docs to this issue ;)

(and for the future, allowing separate features extractors for on-policy algorithms is probably a reasonable feature)

AlexPasqua commented 2 years ago

I would be happy to receive a PR that adds a link in the docs to this issue ;)

Done in PR #1067 💪

wlxer commented 2 years ago

Hello! Recently, I've also wanted to implement two independent feature-extraction networks and then connect each to a different MLP extraction network. Your code helped me a lot!

But I have a question: when the feature dimensions extracted by the two networks are different, how can I elegantly pass the two corresponding dimensions to _build_mlp_extractor() (to facilitate the subsequent network construction)?

At present, I have tried adding these two variables (e.g. policy_feature_dim and value_feature_dim) directly in the __init__ method of the MyPolicy class, thinking they could then easily be used in _build_mlp_extractor(), e.g. self.mlp_extractor = MyMLPExtractor(self.policy_feature_dim, self.value_feature_dim). However, since _build_mlp_extractor() is called early, inside super(MyPolicy, self).__init__(), at that point the MyPolicy instance does not yet have these two attributes. One solution I can think of at the moment is to manually add the corresponding feature-dimension information to kwargs, just like in #906. Is there a better way?

Looking forward to your reply, thank you very much!

AlexPasqua commented 2 years ago

@wlxer I think you could pass the dimensions as parameters to your policy network (not necessarily within kwargs, but explicitly). Then you "save" them as attributes of the network, and only then call the superclass constructor. It is something I actually do in my code, but I didn't report it previously because it was just a personal need.

You can do something a bit like this:

class MyPolicy(ActorCriticPolicy):
    def __init__(
            self,
            observation_space: gym.spaces.Space,
            action_space: gym.spaces.Space,
            lr_schedule: Callable[[float], float],
            dim1: int,
            dim2: int,
            net_arch: Optional[List[Union[int, Dict[str, List[int]]]]] = None,
            activation_fn: Type[nn.Module] = nn.Tanh,
            *args,
            **kwargs,
    ):
        # save the dims BEFORE calling the super-constructor,
        # so that _build_mlp_extractor (called inside it) can already use them
        self.dim1 = dim1
        self.dim2 = dim2
        super().__init__(
            observation_space,
            action_space,
            lr_schedule,
            net_arch,
            activation_fn,
            *args,
            **kwargs,
        )
        ...
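
When instantiating the algorithm, those extra constructor arguments can then be supplied through policy_kwargs, e.g. (the values and the environment are just placeholders):

from stable_baselines3 import A2C

model = A2C(MyPolicy, "CartPole-v1", policy_kwargs=dict(dim1=16, dim2=32))
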
wlxer commented 2 years ago

@wlxer I think you could pass the dimensions as parameters to your policy network (not necessarily within kwargs, but explicitly). Then you "save" them as attributes of the network, and only then call the superclass constructor. It is something I actually do in my code, but I didn't report it previously because it was just a personal need.

Thank you very much for your prompt answer! This is indeed a good solution, because I actually need to pre-set different feature dimensions as part of the MyPolicy network's input. I will try to do that. Thanks for your help!

AlexPasqua commented 2 years ago

@wlxer it's actually similar to what is shown in #906, but the parameters are not in kwargs. This way you don't have to delete the corresponding entries from kwargs before passing them to the super-constructor, so I think it's a bit cleaner, provided those parameters are actually required.

I will try to do that. Thanks for your help!

Let me know if it works! 😃

AlexPasqua commented 2 years ago

@araffin @wlxer actually I found a mistake in the snippet I wrote, so I corrected it (all the more so because the documentation links to that snippet).

Quick description of the problem

Basically, the super-constructor of the custom policy (i.e., ActorCriticPolicy's constructor) calls a _build method, which creates a features extractor (it could be a default one, or you could pass a features_extractor_class, but that wouldn't make sense here because it would be shared, and we're discussing how to have a non-shared features extractor).

This is not a problem per se, because we simply overwrite the features extractor after the call to the super-constructor. However, the features_dim of the features extractor created in the super-constructor (actually in the _build method it calls) is used to create the mlp_extractor, and that features_dim is wrong. Furthermore, the _build method called by the super-constructor also creates the optimizer, which takes self.parameters() as argument; those parameters include the ones of the "wrong" features extractor and of the "wrong" mlp_extractor ("wrong" because of the wrong features_dim), and do not include the parameters of the custom features extractors that are still to be created.

The solution

I added a _rebuild method to the custom policy, which does most of what _build does, but after the call to the super-constructor, therefore with the correct features extractors and the correct features_dim, overwriting the wrong mlp_extractor and the wrong optimizer.


@araffin it would be great if you could have a look at the updated snippet to confirm it.

(and for the future, allowing separate features extractors for on-policy algorithms is probably a reasonable feature)

Furthermore, I'll work on a PR to make it possible to have non-shared features extractors in a simple way.

wlxer commented 1 year ago

Hi, I've also been experimenting with unshared features extractors and policies lately, and I noticed that your code has been merged into the latest branch, which is really awesome. I also have a small question about self.ortho_init = False: would False or True make any difference here? @AlexPasqua

AlexPasqua commented 1 year ago

Hi, I've also been experimenting with unshared features extractors and policies lately, and I noticed that your code has been merged into the latest branch, which is really awesome. I also have a small question about self.ortho_init = False: would False or True make any difference here? @AlexPasqua

Hello, I don't think it makes any difference in that sense (apart from the initialization). In fact, I think in the current version of the code there's no self.ortho_init = False anymore. I think that line came from the (old) guide on how to build a custom policy that was in the docs, which was the starting point for me to write that script.

Also, that script might be outdated by now (in any case, it's no longer necessary to write all that code): everything is integrated in the library, so I'd look at what the docs say now.

wlxer commented 1 year ago

In fact, I think in the current version of the code there's no self.ortho_init = False anymore. I think that line came from the (old) guide on how to build a custom policy that was in the docs, which was the starting point for me to write https://github.com/DLR-RM/stable-baselines3/issues/1066#issuecomment-1246866844.

You are right: I found that the new version of SB3 already has pi_features_extractor and vf_features_extractor, and sets ortho_init=True, but I still want to know why it was set to False in this script at the time. Also, even on top of the new SB3 library, if I want to process graph-type data (e.g. Data in PyG), I still need to inherit from and rewrite the whole ActorCriticPolicy, because the features extractor, though separate, can only be created by make_features_extractor(). I wonder if you have any good suggestions?

AlexPasqua commented 1 year ago

In fact, I think in the current version of the code there's no self.ortho_init = False anymore. I think that line came from the (old) guide on how to build a custom policy that was in the docs, which was the starting point for me to write #1066 (comment).

You are right: I found that the new version of SB3 already has pi_features_extractor and vf_features_extractor, and sets ortho_init=True, but I still want to know why it was set to False in this script at the time.

As mentioned in my previous comment, I think that line was present in an old version of the SB3 documentation, in a guide on how to create a custom policy. I started from that guide and then made changes to have separate features extractors, but that line I simply copied.

Also, even on top of the new SB3 library, if I want to process graph-type data (e.g. Data in PyG), I still need to inherit from and rewrite the whole ActorCriticPolicy, because the features extractor, though separate, can only be created by make_features_extractor(). I wonder if you have any good suggestions?

When the policy creates the features extractor(s) through make_features_extractor, it does the following:

return self.features_extractor_class(self.observation_space, **self.features_extractor_kwargs)

So I guess you could create a class implementing your features extractor for graphs, and then pass that class as the features_extractor_class argument when creating the policy.
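
For instance, a rough sketch (MyGraphFeaturesExtractor is a hypothetical BaseFeaturesExtractor subclass for your graph observations; share_features_extractor=False requires a recent SB3 version where the PR mentioned above is merged):

from stable_baselines3 import A2C

env = ...  # your environment with graph observations

model = A2C(
    "MlpPolicy",
    env,
    policy_kwargs=dict(
        features_extractor_class=MyGraphFeaturesExtractor,  # hypothetical custom class
        features_extractor_kwargs=dict(features_dim=64),  # forwarded to its constructor
        share_features_extractor=False,  # build separate pi/vf features extractors
    ),
)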