DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

Too many errors when customizing policy, a full example for Off-Policy Algorithms should be added in user guide #425

Closed: watermeloncq closed this issue 3 years ago

watermeloncq commented 3 years ago

I am trying to migrate my paper code to Stable Baselines3. The original code of my paper runs well, and in Stable Baselines3 my custom environment has passed check_env. In particular, I have found that most scholars and researchers are not aware of how important it is to customize the neural networks for complex tasks when using deep reinforcement learning, and I think the Stable Baselines3 user guide does not clearly tell users how to customize policy networks for off-policy algorithms. A full custom-policy example for off-policy algorithms should be shown in the user guide or in the example code, otherwise it will confuse users. To explain the problem more clearly, and in the hope that a full example will be added to the user guide, I paste all of my custom policy network code below.

 Describe the bug

I followed the Stable Baselines3 documentation to customize the policy network for the DDPG algorithm, but it always raises an error when the DDPG model is defined. If I remove the action_noise parameter, another error appears. The code is shown below (all of it follows the user guide):

 Code example

import gym
import numpy as np
import torch as th
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

env = myCustomEnv()  # the custom environment from the paper (it passes check_env)
class CustomCNN(BaseFeaturesExtractor):
    """
    :param observation_space: (gym.Space)
    :param features_dim: (int) Number of features extracted.
        This corresponds to the number of unit for the last layer.
    """

    def __init__(self, 
                 observation_space: gym.spaces.Box, 
                 features_dim: int = 256):
        super(CustomCNN, self).__init__(observation_space, features_dim)
        # We assume CxHxW images (channels first)
        # Re-ordering will be done by pre-preprocessing or wrapper       
        n_input_channels = observation_space.shape[0]
#         num_act = observation_space.shape[2]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 64, kernel_size=3, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2, 2, ceil_mode=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]

        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.linear(self.cnn(observations))

from typing import Callable, Dict, List, Optional, Tuple, Type, Union
from stable_baselines3.common.policies import ActorCriticPolicy

class CustomNetwork(nn.Module):
    """
    Custom network for policy and value function.
    It receives as input the features extracted by the feature extractor.

    :param feature_dim: dimension of the features extracted with the features_extractor (e.g. features from a CNN)
    :param last_layer_dim_pi: (int) number of units for the last layer of the policy network
    :param last_layer_dim_vf: (int) number of units for the last layer of the value network
    """

    def __init__(
        self,
        feature_dim: int,
        last_layer_dim_pi: int = 64,
        last_layer_dim_vf: int = 64,
    ):
        super(CustomNetwork, self).__init__()

        # IMPORTANT:
        # Save output dimensions, used to create the distributions
        self.latent_dim_pi = last_layer_dim_pi
        self.latent_dim_vf = last_layer_dim_vf
        # Policy network (nn.Dropout modules are added directly to the Sequential)
        self.policy_net = nn.Sequential(
            nn.Linear(feature_dim, 128),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(128, last_layer_dim_pi),
        )
        # Value network
        self.value_net = nn.Sequential(
            nn.Linear(feature_dim, 64),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(64, last_layer_dim_vf),
        )

    def forward(self, features: th.Tensor) -> Tuple[th.Tensor, th.Tensor]:
        """
        :return: (th.Tensor, th.Tensor) latent_policy, latent_value of the specified network.
            If all layers are shared, then ``latent_policy == latent_value``
        """
        return self.policy_net(features), self.value_net(features)

class CustomActorCriticPolicy(ActorCriticPolicy):
    def __init__(
        self,
        observation_space: gym.spaces.Space,
        action_space: gym.spaces.Space,
        lr_schedule: Callable[[float], float],
        net_arch: Optional[List[Union[int, Dict[str, List[int]]]]] = None,
        activation_fn: Type[nn.Module] = nn.Tanh,
        *args,
        **kwargs,
    ):

        super(CustomActorCriticPolicy, self).__init__(
            observation_space,
            action_space,
            lr_schedule,
            net_arch,
            activation_fn,
            # Pass remaining arguments to base class
            *args,
            **kwargs,
        )
        # Disable orthogonal initialization
        self.ortho_init = False

    def _build_mlp_extractor(self) -> None:
        self.mlp_extractor = CustomNetwork(self.features_dim,
                                           last_layer_dim_pi = env.action_space.shape[-1],
                                           last_layer_dim_vf = 1,
                                          )

from stable_baselines3 import PPO, DDPG, TD3
from stable_baselines3.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise

# The noise objects for DDPG
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.3 * np.ones(n_actions))

policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    features_extractor_kwargs=dict(features_dim=3072),
    action_noise=action_noise
)

model = DDPG(CustomActorCriticPolicy, env, policy_kwargs=policy_kwargs, verbose=1)

I got the following error:

Using cuda device
Wrapping the env with a Monitor wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-125df06d8445> in <module>
----> 1 model = DDPG(CustomActorCriticPolicy, env, policy_kwargs=policy_kwargs, verbose=1)

~/anaconda3/envs/sblines3/lib/python3.8/site-packages/stable_baselines3/ddpg/ddpg.py in __init__(self, policy, env, learning_rate, buffer_size, learning_starts, batch_size, tau, gamma, train_freq, gradient_steps, action_noise, optimize_memory_usage, tensorboard_log, create_eval_env, policy_kwargs, verbose, seed, device, _init_setup_model)
    105 
    106         if _init_setup_model:
--> 107             self._setup_model()
    108 
    109     def learn(

~/anaconda3/envs/sblines3/lib/python3.8/site-packages/stable_baselines3/td3/td3.py in _setup_model(self)
    116 
    117     def _setup_model(self) -> None:
--> 118         super(TD3, self)._setup_model()
    119         self._create_aliases()
    120 

~/anaconda3/envs/sblines3/lib/python3.8/site-packages/stable_baselines3/common/off_policy_algorithm.py in _setup_model(self)
    175             optimize_memory_usage=self.optimize_memory_usage,
    176         )
--> 177         self.policy = self.policy_class(  # pytype:disable=not-instantiable
    178             self.observation_space,
    179             self.action_space,

<ipython-input-15-a1b8672c075d> in __init__(self, observation_space, action_space, lr_schedule, net_arch, activation_fn, *args, **kwargs)
     68     ):
     69 
---> 70         super(CustomActorCriticPolicy, self).__init__(
     71             observation_space,
     72             action_space,

TypeError: __init__() got an unexpected keyword argument 'action_noise'


araffin commented 3 years ago

Hello, you are using the on-policy example, whereas DDPG is an off-policy algorithm. Depending on how complex your custom policy is, you can take a look at the off-policy example: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html#off-policy-algorithms and, for more customisation (for instance adding dropout), the best approach is to take a look at the code: https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/td3/policies.py#L85

I would also highly recommend reading more about DDPG, because the architecture you wrote here cannot work with it (the actor requires a tanh at the output, and the critic is a Q-value function that needs the action as input): https://spinningup.openai.com/en/latest/algorithms/ddpg.html
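
For illustration, a minimal sketch of those two constraints in plain PyTorch (arbitrary layer sizes, not SB3 classes):

import torch as th
import torch.nn as nn

feature_dim, action_dim = 256, 4  # arbitrary sizes for illustration

# Actor: maps state features to actions, squashed to [-1, 1] by a final tanh
actor = nn.Sequential(
    nn.Linear(feature_dim, 128),
    nn.ReLU(),
    nn.Linear(128, action_dim),
    nn.Tanh(),
)

# Critic: Q(s, a), so the action is concatenated to the state features
critic = nn.Sequential(
    nn.Linear(feature_dim + action_dim, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)

features = th.randn(8, feature_dim)                    # a batch of state features
actions = actor(features)                              # values in [-1, 1]
q_values = critic(th.cat([features, actions], dim=1))  # shape (8, 1)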

araffin commented 3 years ago

To answer your original request: giving an advanced customization example for every off-policy algorithm is not really possible, as each algorithm would need a different one. Also, customizing the network beyond what we present requires good knowledge of the algorithm, which is why I would favor reading the code in that case (we should probably add such a warning to the doc).

watermeloncq commented 3 years ago

Nothing is special about the DDPG algorithm that I used. At the beginning of the user guide for custom policies, it does not say that the example shown on that page is only for on-policy algorithms. I first defined a class named CustomCNN to extract features from the observation, then I defined the CustomNetwork class and the CustomActorCriticPolicy class following the instructions in the user guide. However, when I followed the guide to define the DDPG model, the errors happened. You can see that all the code I showed above conforms to the user guide on customizing policy networks.

If what you said is true (giving an advanced customization example for any off-policy algorithm is not really possible), I suggest that the instructions on customizing policy networks for off-policy algorithms be removed from the guide page; otherwise, they will confuse users.

Thank you for your advice and honest answer, best wishes.

araffin commented 3 years ago

I first defined a class named CustomCNN to extract features from the observation [...] However, when I followed the guide to define the DDPG model [...]

The custom feature extractor works the same way for on- and off-policy algorithms, which is why there is no specific warning; however, the advanced CustomActorCriticPolicy code is under the "On-Policy Algorithms" section.
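
For illustration, a minimal sketch of that, reusing the CustomCNN, env, and action_noise from the issue above (note that action_noise is an argument of the algorithm itself, not of the policy):

from stable_baselines3 import DDPG

# Sketch: the same custom features extractor, passed to the off-policy DDPG
# via policy_kwargs; action_noise goes to the algorithm constructor.
policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    features_extractor_kwargs=dict(features_dim=256),
)
model = DDPG("CnnPolicy", env, action_noise=action_noise,
             policy_kwargs=policy_kwargs, verbose=1)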

watermeloncq commented 3 years ago

Depending on how complex your custom policy is, you can take a look at the off-policy example: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html#off-policy-algorithms

The custom feature extractor works the same way for on- and off-policy algorithms, which is why there is no specific warning; however, the advanced CustomActorCriticPolicy code is under the "On-Policy Algorithms" section.

Behind the custom feature extractor, there should be a fully-connected network for both on- and off-policy algorithms. According to the user guide, we can define a dropout layer in the CustomNetwork class, but that only works for on-policy algorithms; for off-policy algorithms, only a size specification such as dict(qf=[400, 300], pi=[64, 64]) can be used, and a dropout layer cannot be defined that way. So, if I want to add a dropout layer to the fully-connected network, how do I do it for off-policy algorithms?
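
For reference, a rough sketch of the net_arch route mentioned above, which only sets layer sizes (env here stands for any environment):

from stable_baselines3 import TD3

# Sketch of the documented off-policy net_arch customization: it only
# controls the layer sizes (a 400-300 critic and a 64-64 actor), so a layer
# such as Dropout cannot be inserted this way.
policy_kwargs = dict(net_arch=dict(qf=[400, 300], pi=[64, 64]))
model = TD3("MlpPolicy", env, policy_kwargs=policy_kwargs, verbose=1)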

araffin commented 3 years ago

So, if I want to add a dropout layer to the fully-connected network, how do I do it for off-policy algorithms?

As I wrote before, for that you would need to take a look at the code. For instance, for DDPG/TD3, it would look like this:

from typing import List, Optional, Tuple, Type

import gym
import torch as th
import torch.nn as nn

from stable_baselines3.common.policies import BaseModel
from stable_baselines3.common.preprocessing import get_action_dim
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor, create_mlp
from stable_baselines3.td3.policies import Actor, TD3Policy

class CustomActor(Actor):
    """
    Actor network (policy) for TD3.
    """
    def __init__(self, *args, **kwargs):
        super(CustomActor, self).__init__(*args, **kwargs)
        # Define custom network with Dropout
        # WARNING: it must end with a tanh activation to squash the output
        self.mu = nn.Sequential(...)

class CustomContinuousCritic(BaseModel):
    """
    Critic network(s) for DDPG/SAC/TD3.
    """
    def __init__(
        self,
        observation_space: gym.spaces.Space,
        action_space: gym.spaces.Space,
        net_arch: List[int],
        features_extractor: nn.Module,
        features_dim: int,
        activation_fn: Type[nn.Module] = nn.ReLU,
        normalize_images: bool = True,
        n_critics: int = 2,
        share_features_extractor: bool = True,
    ):
        super().__init__(
            observation_space,
            action_space,
            features_extractor=features_extractor,
            normalize_images=normalize_images,
        )

        action_dim = get_action_dim(self.action_space)

        self.share_features_extractor = share_features_extractor
        self.n_critics = n_critics
        self.q_networks = []
        for idx in range(n_critics):
            # q_net = create_mlp(features_dim + action_dim, 1, net_arch, activation_fn)
            # Define critic with Dropout here
            q_net = nn.Sequential(...)
            self.add_module(f"qf{idx}", q_net)
            self.q_networks.append(q_net)

    def forward(self, obs: th.Tensor, actions: th.Tensor) -> Tuple[th.Tensor, ...]:
        # Learn the features extractor using the policy loss only
        # when the features_extractor is shared with the actor
        with th.set_grad_enabled(not self.share_features_extractor):
            features = self.extract_features(obs)
        qvalue_input = th.cat([features, actions], dim=1)
        return tuple(q_net(qvalue_input) for q_net in self.q_networks)

    def q1_forward(self, obs: th.Tensor, actions: th.Tensor) -> th.Tensor:
        """
        Only predict the Q-value using the first network.
        This allows to reduce computation when all the estimates are not needed
        (e.g. when updating the policy in TD3).
        """
        with th.no_grad():
            features = self.extract_features(obs)
        return self.q_networks[0](th.cat([features, actions], dim=1))

class CustomTD3Policy(TD3Policy):
    def __init__(self, *args, **kwargs):
        super(CustomTD3Policy, self).__init__(*args, **kwargs)

    def make_actor(self, features_extractor: Optional[BaseFeaturesExtractor] = None) -> CustomActor:
        actor_kwargs = self._update_features_extractor(self.actor_kwargs, features_extractor)
        return CustomActor(**actor_kwargs).to(self.device)

    def make_critic(self, features_extractor: Optional[BaseFeaturesExtractor] = None) -> CustomContinuousCritic:
        critic_kwargs = self._update_features_extractor(self.critic_kwargs, features_extractor)
        return CustomContinuousCritic(**critic_kwargs).to(self.device)

# To register a policy, so you can use a string to create the network
# TD3.policy_aliases["CustomTD3Policy"] = CustomTD3Policy

You can find a complete example for SAC here: https://github.com/DLR-RM/rl-baselines3-zoo/blob/feat/densemlp/utils/networks.py (EDIT by @qgallouedec: you have to replace register_policy, which no longer exists, with the lines given here: https://github.com/DLR-RM/stable-baselines3/issues/1126#issuecomment-1282280749, as shown in the code above)
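
A minimal usage sketch for the classes above, assuming the env, action_noise, and CustomCNN defined earlier in this issue; the key point, relative to the original error, is that action_noise is passed to the algorithm rather than inside policy_kwargs:

from stable_baselines3 import TD3

# Sketch: pass the custom policy class directly (a string alias also works
# once the policy is registered, see the comment above). action_noise is an
# argument of the algorithm constructor, not of the policy.
model = TD3(
    CustomTD3Policy,
    env,
    action_noise=action_noise,
    policy_kwargs=dict(
        features_extractor_class=CustomCNN,  # features extractor from the issue
        features_extractor_kwargs=dict(features_dim=256),
    ),
    verbose=1,
)
model.learn(total_timesteps=10_000)

DDPG should accept the same policy class, since it reuses the TD3 policies.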

watermeloncq commented 3 years ago

Thank you for explaining it in detail, I will try it this weekend.

Bi0max commented 3 years ago

Hello @araffin, is it possible to use Softmax instead of Tanh as the output layer of the policy in the TD3 (or SAC) algorithm? All values in my action vector must be between 0 and 1, and they should sum to 1 (exactly what Softmax would give me).

elamadu commented 2 years ago

Hello @araffin, if I add a feature extractor network, will the feature extractor also be trained (will the weights of the feature extractor network be updated every time the actor and critic networks are updated)?

araffin commented 2 years ago

will the feature extractor also be trained (will the weights of the feature extractor network be updated every time the actor and critic networks are updated)?

Yes, it will. For A2C/PPO, there is a shared features extractor that is trained with both the policy and the value (critic) loss; for SAC/TD3, if you use the latest SB3 version (and you should ;)), there are independent copies of the features extractor, one for the actor and one for the critic.

EDIT: if you want to freeze the features extractor, that is also possible, but in that case I would recommend using a wrapper (a gym wrapper or a VecEnvWrapper) to save memory and computation.
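
For instance, a rough sketch of the wrapper approach with a frozen, pre-trained encoder (the wrapper class and the encoder argument are hypothetical, not part of SB3):

import gym
import numpy as np
import torch as th

class FrozenEncoderWrapper(gym.ObservationWrapper):
    """Apply a frozen, pre-trained encoder to the raw observation so that
    the agent only ever sees the (fixed) features."""

    def __init__(self, env, encoder: th.nn.Module, features_dim: int):
        super().__init__(env)
        self.encoder = encoder.eval()
        for param in self.encoder.parameters():
            param.requires_grad = False
        # Observations exposed to the agent are now feature vectors
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(features_dim,), dtype=np.float32
        )

    def observation(self, obs):
        with th.no_grad():
            features = self.encoder(th.as_tensor(obs[None]).float())
        return features.squeeze(0).numpy().astype(np.float32)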