Closed: watermeloncq closed this issue 3 years ago
Hello, you are using the on-policy example, whereas DDPG is off-policy. Depending on how complex your custom policy is, you can take a look at the off-policy example: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html#off-policy-algorithms and, for more customisation (for instance adding dropout), it is best to take a look at the code: https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/td3/policies.py#L85
I would also highly recommend that you read more about DDPG, because the architecture you wrote here cannot work with it (the actor requires a tanh at the output and the critic is a Q-value function that needs the action as input): https://spinningup.openai.com/en/latest/algorithms/ddpg.html
To answer your original request, giving an advanced customization example for any off-policy algorithm is not really possible, as each algorithm would need a different one. Also, customizing the network beyond what we present requires good knowledge of the algorithm, which is why I would favor reading the code in that case (we should probably add such a warning to the doc).
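The two DDPG requirements mentioned in this reply (a tanh on the actor output, and a critic that takes the action as input) can be illustrated with a minimal plain-PyTorch sketch. This is not SB3 code: the sizes `OBS_DIM` and `ACTION_DIM`, the layer widths, and the `QNetwork` class are assumptions made up for the illustration.

```python
import torch as th
from torch import nn

OBS_DIM, ACTION_DIM = 8, 2  # hypothetical sizes, for illustration only

# Actor: maps an observation to an action; the final Tanh squashes it to [-1, 1]
actor = nn.Sequential(
    nn.Linear(OBS_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, ACTION_DIM),
    nn.Tanh(),
)


class QNetwork(nn.Module):
    """Critic: estimates Q(s, a), so it receives the observation AND the action."""

    def __init__(self, obs_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, obs: th.Tensor, actions: th.Tensor) -> th.Tensor:
        # Concatenate observation and action along the feature dimension
        return self.net(th.cat([obs, actions], dim=1))


critic = QNetwork(OBS_DIM, ACTION_DIM)
obs = th.randn(4, OBS_DIM)       # batch of 4 random observations
actions = actor(obs)             # shape (4, ACTION_DIM), values in [-1, 1]
q_values = critic(obs, actions)  # shape (4, 1)
```

An on-policy actor-critic network, by contrast, typically outputs a state value from the observation alone, which is why the on-policy example does not carry over to DDPG.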
Nothing is special about the DDPG algorithm that I used. At the beginning of the user guide for custom policies, it does not say that the example shown on the page is only for on-policy algorithms. I first defined a class named CustomCNN to extract features from the observation, then I defined the CustomNetwork and CustomActorCriticPolicy classes following the instructions in the user guide. However, when I followed the guide to define the DDPG model, the errors occurred. You can see that all the code I showed above conforms to the user guide for customizing policy networks.
If what you said is true (giving an advanced customization example for any off-policy algorithm is not really possible), I suggest that the instructions for customizing policy networks for off-policy algorithms be removed from the guide page; otherwise, they will confuse users.
Thank you for your advice and honest answer, best wishes.
> I first defined a class named CustomCNN to extract features from the observation, [...] However, when I followed the guide to define the DDPG model,
The custom feature extractor works the same for on- and off-policy algorithms, which is why there is no specific warning; however, the advanced CustomActorCriticPolicy code is under the "On-Policy Algorithms" section.
> Depending on how complex your custom policy is, you can take a look at the off-policy example: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html#off-policy-algorithms
> The custom feature extractor works the same for on- and off-policy algorithms, which is why there is no specific warning; however, the advanced CustomActorCriticPolicy code is under the "On-Policy Algorithms" section.
After the custom feature extractor, there should be a fully-connected network in both on- and off-policy algorithms. According to the user guide, we can define a dropout layer in the CustomNetwork class, which only works for on-policy algorithms; for off-policy algorithms, only a specification such as 'dict(qf=[400, 300], pi=[64, 64])' can be used, in which a dropout layer cannot be defined. So, if I want to define a dropout layer in the fully-connected network, how do I do it for off-policy algorithms?
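For reference, the dict form mentioned here is the one from the off-policy section of the user guide: it only sets layer widths for the actor (`pi`) and the critic (`qf`), and is passed through `policy_kwargs`. A minimal sketch follows; the commented-out `TD3` call assumes `stable_baselines3` is installed and that `env` is an existing environment, neither of which is set up in this snippet.

```python
# Layer widths only: two hidden layers for the actor, two for each critic
policy_kwargs = dict(net_arch=dict(pi=[64, 64], qf=[400, 300]))

# Assumed usage (requires stable_baselines3 and an environment `env`):
# from stable_baselines3 import TD3
# model = TD3("MlpPolicy", env, policy_kwargs=policy_kwargs)
```

Because this specification describes widths only, layers such as dropout need a custom policy class instead.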
> So, if I want to define a dropout layer in the fully-connected network, how do I do it for off-policy algorithms?
As I wrote before, for that you would need to take a look at the code. For instance, for DDPG/TD3, it would look like this:
from typing import List, Optional, Tuple, Type

import gym  # with SB3 >= 2.0, use `import gymnasium as gym` instead
import torch as th
from torch import nn

from stable_baselines3 import TD3
from stable_baselines3.common.policies import BaseModel
from stable_baselines3.common.preprocessing import get_action_dim
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
from stable_baselines3.td3.policies import Actor, TD3Policy


class CustomActor(Actor):
    """
    Actor network (policy) for TD3.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Define custom network with Dropout
        # WARNING: it must end with a tanh activation to squash the output
        self.mu = nn.Sequential(...)


class CustomContinuousCritic(BaseModel):
    """
    Critic network(s) for DDPG/SAC/TD3.
    """

    def __init__(
        self,
        observation_space: gym.spaces.Space,
        action_space: gym.spaces.Space,
        net_arch: List[int],
        features_extractor: nn.Module,
        features_dim: int,
        activation_fn: Type[nn.Module] = nn.ReLU,
        normalize_images: bool = True,
        n_critics: int = 2,
        share_features_extractor: bool = True,
    ):
        super().__init__(
            observation_space,
            action_space,
            features_extractor=features_extractor,
            normalize_images=normalize_images,
        )

        action_dim = get_action_dim(self.action_space)

        self.share_features_extractor = share_features_extractor
        self.n_critics = n_critics
        self.q_networks = []
        for idx in range(n_critics):
            # q_net = create_mlp(features_dim + action_dim, 1, net_arch, activation_fn)
            # Define critic with Dropout here
            q_net = nn.Sequential(...)
            # Register the network as a submodule so its parameters are tracked
            self.add_module(f"qf{idx}", q_net)
            self.q_networks.append(q_net)

    def forward(self, obs: th.Tensor, actions: th.Tensor) -> Tuple[th.Tensor, ...]:
        # Learn the features extractor using the policy loss only
        # when the features_extractor is shared with the actor
        with th.set_grad_enabled(not self.share_features_extractor):
            features = self.extract_features(obs)
        qvalue_input = th.cat([features, actions], dim=1)
        return tuple(q_net(qvalue_input) for q_net in self.q_networks)

    def q1_forward(self, obs: th.Tensor, actions: th.Tensor) -> th.Tensor:
        """
        Only predict the Q-value using the first network.
        This allows to reduce computation when all the estimates are not needed
        (e.g. when updating the policy in TD3).
        """
        with th.no_grad():
            features = self.extract_features(obs)
        return self.q_networks[0](th.cat([features, actions], dim=1))


class CustomTD3Policy(TD3Policy):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def make_actor(self, features_extractor: Optional[BaseFeaturesExtractor] = None) -> CustomActor:
        actor_kwargs = self._update_features_extractor(self.actor_kwargs, features_extractor)
        return CustomActor(**actor_kwargs).to(self.device)

    def make_critic(self, features_extractor: Optional[BaseFeaturesExtractor] = None) -> CustomContinuousCritic:
        critic_kwargs = self._update_features_extractor(self.critic_kwargs, features_extractor)
        return CustomContinuousCritic(**critic_kwargs).to(self.device)


# To register a policy, so you can use a string to create the network
# TD3.policy_aliases["CustomTD3Policy"] = CustomTD3Policy
You can find a complete example for SAC here: https://github.com/DLR-RM/rl-baselines3-zoo/blob/feat/densemlp/utils/networks.py (EDIT by @qgallouedec: you have to replace register_policy (that no longer exists) by the lines given here https://github.com/DLR-RM/stable-baselines3/issues/1126#issuecomment-1282280749, as shown in the code above)
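The `nn.Sequential(...)` placeholders in the code above are left for the reader to fill in. One possible way to build such a network with dropout is sketched below in plain PyTorch. The helper name `mlp_with_dropout`, its parameters, the dropout rate, and the dimensions in the usage lines are all assumptions for illustration, not part of SB3.

```python
from typing import List

import torch as th
from torch import nn


def mlp_with_dropout(
    input_dim: int,
    output_dim: int,
    net_arch: List[int],
    dropout: float = 0.1,
    squash_output: bool = False,
) -> nn.Sequential:
    """Hypothetical helper: MLP with Dropout after each hidden activation."""
    layers: List[nn.Module] = []
    last_dim = input_dim
    for hidden_dim in net_arch:
        layers += [nn.Linear(last_dim, hidden_dim), nn.ReLU(), nn.Dropout(p=dropout)]
        last_dim = hidden_dim
    layers.append(nn.Linear(last_dim, output_dim))
    if squash_output:
        # Actor output must be squashed with tanh for DDPG/TD3
        layers.append(nn.Tanh())
    return nn.Sequential(*layers)


FEATURES_DIM, ACTION_DIM = 32, 2  # hypothetical sizes

# Actor head: features -> action in [-1, 1]
mu = mlp_with_dropout(FEATURES_DIM, ACTION_DIM, [400, 300], squash_output=True)
# Critic: (features, action) -> scalar Q-value, no squashing
q_net = mlp_with_dropout(FEATURES_DIM + ACTION_DIM, 1, [400, 300])

features = th.randn(5, FEATURES_DIM)
action = mu(features)
q_value = q_net(th.cat([features, action], dim=1))
```

In the custom classes, the same helper would be called with the real feature and action dimensions in place of the hypothetical constants. Note that `nn.Dropout` is only active in training mode; calling `.eval()` on the module disables it.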
Thank you for explaining it in detail, I will try it this weekend.
Hello @araffin, is it possible to use Softmax instead of Tanh as the output layer of the policy in the TD3 (or SAC) algorithm? All values in my action vector must be between 0 and 1, and they should sum to 1 (exactly what Softmax would give me).
Hello @araffin, if I add a feature extractor network, will the feature extractor layers also be trained (will the weights of the feature extractor network be updated every time the actor and critic networks are updated)?
> will the feature extractor layers also be trained (will the weights of the feature extractor network be updated every time the actor and critic networks are updated)?
Yes, it will. For A2C/PPO, there will be a shared features extractor that is trained with both the policy and critic losses; for SAC/TD3, if you use the latest SB3 version (and you should ;)), there will be independent copies of the features extractor, one for the actor and one for the critic.
EDIT: if you want to freeze the features extractor, it is also possible, but I would recommend using a wrapper (gym wrapper or VecEnvWrapper) in that case to save memory and computation.
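The idea behind the wrapper suggestion can be sketched in plain PyTorch: a frozen extractor is applied to observations once, outside the agent, so no gradients or extra copies are needed. Everything here is an assumption for illustration: the `extractor` network stands in for a pretrained model, and `preprocess` is a hypothetical function whose body would live in the `observation` method of a `gym.ObservationWrapper` subclass.

```python
import torch as th
from torch import nn

# Hypothetical stand-in for a pretrained features extractor
extractor = nn.Sequential(nn.Linear(8, 16), nn.ReLU())

# Freeze it: no gradients are computed for its parameters
extractor.requires_grad_(False)
extractor.eval()


def preprocess(obs: th.Tensor) -> th.Tensor:
    """Turn raw observations into fixed features (what a wrapper would return)."""
    with th.no_grad():
        return extractor(obs)


raw_obs = th.randn(4, 8)        # batch of 4 hypothetical raw observations
features = preprocess(raw_obs)  # shape (4, 16), detached from any graph
```

With this design the agent only ever sees the 16-dimensional features, so its own networks can stay small.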
I am trying to migrate my paper's code to Stable Baselines3; the original code runs well, and in Stable Baselines3 my custom environment has passed check_env. In particular, I have found that most scholars and researchers are not aware of the importance of customizing neural networks for complex missions when using deep reinforcement learning. I also think the Stable Baselines3 user guide does not clearly tell users how to customize policy networks for off-policy algorithms. A full example of a customized policy for off-policy algorithms should be shown in the user guide or in the example code; otherwise, it will confuse users. For these reasons, to explain the problem more clearly and in the hope that a full example will be added to the user guide, I would like to paste all my custom policy network code below.
Describe the bug
I followed the Stable Baselines3 doc to customize the policy network for the DDPG algorithm, but it always raises errors when defining the DDPG model. If I remove the action_noise parameter, another error appears. The code is shown below (all of it follows the user guide):
Code example
I got the following error: