DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] How to insert current observation and action in Custom Policy Network? #305

Closed · rockingapple closed this issue 3 years ago

rockingapple commented 3 years ago

Question

When I followed the tutorial to customize the policy network, I found that the guide does not explain how to feed the current observation and action into a custom policy network.

Here is an example of the code:

import torch.nn as nn
import torch.nn.functional as F


class DeterministicCriticNet(nn.Module, BasicNet):  # definition of the critic network in DDPG
    def __init__(self,
                 state_dim,
                 action_dim,
                 gpu=False,
                 batch_norm=False,
                 non_linear=F.relu):
        super(DeterministicCriticNet, self).__init__()
        ...  # Omit other code

    def forward(self, x, action):
        x = self.to_torch_variable(x)  # current observation
        action = self.to_torch_variable(action)[:, None, None, :-1]  # current action
        ...  # Omit other code

As the code above shows, x is the current observation at the moment the agent takes an action. However, according to stable-baselines3's official guide, I cannot find a way to feed the current observation and action into the policy network. Is there any way to make policy customization more flexible? I want to insert the current observation and action into a custom policy network. I hope someone can give me an answer, thanks.


araffin commented 3 years ago

Hello,

I found that the guide does not explain how to feed the current observation and action into a custom policy network.

I'm not sure I get what you mean or what you want to do with the obs and action... The code you are showing does not look like what is shown in the doc (and the link to the doc points to custom envs, not custom policies), and it corresponds to the critic of DDPG (when using continuous actions).

You may check that issue: https://github.com/DLR-RM/stable-baselines3/issues/285 and the documentation for off-policy custom networks here: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html#off-policy-algorithms

watermeloncq commented 3 years ago

I am sorry for giving the wrong link; the right link should be here, which lacks flexibility when customizing the policy network. Yes, you are right, the code I showed above is not from the stable-baselines3 docs. It is my paper code, which comes from here; my code is based on it with some modifications.

I plan to migrate my code to stable-baselines3, so I need a more flexible custom policy network. For instance, in my paper the observation is represented by the combination of X and Action, where X is a price tensor and Action is a weight vector. In my paper's DDPG algorithm, the input of the Actor is the price tensor X, while the input of the Critic is the combination of X and Action. That is to say, the inputs of the Actor and the Critic in DDPG are different. However, according to the stable-baselines3 docs, I cannot give the Actor and the Critic different inputs when customizing them. What should I do?

araffin commented 3 years ago

the input of the Actor is the price tensor X, while the input of the Critic is the combination of X and Action. That is to say, the inputs of the Actor and the Critic in DDPG are different.

Well, that's the definition of the actor and the critic... and that's already the case in SB3. See TD3 (an improved version of DDPG):
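For reference, here is a rough sketch of that separation (illustrative PyTorch only, not the exact SB3 classes, which live under stable_baselines3/td3/policies.py and stable_baselines3/common/policies.py): the actor maps the observation to an action, while the critic takes the observation together with the action and returns a Q-value.

import torch as th
import torch.nn as nn


class Actor(nn.Module):
    """Maps observation -> action (simplified sketch)."""
    def __init__(self, obs_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, obs):
        # The actor only sees the observation
        return self.net(obs)


class Critic(nn.Module):
    """Maps (observation, action) -> Q-value (simplified sketch)."""
    def __init__(self, obs_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs, action):
        # The critic receives both the observation and the action
        return self.net(th.cat([obs, action], dim=1))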

watermeloncq commented 3 years ago

Thank you. I hope the SB3 docs will soon add a tutorial section on customizing the Actor and the Critic.

araffin commented 3 years ago

Again, as mentioned in https://github.com/DLR-RM/stable-baselines3/issues/285, I'm not sure what is missing from the documentation:
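For instance, the "off-policy algorithms" part of the custom policy guide already lets you give the actor (pi) and the critic (qf) different network sizes via policy_kwargs. A rough example along those lines (hyperparameters and environment are only illustrative):

from stable_baselines3 import TD3

# Separate architectures for the actor (pi) and the critic (qf)
policy_kwargs = dict(net_arch=dict(pi=[64, 64], qf=[400, 300]))
model = TD3("MlpPolicy", "Pendulum-v1", policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=5_000)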

watermeloncq commented 3 years ago

Thank you for your patience in answering me. I mean that I want to operate directly on the input (observation) inside the custom policy network; that is to say, the observation should be a parameter of the CustomNetwork class. If, as the doc shows, one can only change the number of layers and units, that is not enough to customize a policy in SB3. You guys really did a great job creating SB3, which is very attractive to me; I just want it to be more flexible.

araffin commented 3 years ago

I mean that I want to operate directly on the input (observation) inside the custom policy network; that is to say, the observation should be a parameter of the CustomNetwork class.

I still don't get your point. The observation is passed to both the actor and the critic (and the critic gets the action in addition). Both objects also have access to the observation and action spaces.

If, as the doc shows, one can only change the number of layers and units, that is not enough to customize a policy in SB3.

If you want to modify the observation, use a gym wrapper; if you want more flexibility, you have the feature extractor (did you take a look at the example?).
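For instance, a rough sketch of the feature-extractor route (the extractor name and layer sizes below are only illustrative, not taken from the docs):

import torch as th
import torch.nn as nn
from stable_baselines3 import DDPG
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class PriceTensorExtractor(BaseFeaturesExtractor):
    """Turns the raw observation (e.g. a price tensor) into a feature vector
    that is then fed to the actor and the critic."""
    def __init__(self, observation_space, features_dim=128):
        super().__init__(observation_space, features_dim)
        n_input = int(th.prod(th.tensor(observation_space.shape)))
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_input, features_dim),
            nn.ReLU(),
        )

    def forward(self, observations):
        return self.net(observations)


policy_kwargs = dict(
    features_extractor_class=PriceTensorExtractor,
    features_extractor_kwargs=dict(features_dim=128),
)
model = DDPG("MlpPolicy", "Pendulum-v1", policy_kwargs=policy_kwargs)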

If you want to do something even more fancy (let's say, pass the action at different stages of the critic using residual connections), then please take a look at the developer guide and then the code (which is commented), and then derive a custom policy object.
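As a plain PyTorch illustration of that kind of wiring (not an SB3 class), a critic that mixes the action in after the observation has been processed could look roughly like this:

import torch.nn as nn


class LateFusionCritic(nn.Module):
    """Critic that processes the observation first and only then
    mixes in the action (a sketch of the "fancier" wiring described above)."""
    def __init__(self, obs_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.obs_net = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.action_proj = nn.Linear(action_dim, hidden_dim)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs, action):
        h = self.obs_net(obs)
        # Residual-style connection: add the projected action to the
        # observation features instead of concatenating at the input.
        h = h + self.action_proj(action)
        return self.head(h)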

I just want it to be more flexible.

But in the end, the actor must output a valid action and the critic must output an action-value, so you cannot have too much flexibility anyway.

watermeloncq commented 3 years ago

If you want to modify the observation, use a gym wrapper; if you want more flexibility, you have the feature extractor (did you take a look at the example?).

It really inspires me; I will give it a try, thank you.