alexis-jacq / Pytorch-DPPO

Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286
MIT License

Loss questions #4

Closed: wassname closed this issue 7 years ago

wassname commented 7 years ago

I just went through your code and the PPO paper and have a few questions; perhaps if you have time you could comment.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super(Model, self).__init__()
        h_size_1 = 100
        h_size_2 = 100
        # shared trunk
        self.fc1 = nn.Linear(num_inputs, h_size_1)
        self.fc2 = nn.Linear(h_size_1, h_size_2)
        # actor head: mean of the Gaussian policy, plus a learned log-std parameter
        self.mu = nn.Linear(h_size_2, num_outputs)
        self.log_std = nn.Parameter(torch.zeros(num_outputs))
        # critic head: state value
        self.v = nn.Linear(h_size_2, 1)
        for name, p in self.named_parameters():
            # init parameters: zero all biases
            if 'bias' in name:
                p.data.fill_(0)
            # alternative init for the policy mean, kept commented out:
            # if 'mu.weight' in name:
            #     p.data.normal_()
            #     p.data /= torch.sum(p.data**2, 0).expand_as(p.data)
        # set the module to training mode
        self.train()

    def forward(self, inputs):
        # actor
        x = F.tanh(self.fc1(inputs))
        h = F.tanh(self.fc2(x))
        mu = self.mu(h)
        # note: despite the name, this is exp(log_std), i.e. the std itself
        log_std = torch.exp(self.log_std).unsqueeze(0).expand_as(mu)
        # critic
        v = self.v(h)
        return mu, log_std, v
wassname commented 7 years ago

Actually, with the latest version of your code all the losses come out positive, so you can ignore point 2.

alexis-jacq commented 7 years ago

Hello, thanks a lot for the interest!

log_std is the log of the standard deviation. The model outputs the mean, log_std and v, and the mean and std are used to sample an action from a normal distribution. In that sense, yes, the std drives the exploration around the learned means. I can't tell from the paper whether this parameter must be learned, but learned or not, its value stays small (the std stays close to 1) and performance does not seem to be affected.
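For illustration only (this is not code from the repo; the observation and action sizes below are made up, and torch.distributions.Normal is just one way to do the sampling), drawing an action from that normal distribution could look roughly like this:

import torch

# e.g. a single observation with 3 features and a 1-dimensional action space
model = Model(num_inputs=3, num_outputs=1)
obs = torch.randn(1, 3)

mu, std, v = model(obs)                     # "std" is exp(log_std), expanded to mu's shape
dist = torch.distributions.Normal(mu, std)  # Gaussian policy centred on the learned means
action = dist.sample()                      # exploration comes from the std
log_prob = dist.log_prob(action).sum(-1)    # sum log-probabilities over action dimensions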

Regarding the value loss: if you don't learn it, the advantage means nothing, so you can't learn the policy. The paper doesn't say the value loss is unnecessary when parameters are not shared; rather, it says that if parameters are shared, then the value and the policy must be trained in the same gradient step (which makes sense). As far as I understand, OpenAI's baselines don't share parameters in the default version; they only do so in the CNN version (for Atari games). Also, with my implementation, I found even lower performance when sharing parameters (as in your suggested code).
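As a rough sketch of what "the same gradient step" means when parameters are shared (the tensors below are dummy stand-ins for one batch of rollout data, and the clipped surrogate is the one from the PPO paper, not necessarily this repo's exact loss):

import torch
import torch.nn.functional as F

# dummy stand-ins for one batch of rollout data (shapes are illustrative)
ratio = torch.ones(32, requires_grad=True)    # pi_new(a|s) / pi_old(a|s)
advantage = torch.randn(32)                   # estimated advantages
values = torch.randn(32, requires_grad=True)  # critic output v(s)
returns = torch.randn(32)                     # empirical returns

clip_eps = 0.2
# clipped PPO surrogate for the policy
surrogate = torch.min(ratio * advantage,
                      torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage)
policy_loss = -surrogate.mean()
# squared-error loss for the value head
value_loss = F.mse_loss(values, returns)
# one combined loss, one backward pass: the policy and the value are updated together
loss = policy_loss + 0.5 * value_loss
loss.backward()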

wassname commented 7 years ago

That makes sense, thanks for the explanations!