alexis-jacq / Pytorch-DPPO

Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286
MIT License

Loss questions #4

Closed: wassname closed this issue 7 years ago

wassname commented 7 years ago

I just went through your code and the PPO paper and have a few questions; perhaps if you have time you could comment.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super(Model, self).__init__()
        h_size_1 = 100
        h_size_2 = 100
        # shared trunk
        self.fc1 = nn.Linear(num_inputs, h_size_1)
        self.fc2 = nn.Linear(h_size_1, h_size_2)
        # actor head: mean of the Gaussian policy, plus a learned log-std parameter
        self.mu = nn.Linear(h_size_2, num_outputs)
        self.log_std = nn.Parameter(torch.zeros(num_outputs))
        # critic head: state value
        self.v = nn.Linear(h_size_2, 1)
        for name, p in self.named_parameters():
            # init parameters: zero all biases
            if 'bias' in name:
                p.data.fill_(0)
            # alternative init for the policy mean, kept commented out:
            # if 'mu.weight' in name:
            #     p.data.normal_()
            #     p.data /= torch.sum(p.data**2, 0).expand_as(p.data)
        # set the module to training mode
        self.train()

    def forward(self, inputs):
        # actor
        x = F.tanh(self.fc1(inputs))
        h = F.tanh(self.fc2(x))
        mu = self.mu(h)
        # note: despite the name, this is exp(log_std), i.e. the std itself
        log_std = torch.exp(self.log_std).unsqueeze(0).expand_as(mu)
        # critic
        v = self.v(h)
        return mu, log_std, v
wassname commented 7 years ago

Actually, with the latest version of your code all the losses come out positive, so you can ignore point 2.

alexis-jacq commented 7 years ago

Hello, thanks a lot for the interest!

log_std is the log of the standard deviation. The model outputs the mean, log_std and v, and the mean and std are used to sample an action from a normal distribution. In that sense, yes, the std drives the exploration around the learned means. I can't tell from the paper whether this parameter must be learned, but learned or not, its value stays small (the std stays close to 1) and performance does not seem to be affected.
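For illustration only (this is not code from the repo; the observation and action sizes below are made up, and torch.distributions.Normal is just one way to do the sampling), drawing an action from that normal distribution could look roughly like this:

import torch

# e.g. a single observation with 3 features and a 1-dimensional action space
model = Model(num_inputs=3, num_outputs=1)
obs = torch.randn(1, 3)

mu, std, v = model(obs)                     # "std" is exp(log_std), expanded to mu's shape
dist = torch.distributions.Normal(mu, std)  # Gaussian policy centred on the learned means
action = dist.sample()                      # exploration comes from the std
log_prob = dist.log_prob(action).sum(-1)    # sum log-probabilities over action dimensions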

Regarding the value loss: if you don't learn it, the advantage means nothing, so you can't learn the policy. The paper doesn't say the value loss is unnecessary when parameters are not shared; rather, it says that if parameters are shared, then the value and the policy must be trained in the same gradient step (which makes sense). As far as I understand, OpenAI's baselines don't share parameters in the default version; they only do so in the CNN version (for Atari games). Also, with my implementation, I found even lower performance when sharing parameters (as in your suggested code).
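As a rough sketch of what "the same gradient step" means when parameters are shared (the tensors below are dummy stand-ins for one batch of rollout data, and the clipped surrogate is the one from the PPO paper, not necessarily this repo's exact loss):

import torch
import torch.nn.functional as F

# dummy stand-ins for one batch of rollout data (shapes are illustrative)
ratio = torch.ones(32, requires_grad=True)    # pi_new(a|s) / pi_old(a|s)
advantage = torch.randn(32)                   # estimated advantages
values = torch.randn(32, requires_grad=True)  # critic output v(s)
returns = torch.randn(32)                     # empirical returns

clip_eps = 0.2
# clipped PPO surrogate for the policy
surrogate = torch.min(ratio * advantage,
                      torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage)
policy_loss = -surrogate.mean()
# squared-error loss for the value head
value_loss = F.mse_loss(values, returns)
# one combined loss, one backward pass: the policy and the value are updated together
loss = policy_loss + 0.5 * value_loss
loss.backward()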

wassname commented 7 years ago

That makes sense, thanks for the explanations!