Khrylx / PyTorch-RL

PyTorch implementation of Deep Reinforcement Learning: Policy Gradient methods (TRPO, PPO, A2C) and Generative Adversarial Imitation Learning (GAIL). Fast Fisher vector product TRPO.

about the kl #21

Closed: yangyiqin-tsinghua closed this issue 4 years ago

yangyiqin-tsinghua commented 4 years ago
def get_kl(self, x):
    # Probabilities of each action under the current policy
    action_prob1 = self.forward(x)
    # Detached copy of the same probabilities
    action_prob0 = action_prob1.detach()
    kl = action_prob0 * (torch.log(action_prob0) - torch.log(action_prob1))
    return kl.sum(1, keepdim=True)

Shouldn't the KL be computed between two different policies? Here action_prob1 == action_prob0. Thank you

Khrylx commented 4 years ago

In the TRPO paper, the Hessian of the KL is computed at \theta = \theta_{old}. So the two action probs are equal in value, but action_prob0, which represents \pi_{old}, is detached so that no gradient flows through it.
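
You can sanity-check this directly: the gradient of this KL at \theta = \theta_{old} is analytically zero, yet differentiating (\nabla KL \cdot v) a second time yields the nonzero Fisher-vector product that TRPO's conjugate gradient needs. A minimal sketch, using a hypothetical toy softmax policy rather than this repo's Policy class:

import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy 2-action discrete policy (hypothetical, for illustration only)
policy = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=1))
x = torch.randn(8, 4)

action_prob1 = policy(x)
action_prob0 = action_prob1.detach()  # pi_old: same values, no gradient flow
kl = (action_prob0 * (action_prob0.log() - action_prob1.log())).sum(1).mean()

params = list(policy.parameters())
grads = torch.autograd.grad(kl, params, create_graph=True)
flat_grad = torch.cat([g.reshape(-1) for g in grads])
print(flat_grad.abs().max())  # ~0: the first-order term vanishes at theta_old

v = torch.randn_like(flat_grad)  # arbitrary direction
hvp = torch.autograd.grad(flat_grad @ v, params)
print(torch.cat([h.reshape(-1) for h in hvp]).norm())  # nonzero: Fisher-vector product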

yangyiqin-tsinghua commented 4 years ago

Thank you very much!
