Khrylx / PyTorch-RL

PyTorch implementation of Deep Reinforcement Learning: Policy Gradient methods (TRPO, PPO, A2C) and Generative Adversarial Imitation Learning (GAIL). Fast Fisher vector product TRPO.
MIT License

About the computation of Advantage and State Value in PPO #6

Closed · mjbmjb closed this issue 6 years ago

mjbmjb commented 6 years ago

In your implementation of the critic, you feed the network both the observation and the action, and it outputs a one-dimensional value. Can I infer that this is Q(s, a)? But the advantage you compute is

    values = self.critic_target(states_var, actions_var).detach()
    advantages = rewards_var - values

which estimates q_t minus Q(s_t, a_t). I think it should be Advantage = q_t - V(s_t).

Khrylx commented 6 years ago

Which code are you talking about? I don't use the action as input to my value network.

mjbmjb commented 6 years ago

Sorry, my mistake. Closing it now.
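
For reference, here is a minimal sketch (not the repository's actual code) of the pattern the thread converges on: the critic takes only the state, so its output is V(s) rather than Q(s, a), and the advantage is the empirical return q_t minus V(s_t). The names `ValueNet` and `compute_advantages` below are illustrative, not taken from this repo.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """State-value critic V(s): takes only the observation, no action."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, states):
        # Output shape (batch, 1); since no action is fed in, this is V(s), not Q(s, a).
        return self.net(states)

def compute_advantages(value_net, states, returns):
    """Advantage estimate A_t = q_t - V(s_t), using empirical returns as q_t."""
    values = value_net(states).detach()  # V(s_t), detached so the policy update ignores critic gradients
    return returns - values              # A_t

# Hypothetical usage with random data, just to show the shapes involved.
if __name__ == "__main__":
    state_dim, batch = 4, 8
    critic = ValueNet(state_dim)
    states = torch.randn(batch, state_dim)
    returns = torch.randn(batch, 1)      # e.g. discounted rewards-to-go q_t
    print(compute_advantages(critic, states, returns).shape)  # torch.Size([8, 1])
```

Since the critic here never sees the action, subtracting its output from the return is the usual baseline-subtracted advantage, which is what the original question was asking about.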