Khrylx / PyTorch-RL

PyTorch implementation of Deep Reinforcement Learning: Policy Gradient methods (TRPO, PPO, A2C) and Generative Adversarial Imitation Learning (GAIL). Fast Fisher vector product TRPO.

Confusion about advantage computation #16

Closed · gunshi closed 5 years ago

gunshi commented 5 years ago

Hey! I'm a bit confused about the advantage computation: why is the previous advantage set to the advantage of the *first* environment from the previous time step, i.e. `advantages[i, 0]` (assuming `advantages` has shape `(time_steps, num_envs, 1)`)?

https://github.com/Khrylx/PyTorch-RL/blob/15b574f5d52f5eeab6917c90c17e8739578f3d96/core/common.py#L17

Could you also link a source for the equations this function implements? Thanks!
Gunshi
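
For reference, the recursion being asked about matches Generalized Advantage Estimation (GAE; Schulman et al., arXiv:1506.02438): $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ and $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$, computed backward in time with $\hat{A}_T = 0$. Below is a minimal per-environment sketch of that recursion, not the repo's exact code; the function name, argument names, and `(time_steps, num_envs)` shapes are illustrative assumptions.

```python
import torch

def estimate_gae(rewards, masks, values, gamma=0.99, lam=0.95):
    """GAE computed backward in time.

    rewards, masks, values: tensors of shape (time_steps, num_envs).
    masks[t] is 0 where the episode ended at step t, else 1.
    """
    advantages = torch.zeros_like(rewards)
    prev_value = torch.zeros_like(values[0])       # bootstrap value V(s_T), taken as 0 here
    prev_advantage = torch.zeros_like(rewards[0])  # A_T = 0
    for t in reversed(range(rewards.size(0))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * prev_value * masks[t] - values[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        advantages[t] = delta + gamma * lam * prev_advantage * masks[t]
        # carry the full per-environment vectors, not just one column
        prev_value = values[t]
        prev_advantage = advantages[t]
    returns = advantages + values  # value-function regression target
    return advantages, returns
```

Note that this sketch carries `prev_value` and `prev_advantage` as full per-environment vectors; indexing a single column such as `advantages[i, 0]` would only coincide with this when `num_envs == 1`, which seems to be what the question is getting at.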