Khrylx / PyTorch-RL

PyTorch implementation of Deep Reinforcement Learning: Policy Gradient methods (TRPO, PPO, A2C) and Generative Adversarial Imitation Learning (GAIL). Fast Fisher vector product TRPO.
MIT License
1.09k stars, 186 forks

Doubt regarding the calculation of advantage #23

Closed (nesarasr closed this issue 4 years ago)

nesarasr commented 4 years ago

Hey, thanks for this great repository! I am a beginner in RL and I am trying to understand the practical implementation of TRPO. What is the purpose of multiplying by the variable 'mask' when computing the advantage (in the estimate_advantage() function)? And what range of values do 'masks' take?

Thanks!

Khrylx commented 4 years ago

Since all episodes are concatenated together inside the batch, the mask makes sure that rewards from previous episodes are not added to the current episode.
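
To illustrate the point, here is a minimal sketch in plain Python. It is not the repository's actual code. It assumes that the mask is 0 on the last step of each episode and 1 on every other step, and that the advantage is computed by a backward GAE-style recursion. The function name estimate_advantages_sketch, the parameter names gamma and tau, and the toy data are all hypothetical.

```python
def estimate_advantages_sketch(rewards, masks, values, gamma=0.99, tau=0.95):
    """Backward GAE-style recursion over a batch of concatenated episodes.

    Assumption: masks[i] is 0 on the final step of an episode, 1 otherwise.
    """
    n = len(rewards)
    advantages = [0.0] * n
    next_value = 0.0      # value of the step that follows step i in the batch
    next_advantage = 0.0  # advantage of the step that follows step i
    for i in reversed(range(n)):
        # masks[i] == 0 drops the bootstrap term, so nothing from the
        # neighbouring episode in the batch leaks into this one.
        delta = rewards[i] + gamma * next_value * masks[i] - values[i]
        advantages[i] = delta + gamma * tau * next_advantage * masks[i]
        next_value = values[i]
        next_advantage = advantages[i]
    return advantages


# Two episodes of two steps each, concatenated into one batch.
rewards = [1.0, 1.0, 5.0, 5.0]
values = [0.0, 0.0, 0.0, 0.0]

# With masks: each episode only accumulates its own rewards.
masked = estimate_advantages_sketch(rewards, [1, 0, 1, 0], values, gamma=1.0, tau=1.0)
print(masked)    # [2.0, 1.0, 10.0, 5.0]

# With the mask forced to 1 everywhere, the second episode's rewards
# are added to the first episode's advantages.
unmasked = estimate_advantages_sketch(rewards, [1, 1, 1, 1], values, gamma=1.0, tau=1.0)
print(unmasked)  # [12.0, 11.0, 10.0, 5.0]
```

With gamma and tau set to 1 and all values set to 0, each advantage is simply the sum of the remaining rewards, which makes the leak easy to see: without the mask, the first episode's entries pick up the 10.0 that belongs to the second episode.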

nesarasr commented 4 years ago

Got it, thank you!!