quyouyuan closed this issue 2 years ago
Hi @quyouyuan - these two lines calculate the discounted return for each timestep. The returns are then used to calculate the advantage function and in the trajectory selection module (see Eq. 6 and Eq. 10 in the paper).
The definition of the discounted return is given in Eq. 1. If you have further questions, please let me know.
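For readers following along, here is a minimal sketch of how a per-timestep discounted return is typically computed. It is not the code from `ppo_agent.py`: the function name `discounted_returns`, the value of `gamma`, and the -1/0 sparse reward pattern are assumptions chosen for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every timestep of one episode.

    Hypothetical helper for illustration; not taken from the repository.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Assumed sparse reward: -1 on every step until the goal is reached, then 0.
sparse_rewards = [-1.0, -1.0, -1.0, 0.0]
returns = discounted_returns(sparse_rewards, gamma=0.5)
print(returns)  # [-1.75, -1.5, -1.0, 0.0]
```

Even though the raw rewards are sparse, the returns differ from step to step, which is what makes them usable for advantage estimation and for comparing trajectories.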
OK, thanks for your reply!
Hello, excuse me! This environment poses a sparse-reward problem, so why is a number added to these rewards? Could you explain the two lines linked below? I have never understood them. Thank you for your reply!
https://github.com/TianhongDai/esil-hindsight/blob/94a7e10bd967fcd91e0b2e53c39cd41e0f14f5df/rl_base/ppo_agent.py#L133
https://github.com/TianhongDai/esil-hindsight/blob/94a7e10bd967fcd91e0b2e53c39cd41e0f14f5df/rl_base/ppo_agent.py#L134