TianhongDai / esil-hindsight

This is the official code of our paper "Episodic Self-Imitation Learning with Hindsight" [Electronics 2020].
MIT License

rewards #5

Closed quyouyuan closed 2 years ago

quyouyuan commented 2 years ago

Hello! Excuse me! The environment here solves a sparse-reward problem, so why do these two lines add a number to the rewards? Could you explain what these two lines are doing? I never understood this. Thank you for your reply!

https://github.com/TianhongDai/esil-hindsight/blob/94a7e10bd967fcd91e0b2e53c39cd41e0f14f5df/rl_base/ppo_agent.py#L133 https://github.com/TianhongDai/esil-hindsight/blob/94a7e10bd967fcd91e0b2e53c39cd41e0f14f5df/rl_base/ppo_agent.py#L134

TianhongDai commented 2 years ago

Hi @quyouyuan - these two lines are used to calculate the discounted return for each timestep, which is then used to compute the advantage function and in the trajectory selection module (see Eq. 6 and Eq. 10 in the paper).

The details can be found in Eq. 1. If you have further questions, please let me know.
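For reference, computing a discounted return per timestep is usually done with the standard backward recursion R_t = r_t + γ·R_{t+1}. Below is a minimal, self-contained sketch of that recursion (not the repo's exact code; the γ value and the example rewards are illustrative only):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.98):
    """Compute R_t = sum_k gamma^k * r_{t+k} for every timestep t,
    accumulating backwards from the end of the episode."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Sparse-reward episode: only the last step gives a reward.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25 0.5  1.  ]
```

Even with sparse rewards, the discounting propagates the final reward back to earlier timesteps, which is what makes the per-timestep returns (and hence the advantages) informative.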

quyouyuan commented 2 years ago

OK! OK! Thanks for your reply!