Denys88 / rl_games


Question about value_bootstrap #242

Closed johannespitz closed 1 year ago

johannespitz commented 1 year ago

Is multiplying the value by gamma here mathematically motivated? https://github.com/Denys88/rl_games/blob/811af5c19e9f7ebf7ff0725512bfc64e52439fe8/rl_games/common/a2c_common.py#L683

Intuitively, I would argue that not multiplying by gamma makes more sense, because then the advantage that is later calculated will be exactly the reward at that time step, instead of slightly more or less (depending on the sign of the value). https://github.com/Denys88/rl_games/blob/811af5c19e9f7ebf7ff0725512bfc64e52439fe8/rl_games/common/a2c_common.py#L525 The idea being: "we don't have information about the future values, so they should have no effect".

Or am I missing something and the value at that last time step does provide useful information?
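To make the question concrete, here is a minimal, self-contained sketch of the pattern being discussed (my own illustration, not the actual rl_games code; the function names are made up). It shows how adding the bootstrap term with or without gamma changes the GAE delta at a step that was cut off by the time limit:

```python
import numpy as np

gamma = 0.99

def shape_rewards(rewards, values, time_outs, use_gamma=True):
    """Add a bootstrap term at truncated steps.
    time_outs is 1.0 where the episode was cut off by the time limit, else 0.0."""
    bootstrap = values * time_outs
    return rewards + (gamma * bootstrap if use_gamma else bootstrap)

def gae_delta(shaped_rewards, values, next_values, dones):
    # Standard one-step TD error used by GAE; the next value is masked on done.
    return shaped_rewards + gamma * next_values * (1.0 - dones) - values

# One truncated step: reward r, current value estimate v, done = time_out = 1.
r, v = np.array([1.0]), np.array([5.0])
done = timeout = np.array([1.0])
next_v = np.array([0.0])  # ignored because done = 1

for use_gamma in (True, False):
    delta = gae_delta(shape_rewards(r, v, timeout, use_gamma), v, next_v, done)
    print(use_gamma, delta)
# True  -> r - (1 - gamma) * v  (slightly more or less than r, depending on sign of v)
# False -> exactly r
```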

Denys88 commented 1 year ago

Hi @johannespitz. My thought was that if we don't multiply by gamma, the value estimate will diverge. Imagine an infinite time limit: what would the value be?
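My paraphrase of this argument (not stated in the thread): at a step truncated purely by the time limit, the value-function target is the shaped reward, so the critic's fixed point is determined by the recursion below. Without the gamma factor there is no finite solution; with it, the usual discounted fixed point is recovered:

```latex
\[
\text{without } \gamma:\quad V = r + V \;\Rightarrow\; \text{no finite fixed point for } r \neq 0,
\qquad
\text{with } \gamma < 1:\quad V = r + \gamma V \;\Rightarrow\; V = \frac{r}{1-\gamma}.
\]
```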

johannespitz commented 1 year ago

Hi @Denys88, thanks for getting back to me. When I posted the issue I was only thinking about the effect on the policy training, but you are right: since you use the shaped_reward for updating the value function, that could be an unintended side effect. I guess the right thing to do would be to use gamma but compute the value of the next observation. However, at that time step (when done=True) I am already returning the first observation of the next episode, so I don't have that last observation available right now. Is this the intended way of using the repo? I assumed it would be, because otherwise there would be a "broken tuple" in the replay buffer for every episode reset.
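For illustration, a hypothetical sketch of what "use gamma but bootstrap from the next observation" could look like if the environment exposed the real final observation at a truncation, e.g. through an info dict. The key name "terminal_observation" is an assumption borrowed from other RL libraries, not something rl_games or IsaacGym provides:

```python
import torch

def bootstrap_reward(reward, time_out, terminal_obs, critic, gamma=0.99):
    """At a time-limit truncation, return reward + gamma * V(terminal_obs).

    terminal_obs is assumed to be the true last observation of the episode,
    i.e. the one seen *before* the env auto-resets (hypothetical API).
    """
    if time_out:
        with torch.no_grad():
            reward = reward + gamma * critic(terminal_obs).item()
    return reward

# Toy usage with a stand-in critic:
critic = torch.nn.Linear(4, 1)
obs_before_reset = torch.randn(4)
print(bootstrap_reward(1.0, True, obs_before_reset, critic))
```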

Denys88 commented 1 year ago

As I remember, deepmind control returns the next_observation in the info as if termination didn't happen. In that case it is better to use it instead of the current observation, but I don't expect it to change a lot; it shouldn't be very different in tasks like this. IsaacGym doesn't provide it, but we could request this feature.

johannespitz commented 1 year ago

Yeah, it probably won't make much of a difference (I was just curious whether there was a mathematical reason), so I'm closing this issue.

However, I noticed that OmniIsaacGymEnvs returns the wrong observation when done=True (because the reset happens in pre_physics_step). Maybe we should open an issue there.