Kaixhin / Rainbow

Rainbow: Combining Improvements in Deep Reinforcement Learning

Handling of terminal state #11

Closed · marintoro closed this 6 years ago

marintoro commented 6 years ago

I don't really understand what you did to the computation of the nonterminal states in the last commit (line 128 in memory.py).

nonterminals = self.dtype_float([transition.nonterminal for transition in full_transitions[self.history + self.n - 1]]).unsqueeze(1)

What if the current state is just one step before a terminal state? Then full_transitions[self.history] will be terminal, but not all of the subsequent ones (because you only post-append one frame marked as terminal in memory), and in particular full_transitions[self.history + self.n - 1] will not be terminal...
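Roughly, the situation can be illustrated with a toy window of transitions (made-up data and sizes, not the repository's actual layout):

```python
# Toy illustration (made-up data): with a single post-appended blank, the flag
# read at offset history + n - 1 belongs to the NEXT episode.
from collections import namedtuple

T = namedtuple('T', ('reward', 'nonterminal'))
history, n = 4, 3

window = [T(0.0, False)] * (history - 1) + [  # pre-appended blanks before the episode
    T(1.0, True),   # offset history - 1: the sampled state s_t
    T(1.0, False),  # s_{t+1} is terminal
    T(0.0, False),  # the single post-appended blank
    T(1.0, True),   # first transition of the next episode
]
print(window[history + n - 1].nonterminal)  # True, although the episode ended at t+1
```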

In fact I think the safest way to handle terminal states is to post-append self.n frames (and mark them as terminal) rather than only one, exactly in the same way as the pre-append, where you add self.history frames and not only one.
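A minimal sketch of that kind of post-append, assuming a simplified memory class (the list-based buffer and names like `blank_trans` are stand-ins, not the repository's exact code):

```python
# Sketch only: post-append n blank terminal transitions at episode end, mirroring
# the history-length padding used at episode start.
from collections import namedtuple

import torch

Transition = namedtuple('Transition', ('timestep', 'state', 'action', 'reward', 'nonterminal'))
blank_trans = Transition(0, torch.zeros(84, 84, dtype=torch.uint8), None, 0.0, False)


class ReplayMemorySketch:
    def __init__(self, history=4, n=3):
        self.history, self.n = history, n
        self.t = 0      # timestep within the current episode
        self.data = []  # stand-in for the real ring buffer / segment tree

    def append(self, state, action, reward, terminal):
        self.data.append(Transition(self.t, state, action, reward, not terminal))
        if terminal:
            # Pad with n terminal blanks so an n-step window starting inside this
            # episode never reads rewards or flags from the following episode
            self.data.extend([blank_trans] * self.n)
            self.t = 0
        else:
            self.t += 1
```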

Indeed, if self.n >> self.history, the current computation of the returns will be wrong because you don't check whether a terminal state is reached, so the return could include rewards from the next episode (when self.n < self.history this bug is hidden by the fact that you pre-append self.history frames with 0 reward at the beginning of each episode).
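For reference, a truncated n-step return has to stop summing as soon as a terminal transition is reached; a sketch of such a computation (not the repository's vectorised code) could look like:

```python
# Sketch: n-step return that truncates at episode boundaries, assuming each
# transition carries (reward, nonterminal) as discussed above.
def n_step_return(transitions, start, n, discount):
    """Return sum_k discount**k * r_{start+k}, stopping once a terminal is hit."""
    ret, gamma = 0.0, 1.0
    for k in range(n):
        t = transitions[start + k]
        ret += gamma * t.reward
        if not t.nonterminal:
            break  # episode ended here: never read rewards from the next episode
        gamma *= discount
    return ret
```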

Kaixhin commented 6 years ago

Yep, you've found a bug that doesn't actually affect these hyperparameters, so I never sorted it, but I probably should 😛 Your suggestion of post-appending self.n frames is ideal, because it not only sorts this out properly, but also works directly with the multi-step return code, calculating the correct truncated returns near the end of an episode. And of course, by dropping any transitions too close to the current buffer index, anything validly* sampled around the end of an episode should have everything in place 😁
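A rough sketch of that kind of validity test when sampling (the names and exact bounds here are assumptions, not the repository's code):

```python
# Sketch: reject sample indices whose history/n-step window would overlap the
# current write position, or whose priority is zero (e.g. padding blanks).
def is_valid_sample(idx, write_index, capacity, history, n, priority):
    steps_behind_write = (write_index - idx) % capacity    # room for the n-step lookahead
    steps_ahead_of_write = (idx - write_index) % capacity  # room for the history lookback
    return steps_behind_write > n and steps_ahead_of_write >= history and priority > 0
```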

* I haven't checked for certain that transitions with 0 probability are not returned, but for now I can at least drop those too (2c535a6).