Closed marintoro closed 6 years ago
Yep, you found the bug. It doesn't actually affect these hyperparameters, so I never sorted it, but I probably should 😛 Your suggestion of post-appending `self.n` frames is ideal, because it not only sorts this out properly, but it also works directly with the multi-step return code, calculating the correct truncated returns near the end of an episode. And of course, by dropping any transitions too close to the current buffer index, anything validly* sampled around the end of an episode should have everything in place 😁
* I haven't checked for certain that transitions with 0 probability are not returned, but for now I can at least drop those too (2c535a6).
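A minimal sketch of the "drop transitions too close to the current buffer index" check described above, assuming a circular buffer with a single write pointer (the names `idx`, `write_index`, etc. are illustrative, not the repo's actual code):

```python
def is_valid_sample(idx, write_index, capacity, history, n):
    """Reject samples whose frame-stack or n-step window would straddle
    the circular buffer's write pointer.

    A sampled transition needs `history` valid frames behind it and `n`
    valid successors ahead of it, so anything within `n` steps behind the
    write pointer (successors not yet written) or within `history` steps
    ahead of it (history already overwritten) is invalid.
    """
    # Circular distance from the sampled index back to the write pointer
    dist = (write_index - idx) % capacity
    return dist > n and dist < capacity - history
```

For example, with `capacity=10`, `write_index=5`, `history=2`, `n=3`: index 1 is valid, index 4 is rejected (its n-step successors are not in the buffer yet), and index 6 is rejected (its history frames have been overwritten).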
I don't really understand what you did to the calculation of the nonterminal states in the last commit (line 128 in memory.py).
What if the current state is just one step before a terminal state? Then `full_transitions[self.history]` will be terminal, but not all the next ones (because you only post-append one frame as terminal in memory), and in particular `full_transitions[self.history + self.n - 1]` will not be terminal... In fact, I think the safest way to handle terminal states is to post-append `self.n` frames (and set them as terminal), not only one (exactly in the same way as in pre-append, where you add `self.history` frames and not only one). Indeed, if you have `self.n >> self.history`, the current computation of the returns will be wrong, because you don't check whether we reach a terminal state, and we could pick up reward from the next episode (when `self.n < self.history` this bug is hidden by the fact that you pre-append `self.history` frames with 0 reward at the beginning of each episode).
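A toy sketch of the fix being proposed: pad each episode with `history` blank frames before it and `n` blank terminal frames after it, so that the n-step return never reaches into the next episode. All names here (`BLANK`, `append_episode`, the dict-based transitions) are illustrative, not the repo's actual data structures:

```python
# A blank padding transition: no state, zero reward, marked terminal
BLANK = dict(state=None, reward=0.0, nonterminal=False)

def append_episode(buffer, transitions, history, n):
    """Pre-append `history` blank frames and post-append `n` blank terminal
    frames around an episode, so neither the frame-stacking window nor the
    n-step return window ever crosses an episode boundary."""
    buffer.extend([dict(BLANK) for _ in range(history)])  # frame-stack padding
    buffer.extend(transitions)
    buffer.extend([dict(BLANK) for _ in range(n)])        # return truncation padding

def n_step_return(buffer, idx, n, gamma):
    """Discounted n-step return that truncates at a terminal transition.

    Because the post-appended padding frames carry zero reward and are
    terminal, the loop stops accumulating as soon as the episode ends."""
    R = 0.0
    for k in range(n):
        t = buffer[idx + k]
        R += (gamma ** k) * t['reward']
        if not t['nonterminal']:
            break  # episode ended: return is truncated here
    return R
```

With only one post-appended terminal frame instead of `n`, a sample taken a few steps before the end of one episode would walk straight through the single terminal frame into the rewards of the next episode, which is exactly the failure mode described above when `self.n` is large relative to `self.history`.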