Closed Mauhn-Bjarne closed 1 year ago
Yes, exactly. is_last means the episode ended (so the following time step will be from a different episode) whereas is_terminal indicates to the algorithm that it should not consider future rewards, e.g. a TD error should bootstrap against zero. In non-episodic tasks like many DMC tasks, is_last is set but is_terminal isn't. In episodic tasks like Atari, both are set.
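To make the distinction concrete, here is a minimal sketch of a one-step TD target that bootstraps against zero only when `is_terminal` is set. The function name `td_target`, its arguments, and the discount value are assumptions chosen for illustration, not code from this repository.

```python
def td_target(reward, next_value, is_terminal, gamma=0.99):
    """One-step TD target (hypothetical helper, not the repository's code)."""
    # The continuation flag is derived from is_terminal only, never from is_last.
    cont = 0.0 if is_terminal else 1.0
    return reward + gamma * cont * next_value

# Time-limit cutoff (is_last=True, is_terminal=False): still bootstraps.
truncated = td_target(reward=1.0, next_value=10.0, is_terminal=False)

# True ending (is_last=True, is_terminal=True): bootstraps against zero.
terminated = td_target(reward=1.0, next_value=10.0, is_terminal=True)
```

In this sketch `is_last` does not enter the target at all; it only tells the data collection loop that the next time step belongs to a new episode.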
Thank you for the explanation. Now it is clear to me when to set `is_last` and `is_terminal`. I am, however, still a bit confused about the updates to the value function when `is_terminal` is set to `False`.
If at time step `T` we set `is_last=True` and `is_terminal=False`, would it not be logical for `V(s_T)` to not receive an update? Because `is_terminal` is not set, there is no update towards 0. Your code, however, will look at the first rewards and values of the next episode and use these to update `V(s_T)`.
Why would you do this? Why don't you set the target for `V(s_T)` to `V(s_T)` itself, since we are not able to obtain a better target at the moment?
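The concern can be illustrated with a small sketch of λ-returns computed over two episodes stored back to back. Everything here is an assumption made for illustration: the helper `lambda_returns`, the convention that `rewards[t]` is the reward for leaving `s_t`, and the toy numbers. It is not the `score` method from the repository, and whether that method behaves like the naive variant is exactly the open question.

```python
def lambda_returns(rewards, values, conts, bootstrap, is_last=None,
                   gamma=0.99, lam=0.95):
    """Backward lambda-return recursion (hypothetical sketch).

    rewards[t]: reward for leaving s_t, values[t]: V(s_t),
    conts[t]: 0 if s_t is terminal else 1, bootstrap: value after the last step.
    """
    n = len(rewards)
    returns = [0.0] * n
    next_value, next_return = bootstrap, bootstrap
    for t in reversed(range(n)):
        if is_last is not None and is_last[t] and conts[t]:
            # Truncated step: use V(s_t) as its own target, as proposed above.
            returns[t] = values[t]
        else:
            mix = (1 - lam) * next_value + lam * next_return
            returns[t] = rewards[t] + gamma * conts[t] * mix
        next_value, next_return = values[t], returns[t]
    return returns

# Episode A is steps 0-1 (cut off at step 1), episode B is steps 2-3.
rewards = [0.0, 0.0, 5.0, 0.0]
values = [1.0, 1.0, 0.0, 0.0]
conts = [1, 1, 1, 1]  # is_terminal is never set
is_last = [False, True, False, False]

naive = lambda_returns(rewards, values, conts, bootstrap=0.0,
                       gamma=1.0, lam=1.0)
masked = lambda_returns(rewards, values, conts, bootstrap=0.0,
                        is_last=is_last, gamma=1.0, lam=1.0)
```

With `gamma=1` and `lam=1`, `naive` is `[5.0, 5.0, 5.0, 0.0]`: the reward of 5 from episode B flows into the targets for episode A. `masked` is `[1.0, 1.0, 5.0, 0.0]`: the target at the cutoff step is `V(s_T)` itself, so nothing crosses the episode boundary.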
I appreciate the clarification :slightly_smiling_face:
Hey Danijar,
I noticed that some environments and some wrappers (for example the `TimeLimit` wrapper) set `is_last` to `True` but do not set `is_terminal` to `True`. From what I can tell, the `is_last` flag makes sure the environment is reset on the next iteration, but it is the `is_terminal` flag that is used to set the continuation flag. Since the value function only looks at the continuation flag, are you not mixing rewards of different episodes when you reset without setting `is_terminal` to `True`?
What I thought `is_last` should mean: the episode ended, but there might have been additional rewards in the future, so there is no need to update the value function at the last time step towards 0.
What I thought `is_terminal` meant: the episode ended and there will be no more rewards in the future, so you can update the value function at the final time step towards 0.
As to my questions:
Setting `is_last`, but not `is_terminal`, will leak rewards from the next episode? I am specifically referring to the implementation of the `score` method in the `VFunction` class.
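For reference, the wrapper behaviour described in this issue can be sketched as follows. `CountingEnv` and `TimeLimitSketch` are hypothetical stand-ins written for this example; they assume a dict-based `step` interface and are not the repository's `TimeLimit` wrapper.

```python
class CountingEnv:
    """Toy environment that never ends on its own (hypothetical)."""

    def step(self, action):
        return {"reward": 1.0, "is_last": False, "is_terminal": False}


class TimeLimitSketch:
    """Cuts episodes off after `duration` steps (illustrative only)."""

    def __init__(self, env, duration):
        self._env = env
        self._duration = duration
        self._step = 0

    def step(self, action):
        obs = dict(self._env.step(action))
        self._step += 1
        if self._step >= self._duration:
            obs["is_last"] = True  # the episode is cut off here
            self._step = 0         # the next call starts a new episode
            # is_terminal stays False: the cutoff is not a true ending.
        return obs


env = TimeLimitSketch(CountingEnv(), duration=3)
trajectory = [env.step(action=None) for _ in range(4)]
flags = [(obs["is_last"], obs["is_terminal"]) for obs in trajectory]
```

The third step comes out with `is_last=True` and `is_terminal=False`, which is the combination the question is about: a continuation flag derived from `is_terminal` alone stays at 1 across that boundary.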