danijar / dreamerv3

Mastering Diverse Domains through World Models
https://danijar.com/dreamerv3
MIT License

Meaning of `is_last` and `is_terminal` and their effect on the value function #87

Closed Mauhn-Bjarne closed 1 year ago

Mauhn-Bjarne commented 1 year ago

Hey Danijar,

I noticed that some environments and wrappers (for example, the TimeLimit wrapper) set `is_last` to True without setting `is_terminal` to True. From what I can tell, the `is_last` flag ensures that the environment is reset on the next iteration, whereas the `is_terminal` flag is what sets the continuation flag. Since the value function only looks at the continuation flag, are you not mixing rewards from different episodes when you reset without setting `is_terminal` to True?

My understanding of `is_last` was: the episode ended, but additional rewards might have followed had it continued, so the value function at the last time step should not be updated towards 0.

My understanding of `is_terminal` was: the episode ended and no further rewards can follow, so the value function at the final time step can be updated towards 0.
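To make the distinction concrete, here is a minimal sketch of a time-limit wrapper that truncates an episode by setting `is_last` while leaving `is_terminal` untouched. The classes `CountdownEnv` and `ToyTimeLimit` and the dict-based `step` interface are hypothetical and written only for illustration; this is not the repository's TimeLimit wrapper.

```python
class CountdownEnv:
    """Hypothetical toy environment that truly terminates after `horizon` steps."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def step(self, action):
        reset = action.get("reset", False)
        self.t = 0 if reset else self.t + 1
        done = self.t >= self.horizon
        return {
            "reward": 0.0 if reset else 1.0,
            "is_first": reset,
            "is_last": done,       # episode is over
            "is_terminal": done,   # and no future rewards exist
        }


class ToyTimeLimit:
    """Hypothetical wrapper: cuts the episode after `limit` steps (truncation)."""

    def __init__(self, env, limit):
        self.env = env
        self.limit = limit
        self.t = 0

    def step(self, action):
        if action.get("reset", False):
            self.t = 0
            return self.env.step(action)
        self.t += 1
        obs = dict(self.env.step(action))
        if self.t >= self.limit:
            # The next step will come from a new episode ...
            obs["is_last"] = True
            # ... but is_terminal is left as the inner environment reported it,
            # because the task itself did not end here.
        return obs
```

With `ToyTimeLimit(CountdownEnv(horizon=10), limit=3)`, the third step after a reset reports `is_last=True` and `is_terminal=False`, which is the combination this question is about.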

As to my questions:

danijar commented 1 year ago

Yes, exactly. `is_last` means the episode ended (so the following time step will be from a different episode), whereas `is_terminal` indicates to the algorithm that it should not consider future rewards, e.g. a TD target should bootstrap against zero. In non-episodic tasks like many DMC tasks, `is_last` is set but `is_terminal` isn't. In episodic tasks like Atari, both are set.
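As a rough illustration of "bootstrap against zero", the sketch below computes one-step TD targets from a continuation flag defined as `1 - is_terminal`. The function name `td_targets`, the array layout, and the convention that `reward[t + 1]` is received on arriving at step `t + 1` are assumptions made for this example; the repository's value targets are computed differently in detail.

```python
import numpy as np


def td_targets(reward, value, is_terminal, gamma=0.99):
    """One-step TD targets for steps 0..T-1 of a sequence of length T+1.

    Assumed convention: reward[t + 1] and is_terminal[t + 1] belong to the
    step that is reached from step t.
    """
    reward = np.asarray(reward, dtype=np.float64)
    value = np.asarray(value, dtype=np.float64)
    # Continuation flag: 0 where the next step is terminal, 1 otherwise.
    cont = 1.0 - np.asarray(is_terminal, dtype=np.float64)
    # A terminal next step contributes only its reward (bootstrap against zero).
    return reward[1:] + gamma * cont[1:] * value[1:]


targets = td_targets(
    reward=[0.0, 1.0, 1.0],
    value=[5.0, 5.0, 5.0],
    is_terminal=[False, False, True],
    gamma=0.5,
)
print(targets)  # [3.5 1. ]
```

Note that `is_last` does not appear in this computation at all, which is why a truncated step with `is_terminal=False` still bootstraps from whatever value follows it in the sequence.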

Mauhn-Bjarne commented 1 year ago

Thank you for the explanation. It is now clear to me when to set `is_last` and `is_terminal`. However, I am still a bit confused about the updates to the value function when `is_terminal` is False.

If at time step T we set `is_last=True` and `is_terminal=False`, would it not be logical for V(s_T) to receive no update? Because:

Your code, however, will look at the first rewards and values of the next episode and use these to update V(s_T). Why would you do this? Why don't you set the target for V(s_T) to V(s_T) itself, since we are not able to obtain a better target at the moment?
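The alternative proposed here can be sketched as follows: at every step where `is_last` is set, the target is replaced by the current value estimate, so the TD error is zero and the step can also be masked out of the loss. This only illustrates the question; it is not taken from the DreamerV3 code, and the function name `masked_td_targets` and the array conventions are assumptions.

```python
import numpy as np


def masked_td_targets(reward, value, is_last, is_terminal, gamma=0.99):
    """One-step TD targets that never look across an episode boundary.

    Assumed convention: reward[t + 1] and is_terminal[t + 1] belong to the
    step reached from step t. Returns (targets, weights) for steps 0..T-1.
    """
    reward = np.asarray(reward, dtype=np.float64)
    value = np.asarray(value, dtype=np.float64)
    is_last = np.asarray(is_last, dtype=bool)
    cont = 1.0 - np.asarray(is_terminal, dtype=np.float64)

    # Ordinary one-step target, bootstrapping against zero at terminals.
    target = reward[1:] + gamma * cont[1:] * value[1:]

    # If step t ends an episode, step t + 1 belongs to a different episode,
    # so fall back to V(s_t) itself as the target (zero TD error).
    boundary = is_last[:-1]
    target = np.where(boundary, value[:-1], target)

    # Equivalent view: give boundary steps zero weight in the value loss.
    weight = 1.0 - boundary.astype(np.float64)
    return target, weight


targets, weights = masked_td_targets(
    reward=[0.0, 1.0, 0.0, 2.0],
    value=[4.0, 3.0, 7.0, 6.0],
    is_last=[False, True, False, False],
    is_terminal=[False, False, False, False],
    gamma=0.5,
)
print(targets)  # [2.5 3.  5. ]
print(weights)  # [1. 0. 1.]
```

For a true terminal step the mask loses nothing, because the target of the preceding step already bootstraps against zero and the value of the final state is never needed.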

I appreciate the clarification 🙂