danijar / dreamerv3

Mastering Diverse Domains through World Models
https://danijar.com/dreamerv3
MIT License

[Question] Regarding training the reward head #58

Closed EdanToledo closed 1 year ago

EdanToledo commented 1 year ago

Hello,

I had a quick question about the reward head that I was hoping you could help clarify. In the diagrams in the paper, you seem to predict the reward of the current transition from the successor state, i.e., given the transition (s_1, a_1, r_1, s_2) you would predict r_1 using the posterior state encoded from s_2. This makes sense to me, since you need the action and the processed hidden state to get the reward. But in the code, the reward seems to be shifted back by 1: you process the sequence of observations to get a sequence of posteriors, and then you train the reward head on the reward sequence against these posteriors, where the reward sequence starts from r_1 but the first posterior in the sequence is z_1. So I was wondering whether you pad the reward sequence at the beginning, or whether I am misunderstanding how the world model works. I've attached an image to illustrate my confusion:

[Image: dreamer]

In my image, the _init variables are just the initial masked variables and/or the learned starting state. I assume the prev_action in this case is just zeros.

Thank you.
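
For concreteness, here is a minimal sketch (plain Python with placeholder names, not the repository's code) of the two candidate alignments the question describes: pairing each reward with the posterior of the next observation versus pairing it with the posterior at the same index.

```python
# Placeholder sequences for a replayed sub-sequence of length 3.
posteriors = ["z1", "z2", "z3"]  # one posterior per observation s1, s2, s3
rewards = ["r1", "r2", "r3"]     # rewards as stored alongside the sequence

# Alignment A: r_t is predicted from the posterior of the *next* observation,
# so the targets must be shifted relative to the posteriors.
pairs_a = list(zip(posteriors[1:], rewards[:-1]))  # [('z2', 'r1'), ('z3', 'r2')]

# Alignment B: the reward head is trained on (z_t, r_t) with no shift.
pairs_b = list(zip(posteriors, rewards))           # [('z1', 'r1'), ('z2', 'r2'), ('z3', 'r3')]

print(pairs_a)
print(pairs_b)
```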

danijar commented 1 year ago

Hi, good question. The time-step alignment used in this repository is the one shown in the first line, not the second (and it is also what the newer edition of Sutton & Barto recommends):

r1=0, s1 -> a1 -> r2, s2 -> a2 -> ...
s1 -> a1 -> r1, s2 -> a2 -> ...

In other words, the reward at some index is the consequence of the action one index earlier, not of the action at the same index.
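
Under this convention, a minimal sketch (plain Python with hypothetical names and arbitrary example values, not the repository's code) of how the stored sequence lines up: the reward at index t arrives together with observation s_t, is the consequence of action a_{t-1}, and the first reward of an episode is 0, so the reward head can be trained on (z_t, r_t) without any shifting or padding.

```python
T = 4
observations = [f"s{t}" for t in range(1, T + 1)]  # s1, s2, s3, s4
rewards = [0.0, 0.3, -0.1, 1.0]                    # r1 = 0 by convention; values are arbitrary

for t in range(T):
    z_t = f"z{t + 1}"      # posterior computed from (h_t, s_t); h_t already carries a_{t-1}
    target = rewards[t]    # reward that arrived together with the same observation
    # In the actual model, reward_head(z_t) would be regressed toward `target`.
    print(z_t, "->", target)
```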