Closed EdanToledo closed 1 year ago
Hi, good question. The time-step alignment used in this repository is the one shown on the first line below, not the second (it is also the alignment the newer Sutton & Barto book recommends):
r1=0, s1 -> a1 -> r2, s2 -> a2 -> ...
s1 -> a1 -> r1, s2 -> a2 -> ...
In other words, the reward at index t is the consequence of the action at index t-1, not of the action at index t.
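To make the alignment concrete, here is a minimal sketch (not the repository's actual code; all names are hypothetical) of collecting an episode so that each stored timestep is (r_t, s_t, a_t), with r_t the reward caused by a_{t-1} and the very first reward padded with 0:

```python
def collect_episode(reset, step, policy, num_steps):
    """Roll out `num_steps` timesteps, storing (reward, obs, action)
    with each reward aligned one index *after* the action that caused it."""
    obs = reset()
    reward = 0.0  # r1 = 0: no action has been taken yet
    episode = []
    for _ in range(num_steps):
        action = policy(obs)
        episode.append((reward, obs, action))
        reward, obs = step(action)  # r_{t+1}, s_{t+1} follow a_t
    return episode

# Toy deterministic environment for illustration: observation s_t is the
# string "s<t>", and taking action a_t yields reward t and observation s_{t+1}.
def make_toy_env():
    state = {"t": 1}
    def reset():
        state["t"] = 1
        return "s1"
    def step(action):
        t = state["t"]
        state["t"] = t + 1
        return float(t), f"s{t + 1}"
    return reset, step

reset, step = make_toy_env()
episode = collect_episode(reset, step, lambda obs: "a" + obs[1:], 3)
print(episode)  # [(0.0, 's1', 'a1'), (1.0, 's2', 'a2'), (2.0, 's3', 'a3')]
```

Note how each reward sits next to the observation it arrived with, one index after the action that produced it, matching the first line above.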
Hello,
I had a quick question about the reward head that I was hoping you could help clarify. In the diagrams in the paper, you seem to predict the reward of the current transition from the successor state, i.e. given the transition (s_1, a_1, r_1, s_2), you would predict r_1 from the posterior state encoded from s_2. This makes sense to me, since you need the action and the processed hidden state to obtain the reward.

In the code, however, the reward appears to be shifted back by one step: the sequence of observations is processed into a sequence of posteriors, and the reward head is then trained on the reward sequence using these posteriors, where the reward sequence starts from r_1 but the first posterior in the sequence is z_1. So I was wondering whether you pad the reward sequence at the beginning, or whether I am misunderstanding how the world model works. I've attached an image to illustrate my confusion:
In my image, the _init variables are just the initial masked variables and/or the learned starting state. I assume the prev_action in this case is just zeros.
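To make the index shift I am describing concrete, here is a tiny sketch (all variable names are mine, not taken from the codebase). If the reward sequence is zero-padded at the front, the targets line up with the posteriors with no further shifting:

```python
# One posterior per observation, encoded from s_1, s_2, s_3:
posteriors = ["z1", "z2", "z3"]

# Rewards returned by the environment after actions a_1 and a_2:
env_rewards = [1.0, 2.0]

# Hypothesis: pad with r_1 = 0 so lengths (and indices) match.
reward_targets = [0.0] + env_rewards   # [r_1 = 0, r_2, r_3]
assert len(reward_targets) == len(posteriors)

pairs = list(zip(posteriors, reward_targets))
print(pairs)  # [('z1', 0.0), ('z2', 1.0), ('z3', 2.0)]
```

Under this hypothesis, z_1 is simply trained to predict the padded reward of 0, which is what I am asking about.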
Thank you.