Closed EdanToledo closed 1 year ago
Hi, good question. The time-step alignment used in this repository is the one shown on the first line below, not the second (it is also the alignment the newer Sutton & Barto book recommends):
r1=0, s1 -> a1 -> r2, s2 -> a2 -> ...
s1 -> a1 -> r1, s2 -> a2 -> ...
In other words, the reward at index t is the consequence of the action at index t-1, not of the action at index t.
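To make the alignment concrete, here is a minimal sketch (not the repository's actual code; all names are hypothetical) of collecting an episode so that each stored timestep is (r_t, s_t, a_t), with r_t the reward caused by a_{t-1} and the very first reward padded with 0:

```python
def collect_episode(reset, step, policy, num_steps):
    """Roll out `num_steps` timesteps, storing (reward, obs, action)
    with each reward aligned one index *after* the action that caused it."""
    obs = reset()
    reward = 0.0  # r1 = 0: no action has been taken yet
    episode = []
    for _ in range(num_steps):
        action = policy(obs)
        episode.append((reward, obs, action))
        reward, obs = step(action)  # r_{t+1}, s_{t+1} follow a_t
    return episode

# Toy deterministic environment for illustration: observation s_t is the
# string "s<t>", and taking action a_t yields reward t and observation s_{t+1}.
def make_toy_env():
    state = {"t": 1}
    def reset():
        state["t"] = 1
        return "s1"
    def step(action):
        t = state["t"]
        state["t"] = t + 1
        return float(t), f"s{t + 1}"
    return reset, step

reset, step = make_toy_env()
episode = collect_episode(reset, step, lambda obs: "a" + obs[1:], 3)
print(episode)  # [(0.0, 's1', 'a1'), (1.0, 's2', 'a2'), (2.0, 's3', 'a3')]
```

Note how each reward sits next to the observation it arrived with, one index after the action that produced it, matching the first line above.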
Hello,
I had a quick question about the reward head that I was hoping you could help clarify. In the diagrams in the paper, you seem to predict the reward of the current transition from the successor state, i.e. given the transition (s_1, a_1, r_1, s_2), you would predict r_1 from the posterior state encoded from s_2. This makes sense to me, since you need the action and the processed hidden state to obtain the reward.

In the code, however, the reward appears to be shifted back by one step: the sequence of observations is processed into a sequence of posteriors, and the reward head is then trained on the reward sequence using these posteriors, where the reward sequence starts from r_1 but the first posterior in the sequence is z_1. So I was wondering whether you pad the reward sequence at the beginning, or whether I am misunderstanding how the world model works. I've attached an image to illustrate my confusion:
In my image, the _init variables are just the initial masked variables and/or the learned starting state. I assume the prev_action in this case is just zeros.
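To make the index shift I am describing concrete, here is a tiny sketch (all variable names are mine, not taken from the codebase). If the reward sequence is zero-padded at the front, the targets line up with the posteriors with no further shifting:

```python
# One posterior per observation, encoded from s_1, s_2, s_3:
posteriors = ["z1", "z2", "z3"]

# Rewards returned by the environment after actions a_1 and a_2:
env_rewards = [1.0, 2.0]

# Hypothesis: pad with r_1 = 0 so lengths (and indices) match.
reward_targets = [0.0] + env_rewards   # [r_1 = 0, r_2, r_3]
assert len(reward_targets) == len(posteriors)

pairs = list(zip(posteriors, reward_targets))
print(pairs)  # [('z1', 0.0), ('z2', 1.0), ('z3', 2.0)]
```

Under this hypothesis, z_1 is simply trained to predict the padded reward of 0, which is what I am asking about.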
Thank you.