Closed: Jogima-cyber closed this issue 1 year ago.
Okay, I've checked again and understood how it works. There is actually no issue. Sorry about that.
Hi @Jogima-cyber, could you explain how it works? I'm still confused about the code here. It indicates that we should use wm.heads['reward'](traj).mean()[1] to compute the reward of h[0] and z[0] (the first deterministic and stochastic states in traj).

Does this mean that the reward head is computing $(h_{t+1}, z_{t+1}) \to r_t$ instead of $(h_t, z_t) \to r_t$?
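To make my confusion concrete, here is a minimal NumPy sketch of the convention I think the code implies, where the reward $r_t$ arrives together with state $t$. Note that `reward_head` below is a hypothetical stand-in for illustration, not the actual wm.heads['reward'] API:

```python
import numpy as np

H = 5                      # imagination horizon
states = np.arange(H + 1)  # indices of the (h_t, z_t) pairs: 0..H

def reward_head(state_indices):
    # Hypothetical stand-in for wm.heads['reward'](traj).mean():
    # pretend the predicted reward on entering state t is simply t.
    return state_indices.astype(float)

rew = reward_head(states)

# If r_t is the reward received upon *entering* state t, then rew[0]
# belongs to a transition that happened before imagination started,
# and the reward for the transition out of (h[0], z[0]) is rew[1]:
print(rew[1])   # 1.0
print(rew[1:])  # [1. 2. 3. 4. 5.] -> rewards of the H imagined transitions
```

Under that convention, dropping index 0 and keeping indices 1..H would be exactly right, which might be what the closing comment means by "no issue actually".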
Hi, I hope I'm not mistaken, but I think there is an index issue in the score function of the value function approximator. If I understand everything correctly, we want the predicted rewards of all the transitions, including the first one but not the last one.

However, it appears to me that the implementation takes the predicted rewards of all transitions including the last one but excluding the first one:

https://github.com/danijar/dreamerv3/blob/423291a9875bb9af43b6db7150aaa972ba889266/dreamerv3/agent.py#L360-L362
https://github.com/danijar/dreamerv3/blob/8fa35f83eee1ce7e10f3dee0b766587d0a713a60/dreamerv3/behaviors.py#L14

So maybe there is an index inversion somewhere in the code, but I didn't see one. There is the same issue with the continuation predictor:

https://github.com/danijar/dreamerv3/blob/8fa35f83eee1ce7e10f3dee0b766587d0a713a60/dreamerv3/agent.py#L364
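To show why the alignment matters, here is a toy, self-contained λ-return over an imagined trajectory. The function and the numbers are mine for illustration, not the agent.py code; the point is only that whichever slicing is used, rew, cont, and value must all refer to the same transition at index t:

```python
import numpy as np

def lambda_return(rew, cont, value, lam=0.95):
    # Toy backward recursion for the lambda-return target:
    # ret_t = rew_t + cont_t * ((1 - lam) * value_{t+1} + lam * ret_{t+1}),
    # bootstrapped from value[-1]. Illustrative only, not agent.py.
    ret = value[-1]
    out = []
    for t in reversed(range(len(rew))):
        ret = rew[t] + cont[t] * ((1 - lam) * value[t + 1] + lam * ret)
        out.append(ret)
    return np.array(out[::-1])

rew = np.array([1.0, 1.0, 1.0])         # rewards of the 3 imagined transitions
cont = np.array([1.0, 1.0, 0.0])        # continuation flags (episode ends last)
value = np.array([0.5, 0.5, 0.5, 0.5])  # value estimates for states 0..3

# Whether rew comes from predictions[1:] or predictions[:-1], the key
# point is that rew[t] and cont[t] must describe the same transition
# that value[t] and value[t+1] bracket.
print(lambda_return(rew, cont, value))  # [2.90125 1.975   1.     ]
```

If the reward prediction at index t actually belongs to the transition into state t rather than out of it, then taking predictions[1:] rather than predictions[:-1] is the consistent choice, which would explain why there is no bug after all.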