danijar / dreamerv3

Mastering Diverse Domains through World Models
https://danijar.com/dreamerv3
MIT License

Index issue in score function of vf? #70

Closed Jogima-cyber closed 1 year ago

Jogima-cyber commented 1 year ago

Hi, I hope I'm not mistaken, but I think there is an index issue in the score function of the value function approximator. If I understand everything correctly, we want the predicted rewards of all transitions, including the starting one but not the last one:

[Screenshot, 2023-06-20]

But it appears to me that the implementation takes the predicted rewards of all transitions including the last one but excluding the first one: https://github.com/danijar/dreamerv3/blob/423291a9875bb9af43b6db7150aaa972ba889266/dreamerv3/agent.py#L360-L362 https://github.com/danijar/dreamerv3/blob/8fa35f83eee1ce7e10f3dee0b766587d0a713a60/dreamerv3/behaviors.py#L14 So maybe there is an index inversion somewhere in the code, but I didn't find one. The continuation predictor has the same issue: https://github.com/danijar/dreamerv3/blob/8fa35f83eee1ce7e10f3dee0b766587d0a713a60/dreamerv3/agent.py#L364
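To make the concern concrete, here is a toy sketch of the slicing being described (the shapes, the `reward_head` stand-in, and the horizon value are assumptions for illustration, not the repo's actual code): an imagined rollout of horizon H yields H+1 latent states, the reward head is applied to all of them, and the result is sliced with `[1:]`, which drops the prediction at the starting state and keeps the one at the final state.

```python
import numpy as np

H = 15                                  # imagination horizon (toy value)
traj_states = np.zeros((H + 1, 32))     # stand-ins for (h_0,z_0) ... (h_H,z_H)

def reward_head(states):
    # toy stand-in for wm.heads['reward'](traj).mean()
    return states.sum(-1)

# The slicing under discussion: predictions at indices 1..H are kept,
# the prediction at index 0 (the starting state) is dropped.
rew = reward_head(traj_states)[1:]
assert rew.shape == (H,)                # one reward per imagined step
```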

Jogima-cyber commented 1 year ago

Okay, I've checked again and understood how it works. No issue actually. Sorry about that.

czp16 commented 8 months ago

Hi @Jogima-cyber, could you explain how it works? I'm still confused about the code here. It indicates that we should use wm.heads['reward'](traj).mean()[1] to compute the reward of h[0] and z[0] (the first deterministic and stochastic states in traj).

Does this mean that the reward head is computing $(h_{t+1}, z_{t+1}) \to r_t$ instead of $(h_t, z_t) \to r_t$?

jren03 commented 5 months ago

Hi @czp16, if I am not mistaken from @danijar's responses in these two posts [1, 2], the notation actually suggests that the reward head predicts $(h_t, z_t) \rightarrow r_{t+1}$. I found the diagram in [1] to be pretty helpful.
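A minimal sketch of that convention (the names and toy reward values below are assumptions for illustration, not DreamerV3's actual code): rewards are stored with a "received on arrival" alignment, so the reward at index t accompanies the state at index t, i.e. it belongs to the transition that led into that state. The initial state has no incoming transition and hence no reward, which is why the prediction at index 0 is discarded with `[1:]`.

```python
import numpy as np

T = 4
states = np.arange(T + 1)              # s_0 .. s_T (toy stand-ins for (h_t, z_t))
rewards = np.full(T + 1, np.nan)       # reward aligned with each state index
rewards[1:] = [1.0, 0.0, 2.0, 1.0]     # r_t is received when arriving in s_t

def reward_head(s):
    # a "perfect" head under this convention: from state index t it
    # predicts the reward aligned with that same index
    return rewards[s]

preds = reward_head(states)[1:]        # drop the undefined initial reward
assert preds.shape == (T,)             # exactly one reward per transition
assert not np.isnan(preds).any()
```

Under this alignment, slicing with `[1:]` is not an off-by-one error: it simply removes the one index that has no reward defined, leaving predictions that match the T imagined transitions one-to-one.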