Closed: hdadong closed this issue 1 year ago.
If you use the x_t -> a_t -> r_{t+1} notation as in the paper, the world model computes s_t from x_t and the history, and predicts r_{t+1} from s_t.
That means when the code predicts rewards as reward_predictor(s_sequence), we need to discard the first reward so that the time steps are aligned, as your second code link shows.
Then, to compute the returns, you need r_t and v_{t+1}. I do that by removing the first value from the value tensor, as your first code link shows.
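To make the alignment concrete, here is a minimal NumPy sketch of the backward λ-return recursion. This is not the repo code (the function name `lambda_returns` and the constant discount are illustrative assumptions; the actual implementation also folds in the predicted continuation flags). It assumes the reward and value arrays have already been shifted as described above, so index i holds r_{i+1} and v(s_{i+1}):

```python
import numpy as np

def lambda_returns(rewards, values, discount=0.997, lam=0.95):
    # Assumed shifting convention (see above): rewards[i] = r_{i+1} and
    # values[i] = v(s_{i+1}), so each reward is paired with the value of
    # the *next* state.
    returns = np.zeros_like(rewards)
    next_return = values[-1]  # bootstrap with the value of the final state
    for i in reversed(range(len(rewards))):
        # R_i = r_{i+1} + gamma * ((1 - lam) * v(s_{i+1}) + lam * R_{i+1})
        next_return = rewards[i] + discount * (
            (1 - lam) * values[i] + lam * next_return)
        returns[i] = next_return
    return returns

# Hypothetical example: imagined horizon of 4 steps after dropping the
# first predicted reward and the first value.
rewards = np.array([0.0, 1.0, 0.0, 1.0])  # r_1 .. r_4
values = np.array([0.5, 0.4, 0.6, 0.3])   # v(s_1) .. v(s_4)
print(lambda_returns(rewards, values))
```

With this shifting, using index i for both arrays inside the loop is exactly the "r_t with v_{t+1}" pairing from the paper, even though the code indexes both tensors at the same position.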
Hope that helps!
Hi, danijar! Great work! I have some questions about the code details. In your paper, the λ-return is computed from r_t and v(s_{t+1}), but in your code it looks like you use r_t and v(s_t). Why is that? Is there something wrong with my understanding? Here is the code for the λ-returns:
https://github.com/danijar/dreamerv3/blob/main/dreamerv3/agent.py#L367
https://github.com/danijar/dreamerv3/blob/main/dreamerv3/behaviors.py#L14