Closed: hdadong closed this issue 1 year ago.
If you use the x_t -> a_t -> r_{t+1} notation as in the paper, the world model computes s_t from x_t and the history, and predicts r_{t+1} from s_t.
That means when the code predicts rewards as reward_predictor(s_sequence), we need to discard the first reward so that the time steps are aligned, as your second code link shows.
Then, to compute the returns, you need r_t and v_{t+1}. I do that by removing the first value from the value tensor, as your first code link shows.
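To make the alignment concrete, here is a minimal NumPy sketch of the backward λ-return recursion. This is not the repo code (the function name `lambda_returns` and the constant discount are illustrative assumptions; the actual implementation also folds in the predicted continuation flags). It assumes the reward and value arrays have already been shifted as described above, so index i holds r_{i+1} and v(s_{i+1}):

```python
import numpy as np

def lambda_returns(rewards, values, discount=0.997, lam=0.95):
    # Assumed shifting convention (see above): rewards[i] = r_{i+1} and
    # values[i] = v(s_{i+1}), so each reward is paired with the value of
    # the *next* state.
    returns = np.zeros_like(rewards)
    next_return = values[-1]  # bootstrap with the value of the final state
    for i in reversed(range(len(rewards))):
        # R_i = r_{i+1} + gamma * ((1 - lam) * v(s_{i+1}) + lam * R_{i+1})
        next_return = rewards[i] + discount * (
            (1 - lam) * values[i] + lam * next_return)
        returns[i] = next_return
    return returns

# Hypothetical example: imagined horizon of 4 steps after dropping the
# first predicted reward and the first value.
rewards = np.array([0.0, 1.0, 0.0, 1.0])  # r_1 .. r_4
values = np.array([0.5, 0.4, 0.6, 0.3])   # v(s_1) .. v(s_4)
print(lambda_returns(rewards, values))
```

With this shifting, using index i for both arrays inside the loop is exactly the "r_t with v_{t+1}" pairing from the paper, even though the code indexes both tensors at the same position.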
Hope that helps!
Hi, danijar! Great work! I have some questions about the code details. In your paper, the λ-return is computed from r_t and v(s_{t+1}), but in your code it looks like you use r_t and v(s_t). Why is that? Is there something wrong with my understanding? Here is the code for the λ-returns:
https://github.com/danijar/dreamerv3/blob/main/dreamerv3/agent.py#L367
https://github.com/danijar/dreamerv3/blob/main/dreamerv3/behaviors.py#L14