danijar / dreamerv2

Mastering Atari with Discrete World Models
https://danijar.com/dreamerv2
MIT License

Lambda Target Equation #25

Closed lewisboyd closed 3 years ago

lewisboyd commented 3 years ago

Hi,

I have a question about how you calculate the lambda_target as seen in the equation below.

[Screenshot of the lambda-target equation from the paper; in the notation used below it reads roughly:
V_t = r_t + y_t * ((1 - lambda) * v(s_{t+1}) + lambda * V_{t+1})  if t < H
V_t = r_t + y_t * v(s_H)                                          if t = H]

I've been implementing it to work directly in the environment rather than with the model states, to test how it behaves, and something occurred to me. On your final step, i.e. when t = H, are you not accounting for the reward twice, since the value network is already trained to incorporate the reward of a state into the value of that state? Would it not be more valid to instead stop the calculation at H-1 and use the final model state at H only for bootstrapping, so that the last target becomes V(s_{H-1}) = r_{H-1} + y_{H-1} * V(s_H)?
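(To spell the concern out in the notation above, assuming v(s_t) is trained to predict the return from s_t onward, including r_t: the t = H line as written gives V_H = r_H + y_H * v(s_H), so r_H shows up once explicitly and again, discounted, inside v(s_H). Stopping at H-1 instead gives V(s_{H-1}) = r_{H-1} + y_{H-1} * v(s_H), which uses the final state only to bootstrap.)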

Thanks again, Lewis

danijar commented 3 years ago

Yes, that's exactly what's happening. You can see that the equation says t < H, not t <= H.

lewisboyd commented 3 years ago

My concern is about the second line, which handles the t = H case, since you could rewrite it as V_H = r_H + y_H * v(s_H).

Sorry, I think the way I wrote that originally was confusing, since I wasn't distinguishing between the lambda target V and the value network v.

danijar commented 3 years ago

Ah, I see. You're right that the equation isn't quite correct for the last time step. The implementation only uses the value at the last step, not the reward, as you suggested.
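(For readers landing here, a minimal sketch of a lambda return computed this way, i.e. with the final value used only for bootstrapping. This is plain NumPy with illustrative names, not the repository's actual code:)

```python
import numpy as np

def lambda_targets(rewards, values, discounts, lam):
    """Compute lambda-return targets V_t for t = 0..H-1.

    rewards[t] and discounts[t] are r_t and y_t for t = 0..H-1;
    values[t] is v(s_t) for t = 0..H. The final value values[H] is
    used only to bootstrap, so its reward is never added on top of it.
    """
    H = len(rewards)
    targets = np.zeros(H)
    next_target = values[H]        # bootstrap with v(s_H) only
    for t in reversed(range(H)):   # t = H-1, ..., 0
        targets[t] = rewards[t] + discounts[t] * (
            (1 - lam) * values[t + 1] + lam * next_target
        )
        next_target = targets[t]
    return targets
```

For the last step this reduces to targets[H-1] = r_{H-1} + y_{H-1} * v(s_H), matching the correction above.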

lewisboyd commented 3 years ago

Okay cool thanks for clarifying! :)