Closed: lewisboyd closed this issue 3 years ago
Yes, that's exactly what's happening. You can see that the equation says t < H, not t <= H.
My concern is about the second line, which handles the t = H case, since you could rewrite it as V_H = r_H + γ_H * v(s_H).
Sorry, I think the way I wrote that originally was confusing, since I wasn't distinguishing between the lambda target, V, and the value network, v.
Ah, I see. You're right that the equation isn't quite correct for the last time step. The implementation only uses the value at the last step, not the reward, as you suggested.
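In case it helps anyone reading this later, here's a minimal sketch of that computation. It is not the actual code from this repo; the function and array names are placeholders, and it assumes the standard recursive form of the lambda-return with per-step discounts:

```python
import numpy as np

def lambda_targets(rewards, values, discounts, lam=0.95):
    """Compute lambda targets for an imagined trajectory.

    All inputs are 1-D arrays of length H; index H-1 corresponds to the
    last step (t = H in the notation above).

    rewards[t]   -> r_t
    values[t]    -> v(s_t), the value network's estimate
    discounts[t] -> gamma_t
    """
    H = len(rewards)
    targets = np.zeros(H)
    # Last step: bootstrap from the value estimate only, no reward term.
    targets[-1] = values[-1]
    # Work backwards: V_t = r_t + gamma_t * ((1 - lam) * v(s_{t+1}) + lam * V_{t+1})
    for t in reversed(range(H - 1)):
        mix = (1 - lam) * values[t + 1] + lam * targets[t + 1]
        targets[t] = rewards[t] + discounts[t] * mix
    # Note that rewards[-1] never enters the computation.
    return targets
```

The key line is `targets[-1] = values[-1]`: the final step contributes its value estimate but never its reward.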
Okay cool, thanks for clarifying! :)
Hi,
I have a question about how you calculate the lambda_target as seen in the equation below.
I've been implementing it to work directly in the environment rather than with the model states, to test out how it works, and something occurred to me. On the final step, i.e. when t = H, are you not accounting for the reward twice, since the value network is already trained to incorporate the reward of a state into the value for that state? Would it not be more valid to instead stop the calculation at H-1 and use the final model state s_H only for bootstrapping, so that the target calculation would become V(s_{H-1}) = r_{H-1} + γ_{H-1} * V(s_H)?
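To make the concern concrete, here's a toy example with made-up numbers, taking lambda = 1 just to keep the arithmetic simple, and assuming (as above) that v(s_H) already reflects r_H:

```python
# Made-up numbers, purely illustrative.
r_prev, r_last = 1.0, 2.0   # r_{H-1} and r_H
gamma = 0.99                # per-step discount
v_last = 5.0                # v(s_H); by assumption it already reflects r_H onward

# Stopping at H-1 and bootstrapping from s_H counts r_H only once, inside v(s_H):
target_bootstrap_only = r_prev + gamma * v_last                       # 5.95

# Treating the last step as V_H = r_H + gamma * v(s_H) adds r_H again on top
# of what v(s_H) already contains:
target_with_last_reward = r_prev + gamma * (r_last + gamma * v_last)  # 7.8805
```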
Thanks again, Lewis