Open RangerChu opened 3 years ago
You misunderstand the problem definition. Even though delta_t is calculated at time step t+1, using R_{t+1} and S_{t+1}, the value array used to calculate it is still called V_t.
I agree with @ehddnr747 that V_t, not V_{t+1}, is used to calculate delta_t. If that is fixed in @RangerChu's answer, we should have a correct solution.
V_t denotes the array of state values used at time t in the TD error (6.5) and in the TD update (6.2). delta_t itself is calculated at time t+1.
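Written out with that convention, and assuming alpha is the step size and gamma the discount factor (neither symbol appears in this thread), the TD error and TD update would read roughly as:

```latex
\begin{align*}
\delta_t       &= R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t) \\
V_{t+1}(S_t)   &= V_t(S_t) + \alpha\,\delta_t \\
V_{t+1}(s)     &= V_t(s) \quad \text{for all } s \neq S_t
\end{align*}
```

Both value terms inside delta_t carry the subscript t, even though R_{t+1} and S_{t+1} are only observed at time t+1.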
At time t+1 the agent updates only the value of S_t; the values of all other states remain unchanged.
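A minimal Python sketch of a single tabular TD(0) step may make the indexing concrete. The function name `td0_step`, the dict-based value table, and the numbers used are all hypothetical and chosen only for illustration; `alpha` (step size) and `gamma` (discount) are assumed parameters not mentioned in this thread.

```python
def td0_step(V_t, s_t, r_next, s_next, alpha=0.1, gamma=1.0):
    """One tabular TD(0) step (illustrative sketch).

    V_t is the value table in use at time t. It is not modified;
    a new table V_{t+1} is returned together with delta_t.
    """
    # delta_t is computed at time t+1 (once r_next and s_next are known),
    # but both value lookups use the old table V_t.
    delta_t = r_next + gamma * V_t[s_next] - V_t[s_t]

    # Only the entry for S_t changes; every other state keeps its V_t value.
    V_next = dict(V_t)
    V_next[s_t] = V_t[s_t] + alpha * delta_t
    return V_next, delta_t


# Hypothetical three-state example.
V_t = {"A": 0.5, "B": 0.5, "C": 0.0}
V_t1, delta_t = td0_step(V_t, s_t="A", r_next=1.0, s_next="B")
changed = [s for s in V_t if V_t1[s] != V_t[s]]
print(delta_t, changed)
```

Here `changed` lists only the state that was visited at time t, which is the point made above: V_{t+1} differs from V_t in the single entry S_t.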