LyWangPX / Reinforcement-Learning-2nd-Edition-by-Sutton-Exercise-Solutions

Solutions of Reinforcement Learning, An Introduction
MIT License

Exercise 6.1 #79

Open RangerChu opened 3 years ago

RangerChu commented 3 years ago

V_t denotes the array of state values used at time t in the TD error (6.5) and in the TD update (6.2). delta_t is calculated at time t+1.

[Attached image: QQ图片20210323111639]

At time t+1 the agent updates only V(S_t); the values of all other states remain unchanged.

[Attached image: QQ图片20210323111643]

[Attached image: IMG_20210323_113648_edit_190878386338582]

ehddnr747 commented 3 years ago

You misunderstand the problem definition. Even though delta_t is calculated at time step t+1, using R_{t+1} and S_{t+1}, the value array used to calculate it is still called V_t.
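
For reference, here is a sketch of how the derivation might go under this convention. It assumes the tabular TD(0) update from (6.2), V_{t+1}(S_t) = V_t(S_t) + alpha * delta_t, with every other entry of V left unchanged, and it uses the book's symbols G_t, gamma, and T. It is one reading of the exercise rather than an official solution.

```latex
\begin{align*}
\delta_t &\doteq R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t) \\[4pt]
G_t - V_t(S_t)
  &= R_{t+1} + \gamma G_{t+1} - V_t(S_t) + \gamma V_t(S_{t+1}) - \gamma V_t(S_{t+1}) \\
  &= \delta_t + \gamma \bigl( G_{t+1} - V_t(S_{t+1}) \bigr) \\
  &= \delta_t + \gamma \bigl( G_{t+1} - V_{t+1}(S_{t+1}) \bigr)
     + \gamma \bigl( V_{t+1}(S_{t+1}) - V_t(S_{t+1}) \bigr) \\
  &= \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k
     + \sum_{k=t}^{T-1} \gamma^{k-t+1} \bigl( V_{k+1}(S_{k+1}) - V_k(S_{k+1}) \bigr)
\end{align*}
```

The last step unrolls the recursion down to G_T - V_T(S_T) = 0. The second sum is the additional amount. Because the update at step k changes only V(S_k), each of its terms equals alpha * delta_k when S_{k+1} = S_k and is zero otherwise.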

zexiangliu commented 3 years ago

I agree with @ehddnr747 that V_t, not V_{t+1}, is used to calculate delta_t. If that is fixed in @RangerChu's answer, we should have a correct solution.
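
A quick numerical check can help confirm which indexing is consistent. The sketch below is not taken from the repository. It assumes a hypothetical toy random walk with self-loops, a tabular TD(0) learner, and illustrative function names (`td0_episode`, `identity_gap`). It records the array V_t used at each step and compares the Monte Carlo error G_t - V_t(S_t) with the discounted sum of TD errors plus the term gamma^{k-t+1} * (V_{k+1}(S_{k+1}) - V_k(S_{k+1})).

```python
import random


def td0_episode(n_states=5, alpha=0.1, gamma=0.9, seed=0):
    """Run one TD(0) episode on a toy random walk with self-loops (an assumed environment).

    Returns (states, rewards, snapshots) where states[t] = S_t,
    rewards[t] = R_{t+1}, and snapshots[t] = V_t (a copy taken before the update).
    """
    rng = random.Random(seed)
    terminal = n_states                 # single terminal state, value fixed at 0
    V = [0.5] * n_states + [0.0]
    s = n_states // 2
    states, rewards, snapshots = [s], [], []
    while s != terminal:
        nxt = s + rng.choice([-1, 0, 1])    # 0 means the agent stays in place
        if nxt < 0:
            nxt, r = terminal, 0.0          # fell off the left end
        elif nxt >= n_states:
            nxt, r = terminal, 1.0          # fell off the right end
        else:
            r = 0.0
        snapshots.append(list(V))           # V_t
        delta = r + gamma * V[nxt] - V[s]   # TD error computed with V_t
        V[s] += alpha * delta               # only V(S_t) changes
        states.append(nxt)
        rewards.append(r)
        s = nxt
    snapshots.append(list(V))               # V_T
    return states, rewards, snapshots


def identity_gap(states, rewards, snapshots, gamma=0.9):
    """Largest |MC error - (sum of TD errors + extra term)| over all t."""
    T = len(rewards)
    G = [0.0] * (T + 1)
    for t in range(T - 1, -1, -1):          # returns, computed backwards
        G[t] = rewards[t] + gamma * G[t + 1]
    deltas = [
        rewards[k] + gamma * snapshots[k][states[k + 1]] - snapshots[k][states[k]]
        for k in range(T)
    ]
    gap = 0.0
    for t in range(T):
        mc_error = G[t] - snapshots[t][states[t]]
        td_sum = sum(gamma ** (k - t) * deltas[k] for k in range(t, T))
        extra = sum(
            gamma ** (k - t + 1)
            * (snapshots[k + 1][states[k + 1]] - snapshots[k][states[k + 1]])
            for k in range(t, T)
        )
        gap = max(gap, abs(mc_error - td_sum - extra))
    return gap


if __name__ == "__main__":
    episode = td0_episode(seed=0)
    print("max gap:", identity_gap(*episode))
```

The self-loops are there on purpose: without them S_{k+1} never equals S_k, the extra term is always zero, and the check would not exercise the correction at all.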