LyWangPX / Reinforcement-Learning-2nd-Edition-by-Sutton-Exercise-Solutions

Solutions of Reinforcement Learning: An Introduction

Ex 6.13 #52

Closed · burmecia closed this 4 years ago

burmecia commented 4 years ago

I think the update equations for Double Expected Sarsa with an epsilon-greedy target policy could be:

Q_{1}(S_{t},A_{t}) \leftarrow Q_{1}(S_{t},A_{t}) + \alpha\left[R_{t+1} + \gamma\sum_{a}\pi(a \mid S_{t+1})\,Q_{2}(S_{t+1},a) - Q_{1}(S_{t},A_{t})\right]

where

\pi(a \mid s) = \begin{cases} 1-\epsilon+\frac{\epsilon}{|A(s)|}, & \text{if } a = \arg\max_{a'}\left(Q_{1}(s,a') + Q_{2}(s,a')\right) \\ \frac{\epsilon}{|A(s)|}, & \text{otherwise} \end{cases}
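For concreteness, here is a minimal Python sketch of this update (the function and variable names are mine, and I assume tabular `Q1`, `Q2` stored as NumPy arrays of shape `(n_states, n_actions)`):

```python
import numpy as np

def target_policy_probs(Q1, Q2, s, eps):
    # Epsilon-greedy target policy pi(.|s), greedy with respect to Q1 + Q2
    n_actions = Q1.shape[1]
    probs = np.full(n_actions, eps / n_actions)
    greedy = np.argmax(Q1[s] + Q2[s])
    probs[greedy] += 1.0 - eps
    return probs

def double_expected_sarsa_update(Q1, Q2, s, a, r, s_next, alpha, gamma, eps):
    # Update Q1 toward the expectation of Q2 under pi, as in the equation above
    pi = target_policy_probs(Q1, Q2, s_next, eps)
    target = r + gamma * np.dot(pi, Q2[s_next])
    Q1[s, a] += alpha * (target - Q1[s, a])
```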
LyWangPX commented 4 years ago

Looks valid. I will add it to 6.13 and credit your name.

d-vesely commented 3 years ago

I think it should be made clear that Q_1 and Q_2 need to be swapped with probability 0.5 at each step of the episode.
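For illustration, a sketch of that per-step coin flip, reusing the hypothetical `double_expected_sarsa_update` helper from the sketch above (the first Q passed in is the one being updated):

```python
import random

# On each step, flip a fair coin to decide which estimate is updated,
# bootstrapping from the other one (illustrative sketch, not the book's pseudocode).
if random.random() < 0.5:
    double_expected_sarsa_update(Q1, Q2, s, a, r, s_next, alpha, gamma, eps)  # update Q1 using Q2
else:
    double_expected_sarsa_update(Q2, Q1, s, a, r, s_next, alpha, gamma, eps)  # update Q2 using Q1
```

Note that the target policy is epsilon-greedy with respect to Q1 + Q2, which is symmetric, so both branches use the same pi.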