Closed burmecia closed 4 years ago
I think the update equations for Double Expected Sarsa with epsilon-greedy target policy can be:
Q_{1}(S_{t},A_{t})\leftarrow Q_{1}(S_{t},A_{t}) + \alpha\left[R_{t+1}+\gamma\sum_a\pi(a|S_{t+1})Q_{2}(S_{t+1},a)-Q_{1}(S_{t},A_{t})\right]
where
\pi(a|s)=\begin{cases}1-\epsilon+\frac{\epsilon}{|A(s)|}, & \text{if } a=\arg\max_{a'}\left(Q_{1}(s,a')+Q_{2}(s,a')\right)\\ \frac{\epsilon}{|A(s)|}, & \text{otherwise}\end{cases}
Looks valid. I will add it to 6.13 and credit you by name.
I think it should be made clear that the roles of Q_1 and Q_2 need to be swapped with probability 0.5 at each step of the episode.
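For anyone who wants to check the equations numerically, here is a minimal Python sketch of one update step, including the 0.5-probability swap. It is only an illustration under stated assumptions, not the solution from the repository: the names `epsilon_greedy_probs` and `double_expected_sarsa_update` are made up, the tables are plain nested lists indexed as `Q[state][action]`, ties in the argmax go to the lowest action index, and terminal states are assumed to hold all-zero action values.

```python
import random


def epsilon_greedy_probs(q1_s, q2_s, epsilon):
    """Epsilon-greedy target policy pi(.|s) with respect to Q1 + Q2.

    q1_s and q2_s are the action-value lists of both tables for one state.
    Ties in the argmax go to the lowest action index (an assumption).
    """
    n = len(q1_s)
    combined = [x + y for x, y in zip(q1_s, q2_s)]
    greedy = max(range(n), key=combined.__getitem__)
    probs = [epsilon / n] * n        # every action gets epsilon / |A(s)|
    probs[greedy] += 1.0 - epsilon   # the greedy action gets the remainder
    return probs


def double_expected_sarsa_update(Q1, Q2, s, a, r, s_next,
                                 alpha, gamma, epsilon, rng=random):
    """Apply one Double Expected Sarsa update in place.

    With probability 0.5 the roles of the two tables are swapped, so each
    table is updated about half of the time. Terminal states are assumed
    to have all-zero action values (an assumption of this sketch).
    """
    if rng.random() < 0.5:
        Q1, Q2 = Q2, Q1  # swap roles for this step only
    probs = epsilon_greedy_probs(Q1[s_next], Q2[s_next], epsilon)
    # Expectation under pi, evaluated with the *other* table.
    expected = sum(p * q for p, q in zip(probs, Q2[s_next]))
    Q1[s][a] += alpha * (r + gamma * expected - Q1[s][a])
```

The behaviour policy used to pick A_t is left out on purpose; any policy that keeps exploring (for example epsilon-greedy on Q_1 + Q_2) could call this function once per step.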