Open HencyChen opened 6 years ago
^ Because there can be multiple V(s) and A(s,a) pairs that satisfy the advantage equation. For example, for any constant c:
Q(s,a) = V(s) + A(s,a) = (V(s)+c) + (A(s,a)-c)
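So, to learn a unique V and A, you subtract the mean of the advantages over actions, so the advantages average to 0 in each state and V(s) becomes the mean of Q(s,a). (If you subtract the max instead of the mean, the advantage of the optimal action is 0.)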
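
To make this concrete, here is a small sketch in plain Python. It is not the code from this repository: `dueling_q` and the numbers are made up for illustration. It shows that the naive sum V + A cannot tell a shifted (V, A) pair from the original, while the mean-subtracted form ties V to the mean of Q.

```python
def dueling_q(value, advantages):
    """Combine V(s) and A(s,a) into Q(s,a) = V + (A - mean(A)).

    Hypothetical helper for illustration only.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]


# An arbitrary (V, A) pair for one state with three actions.
v, adv = 1.0, [0.5, -0.5, 3.0]

# The same pair shifted by a constant c: V + c and A - c.
c = 2.0
v_shifted = v + c
adv_shifted = [a - c for a in adv]

# Naive sum: both pairs give identical Q values, so V and A are not unique.
naive = [v + a for a in adv]
naive_shifted = [v_shifted + a for a in adv_shifted]
print(naive == naive_shifted)

# Mean-subtracted form: the shift in A cancels, so the mean of Q equals V.
q = dueling_q(v, adv)
q_shifted = dueling_q(v_shifted, adv_shifted)
print(sum(q) / len(q) == v)
print(sum(q_shifted) / len(q_shifted) == v_shifted)
```

With the mean subtracted, a different V gives a different Q, so the network can no longer move a constant between the two streams without changing its output.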
Thanks for offering this wonderful code. But I have a question.