flrngel opened 6 years ago
https://arxiv.org/abs/1602.01783 aka A3C by Google
This paper introduces asynchronous one-step Q-learning, asynchronous n-step Q-learning, asynchronous Sarsa, and A3C; A3C performs best.
(image originally from openresearch.ai)
A3C is an on-policy method (in contrast, Q-learning is off-policy)
Loss = Policy Loss + 0.5 * Value Loss
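Spelled out (the entropy bonus H is the paper's exploration regularizer; the entropy weight \beta and the 0.5 value-loss weight above are common implementation choices):

Policy Loss = -\log \pi(a_t|s_t; \theta) (R_t - V(s_t; \theta_v)) - \beta H(\pi(\cdot|s_t; \theta))
Value Loss = (R_t - V(s_t; \theta_v))^2

where R_t = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n V(s_{t+n}; \theta_v) is the n-step return.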
\pi (x) typically has one softmax output for the policy on top of a convolutional network,
and one linear output for the value function V, with all non-output layers shared
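A minimal PyTorch sketch of that shared-trunk setup and the combined loss (assumptions: a small MLP trunk stands in for the paper's conv net, and the layer sizes, `num_actions`, and `beta` are illustrative, not from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, num_inputs=4, num_actions=2, hidden=128):
        super().__init__()
        # Non-output layers are shared between the two heads.
        self.trunk = nn.Sequential(nn.Linear(num_inputs, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, num_actions)  # softmax output for \pi
        self.value_head = nn.Linear(hidden, 1)             # linear output for V(s)

    def forward(self, x):
        h = self.trunk(x)
        return F.log_softmax(self.policy_head(h), dim=-1), self.value_head(h)

def a3c_loss(log_probs, values, actions, returns, beta=0.01):
    # Advantage A = R - V(s); detached so the policy term does not
    # backpropagate through the value head.
    advantage = returns - values.squeeze(-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(chosen * advantage.detach()).mean()
    value_loss = advantage.pow(2).mean()
    entropy = -(log_probs.exp() * log_probs).sum(dim=1).mean()
    # Loss = Policy Loss + 0.5 * Value Loss, minus the entropy bonus.
    return policy_loss + 0.5 * value_loss - beta * entropy
```

Each worker would compute this loss on its n-step rollout and apply the gradients asynchronously to the shared parameters, as in the paper.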