flrngel opened 6 years ago
https://arxiv.org/abs/1602.01783 aka A3C by Google
This paper introduces asynchronous one-step Q-learning, asynchronous n-step Q-learning, asynchronous Sarsa, and A3C; A3C performs best.
(image originally from openresearch.ai)
A3C is an on-policy method (in contrast, Q-learning is off-policy)
Loss = Policy Loss + 0.5 * Value Loss
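Spelled out (the entropy bonus H is the paper's exploration regularizer; the entropy weight \beta and the 0.5 value-loss weight above are common implementation choices):

Policy Loss = -\log \pi(a_t|s_t; \theta) (R_t - V(s_t; \theta_v)) - \beta H(\pi(\cdot|s_t; \theta))
Value Loss = (R_t - V(s_t; \theta_v))^2

where R_t = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n V(s_{t+n}; \theta_v) is the n-step return.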
\pi (x) typically has one softmax output for the policy on top of a convolutional network,
and one linear output for the value function V, with all non-output layers shared
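A minimal PyTorch sketch of that shared-trunk setup and the combined loss (assumptions: a small MLP trunk stands in for the paper's conv net, and the layer sizes, `num_actions`, and `beta` are illustrative, not from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, num_inputs=4, num_actions=2, hidden=128):
        super().__init__()
        # Non-output layers are shared between the two heads.
        self.trunk = nn.Sequential(nn.Linear(num_inputs, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, num_actions)  # softmax output for \pi
        self.value_head = nn.Linear(hidden, 1)             # linear output for V(s)

    def forward(self, x):
        h = self.trunk(x)
        return F.log_softmax(self.policy_head(h), dim=-1), self.value_head(h)

def a3c_loss(log_probs, values, actions, returns, beta=0.01):
    # Advantage A = R - V(s); detached so the policy term does not
    # backpropagate through the value head.
    advantage = returns - values.squeeze(-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(chosen * advantage.detach()).mean()
    value_loss = advantage.pow(2).mean()
    entropy = -(log_probs.exp() * log_probs).sum(dim=1).mean()
    # Loss = Policy Loss + 0.5 * Value Loss, minus the entropy bonus.
    return policy_loss + 0.5 * value_loss - beta * entropy
```

Each worker would compute this loss on its n-step rollout and apply the gradients asynchronously to the shared parameters, as in the paper.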