This is a Google DeepMind paper published at ICML 2018, with very few references.
Comment: The main benefit compared with A3C/A2C is that with many distributed machines (200+ CPUs), it achieves much better scalability. The massive distribution, compared with A3C, comes from V-trace and from the strict division of labor: the learner is responsible only for learning, and the actors are responsible only for trajectory generation. There is no parameter
Problem:
Innovation:
In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace, which was critical for achieving learning stability.
IMPALA (Figure 1) uses an actor-critic setup to learn a policy π and a baseline function Vπ. The process of generating experiences is decoupled from learning the parameters of π and Vπ. The architecture consists of a set of actors, repeatedly generating trajectories of experience, and one or more learners that use the experiences sent from actors to learn π off-policy.
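The actor/learner decoupling can be illustrated with a minimal single-machine sketch (all names here are illustrative, not the paper's actual implementation): actors push whole trajectories, tagged with the behaviour-policy version that generated them, onto a queue; the learner only consumes trajectories and would apply a V-trace-corrected update.

```python
import queue
import threading

TRAJ_LEN = 5

def actor(traj_queue, policy_version, n_trajectories):
    # Each actor repeatedly rolls out its local policy copy and sends the
    # full trajectory to the learner. The behaviour-policy version is kept
    # so the learner can apply the off-policy (V-trace) correction.
    for _ in range(n_trajectories):
        trajectory = {
            "obs": list(range(TRAJ_LEN)),         # placeholder observations
            "behaviour_version": policy_version,  # stale vs. learner's policy
        }
        traj_queue.put(trajectory)

def learner(traj_queue, n_expected):
    # The learner only learns: it consumes trajectories; a real learner
    # would batch them and apply a V-trace policy-gradient update here.
    consumed = []
    for _ in range(n_expected):
        consumed.append(traj_queue.get())
    return consumed

q = queue.Queue()
actors = [threading.Thread(target=actor, args=(q, v, 2)) for v in range(3)]
for t in actors:
    t.start()
batch = learner(q, n_expected=6)
for t in actors:
    t.join()
```

Because actors never block on gradient computation and the learner never blocks on environment steps, throughput scales by simply adding actor machines.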
Link: semanticscholar
Code: https://github.com/deepmind/scalable_agent
Define the n-steps V-trace target for V(x_s):
v_s = V(x_s) + Σ_{t=s}^{s+n-1} γ^{t-s} (∏_{i=s}^{t-1} c_i) δ_t V, where δ_t V = ρ_t (r_t + γ V(x_{t+1}) − V(x_t)),
with truncated importance sampling weights ρ_t = min(ρ̄, π(a_t|x_t)/μ(a_t|x_t)) and c_i = min(c̄, π(a_i|x_i)/μ(a_i|x_i)), for behaviour policy μ and target policy π.
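Using the recursion v_s = V(x_s) + δ_s V + γ c_s (v_{s+1} − V(x_{s+1})), the target can be computed in one backward pass. A minimal sketch for a single trajectory (function and argument names are my own; ρ̄ = c̄ = 1.0 is a common choice, not a fixed part of the definition):

```python
import numpy as np

def vtrace_targets(values, rewards, rhos, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets v_s for one trajectory.

    values:  V(x_s) for s = 0..n (length n+1; last entry bootstraps the tail)
    rewards: r_s for s = 0..n-1
    rhos:    importance ratios pi(a_s|x_s) / mu(a_s|x_s), length n
    """
    n = len(rewards)
    clipped_rhos = np.minimum(rho_bar, rhos)
    clipped_cs = np.minimum(c_bar, rhos)
    # delta_t V = rho_t * (r_t + gamma * V(x_{t+1}) - V(x_t))
    deltas = clipped_rhos * (rewards + gamma * values[1:] - values[:-1])
    # Backward recursion: acc_s = delta_s + gamma * c_s * acc_{s+1},
    # so that v_s = V(x_s) + acc_s.
    acc = 0.0
    targets = np.zeros(n)
    for s in reversed(range(n)):
        acc = deltas[s] + gamma * clipped_cs[s] * acc
        targets[s] = values[s] + acc
    return targets
```

Sanity check: on-policy (all ratios 1) with untruncated weights, v_s reduces to the ordinary n-step Bellman target; with ρ_t = 0 everywhere, v_s collapses back to V(x_s).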
Architecture: