This is a Google DeepMind paper published at ICML 2018, with very few references.
Comment: The main benefit compared with A3C/A2C is that with many distributed machines (200+ CPUs), it achieves much better scalability. The massive distribution, compared with A3C, comes from V-trace and from the strict division of labor: the learner is responsible only for learning, and the actors are responsible only for trajectory generation. There is no parameter
Problem:
Innovation:
In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace, which was critical for achieving learning stability.
IMPALA (Figure 1) uses an actor-critic setup to learn a policy π and a baseline function Vπ. The process of generating experiences is decoupled from learning the parameters of π and Vπ. The architecture consists of a set of actors, repeatedly generating trajectories of experience, and one or more learners that use the experiences sent from actors to learn π off-policy.
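The actor/learner decoupling can be illustrated with a minimal single-machine sketch (all names here are illustrative, not the paper's actual implementation): actors push whole trajectories, tagged with the behaviour-policy version that generated them, onto a queue; the learner only consumes trajectories and would apply a V-trace-corrected update.

```python
import queue
import threading

TRAJ_LEN = 5

def actor(traj_queue, policy_version, n_trajectories):
    # Each actor repeatedly rolls out its local policy copy and sends the
    # full trajectory to the learner. The behaviour-policy version is kept
    # so the learner can apply the off-policy (V-trace) correction.
    for _ in range(n_trajectories):
        trajectory = {
            "obs": list(range(TRAJ_LEN)),         # placeholder observations
            "behaviour_version": policy_version,  # stale vs. learner's policy
        }
        traj_queue.put(trajectory)

def learner(traj_queue, n_expected):
    # The learner only learns: it consumes trajectories; a real learner
    # would batch them and apply a V-trace policy-gradient update here.
    consumed = []
    for _ in range(n_expected):
        consumed.append(traj_queue.get())
    return consumed

q = queue.Queue()
actors = [threading.Thread(target=actor, args=(q, v, 2)) for v in range(3)]
for t in actors:
    t.start()
batch = learner(q, n_expected=6)
for t in actors:
    t.join()
```

Because actors never block on gradient computation and the learner never blocks on environment steps, throughput scales by simply adding actor machines.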
Link: semanticscholar
Code: https://github.com/deepmind/scalable_agent
Define the n-steps V-trace target for V(x_s):
v_s = V(x_s) + Σ_{t=s}^{s+n-1} γ^{t-s} (∏_{i=s}^{t-1} c_i) δ_t V, where δ_t V = ρ_t (r_t + γ V(x_{t+1}) − V(x_t)),
with truncated importance sampling weights ρ_t = min(ρ̄, π(a_t|x_t)/μ(a_t|x_t)) and c_i = min(c̄, π(a_i|x_i)/μ(a_i|x_i)), for behaviour policy μ and target policy π.
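Using the recursion v_s = V(x_s) + δ_s V + γ c_s (v_{s+1} − V(x_{s+1})), the target can be computed in one backward pass. A minimal sketch for a single trajectory (function and argument names are my own; ρ̄ = c̄ = 1.0 is a common choice, not a fixed part of the definition):

```python
import numpy as np

def vtrace_targets(values, rewards, rhos, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets v_s for one trajectory.

    values:  V(x_s) for s = 0..n (length n+1; last entry bootstraps the tail)
    rewards: r_s for s = 0..n-1
    rhos:    importance ratios pi(a_s|x_s) / mu(a_s|x_s), length n
    """
    n = len(rewards)
    clipped_rhos = np.minimum(rho_bar, rhos)
    clipped_cs = np.minimum(c_bar, rhos)
    # delta_t V = rho_t * (r_t + gamma * V(x_{t+1}) - V(x_t))
    deltas = clipped_rhos * (rewards + gamma * values[1:] - values[:-1])
    # Backward recursion: acc_s = delta_s + gamma * c_s * acc_{s+1},
    # so that v_s = V(x_s) + acc_s.
    acc = 0.0
    targets = np.zeros(n)
    for s in reversed(range(n)):
        acc = deltas[s] + gamma * clipped_cs[s] * acc
        targets[s] = values[s] + acc
    return targets
```

Sanity check: on-policy (all ratios 1) with untruncated weights, v_s reduces to the ordinary n-step Bellman target; with ρ_t = 0 everywhere, v_s collapses back to V(x_s).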
Architecture: