Reinforcement learning agents are often trained in simulators, with the caveat that policies learned in simulation frequently generalize poorly to the real world. Policy gradient methods directly train a policy network (a mapping from states to actions) by running the current policy in a simulator and computing gradients of the action log-probabilities, weighted by the reward-to-go.
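As a minimal sketch of that reward-to-go weighting (PyTorch assumed; the function names here are illustrative, not part of the original task):

```python
import torch

def reward_to_go(rewards, gamma=0.99):
    """R_t = sum_{k >= t} gamma^(k-t) * r_k: the discounted return from step t onward."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return torch.tensor(rtg)

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Negative of sum_t log pi(a_t | s_t) * R_t; minimizing it follows the policy gradient."""
    weights = reward_to_go(rewards, gamma)
    return -(torch.stack(log_probs) * weights).sum()
```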
To speed up training, one can run multiple simulators in parallel with the current policy, collect the resulting gradients, and update the model; this process is repeated for many iterations.
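One common way to run simulators in parallel is a vectorized environment that steps N copies in lock-step under a single policy. A minimal sketch, assuming the Gymnasium vector API and using random actions as a stand-in for the policy:

```python
import gymnasium as gym

N = 8  # number of parallel simulators
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(N)])

obs, info = envs.reset(seed=0)
for _ in range(100):
    actions = envs.action_space.sample()  # stand-in for policy(obs); one action per simulator
    obs, rewards, terminated, truncated, info = envs.step(actions)
envs.close()
```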
Task: Train a simple policy gradient method (REINFORCE with causal reward-to-go returns and a simple baseline) using N parallel simulators. See:
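A sketch of how the pieces could fit together, not a reference solution: the environment, network sizes, and hyperparameters are placeholders, the "simple baseline" is taken here to be normalizing the reward-to-go weights across the batch, and the N simulators are stepped sequentially in one process for clarity (a genuinely parallel version would use a vector env or worker processes, as sketched elsewhere in this document).

```python
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

def make_policy(obs_dim, n_actions):
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

def collect_trajectory(env, policy):
    """Run one episode with the current policy; return per-step log-probs and rewards."""
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated
    return log_probs, rewards

def reward_to_go(rewards, gamma=0.99):
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))

def train(n_envs=4, iterations=200, lr=1e-2):
    envs = [gym.make("CartPole-v1") for _ in range(n_envs)]
    obs_dim = envs[0].observation_space.shape[0]
    n_actions = envs[0].action_space.n
    policy = make_policy(obs_dim, n_actions)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for it in range(iterations):
        all_log_probs, all_weights, episode_returns = [], [], []
        for env in envs:  # one trajectory per simulator (sequential here for clarity)
            log_probs, rewards = collect_trajectory(env, policy)
            all_log_probs += log_probs
            all_weights += reward_to_go(rewards)
            episode_returns.append(sum(rewards))
        weights = torch.as_tensor(all_weights, dtype=torch.float32)
        weights = (weights - weights.mean()) / (weights.std() + 1e-8)  # simple baseline: normalize returns
        loss = -(torch.stack(all_log_probs) * weights).sum() / n_envs
        opt.zero_grad()
        loss.backward()
        opt.step()
        print(it, np.mean(episode_returns))
    return policy

if __name__ == "__main__":
    train()
```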
Make the following plot: x-axis: train time (wall-clock); y-axis: current reward (mean over the last K trajectories); one curve per N (number of parallel simulators).
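A sketch of the requested plot, assuming Matplotlib and assuming each training run has recorded (wall-clock time, mean reward over the last K trajectories) pairs; `results` below is a hypothetical container for those recordings, to be filled in by the training runs:

```python
import matplotlib.pyplot as plt

# Hypothetical: N -> (train_times, recent_rewards), where recent_rewards[i] is the
# mean return over the last K trajectories recorded at wall-clock time train_times[i].
results = {1: ([], []), 2: ([], []), 4: ([], []), 8: ([], [])}

for n_envs, (times, rewards) in results.items():
    plt.plot(times, rewards, label=f"N = {n_envs}")
plt.xlabel("train time (s)")
plt.ylabel("mean reward over last K trajectories")
plt.title("REINFORCE: reward vs. wall-clock time for N parallel simulators")
plt.legend()
plt.show()
```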
You can do this either synchronously or asynchronously.
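One reading of this: step the simulators one after another in the main process (synchronous) or concurrently in worker processes (asynchronous). With Gymnasium vector environments that is essentially a one-line switch, as sketched below; a fully asynchronous setup in the A3C sense, where workers compute and apply gradient updates independently, would additionally need multiprocessing around the optimizer step.

```python
import gymnasium as gym

def make_envs(n_envs, asynchronous=True):
    """Return N CartPole copies, stepped in worker processes (async) or in-process (sync)."""
    env_fns = [lambda: gym.make("CartPole-v1") for _ in range(n_envs)]
    return gym.vector.AsyncVectorEnv(env_fns) if asynchronous else gym.vector.SyncVectorEnv(env_fns)
```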