Reinforcement learning agents are often trained in simulators, with the caveat that policies learned in simulation frequently generalize poorly to the real world. Policy gradient methods directly train a policy network (a mapping from states to actions) by running the current policy in a simulator and computing gradients of the action log-probabilities, weighted by the reward-to-go.
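As a minimal sketch of that reward-to-go weighting (PyTorch assumed; the function names here are illustrative, not part of the original task):

```python
import torch

def reward_to_go(rewards, gamma=0.99):
    """R_t = sum_{k >= t} gamma^(k-t) * r_k: the discounted return from step t onward."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return torch.tensor(rtg)

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Negative of sum_t log pi(a_t | s_t) * R_t; minimizing it follows the policy gradient."""
    weights = reward_to_go(rewards, gamma)
    return -(torch.stack(log_probs) * weights).sum()
```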
To speed up training, one can run multiple simulators in parallel with the current policy, collect the resulting gradients, and update the model; this process is repeated for many iterations.
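One common way to run simulators in parallel is a vectorized environment that steps N copies in lock-step under a single policy. A minimal sketch, assuming the Gymnasium vector API and using random actions as a stand-in for the policy:

```python
import gymnasium as gym

N = 8  # number of parallel simulators
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(N)])

obs, info = envs.reset(seed=0)
for _ in range(100):
    actions = envs.action_space.sample()  # stand-in for policy(obs); one action per simulator
    obs, rewards, terminated, truncated, info = envs.step(actions)
envs.close()
```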
Task: Train a simple policy gradient method (REINFORCE with causal reward-to-go returns and a simple baseline) using N parallel simulators. See:
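A sketch of how the pieces could fit together, not a reference solution: the environment, network sizes, and hyperparameters are placeholders, the "simple baseline" is taken here to be normalizing the reward-to-go weights across the batch, and the N simulators are stepped sequentially in one process for clarity (a genuinely parallel version would use a vector env or worker processes, as sketched elsewhere in this document).

```python
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

def make_policy(obs_dim, n_actions):
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

def collect_trajectory(env, policy):
    """Run one episode with the current policy; return per-step log-probs and rewards."""
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated
    return log_probs, rewards

def reward_to_go(rewards, gamma=0.99):
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))

def train(n_envs=4, iterations=200, lr=1e-2):
    envs = [gym.make("CartPole-v1") for _ in range(n_envs)]
    obs_dim = envs[0].observation_space.shape[0]
    n_actions = envs[0].action_space.n
    policy = make_policy(obs_dim, n_actions)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for it in range(iterations):
        all_log_probs, all_weights, episode_returns = [], [], []
        for env in envs:  # one trajectory per simulator (sequential here for clarity)
            log_probs, rewards = collect_trajectory(env, policy)
            all_log_probs += log_probs
            all_weights += reward_to_go(rewards)
            episode_returns.append(sum(rewards))
        weights = torch.as_tensor(all_weights, dtype=torch.float32)
        weights = (weights - weights.mean()) / (weights.std() + 1e-8)  # simple baseline: normalize returns
        loss = -(torch.stack(all_log_probs) * weights).sum() / n_envs
        opt.zero_grad()
        loss.backward()
        opt.step()
        print(it, np.mean(episode_returns))
    return policy

if __name__ == "__main__":
    train()
```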
Make the following plot: x-axis: train time (wall-clock); y-axis: current reward (mean over the last K trajectories); one curve per N (number of parallel simulators).
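A sketch of the requested plot, assuming Matplotlib and assuming each training run has recorded (wall-clock time, mean reward over the last K trajectories) pairs; `results` below is a hypothetical container for those recordings, to be filled in by the training runs:

```python
import matplotlib.pyplot as plt

# Hypothetical: N -> (train_times, recent_rewards), where recent_rewards[i] is the
# mean return over the last K trajectories recorded at wall-clock time train_times[i].
results = {1: ([], []), 2: ([], []), 4: ([], []), 8: ([], [])}

for n_envs, (times, rewards) in results.items():
    plt.plot(times, rewards, label=f"N = {n_envs}")
plt.xlabel("train time (s)")
plt.ylabel("mean reward over last K trajectories")
plt.title("REINFORCE: reward vs. wall-clock time for N parallel simulators")
plt.legend()
plt.show()
```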
You can do this either synchronously or asynchronously.
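One reading of this: step the simulators one after another in the main process (synchronous) or concurrently in worker processes (asynchronous). With Gymnasium vector environments that is essentially a one-line switch, as sketched below; a fully asynchronous setup in the A3C sense, where workers compute and apply gradient updates independently, would additionally need multiprocessing around the optimizer step.

```python
import gymnasium as gym

def make_envs(n_envs, asynchronous=True):
    """Return N CartPole copies, stepped in worker processes (async) or in-process (sync)."""
    env_fns = [lambda: gym.make("CartPole-v1") for _ in range(n_envs)]
    return gym.vector.AsyncVectorEnv(env_fns) if asynchronous else gym.vector.SyncVectorEnv(env_fns)
```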