
Distributed Policy Gradient #8

Open TreeinRandomForest opened 3 years ago

TreeinRandomForest commented 3 years ago

Reinforcement learning agents are often trained in simulators, with the caveat that policies learned in simulation often generalize poorly to real life. Policy gradient methods directly train a policy network (mapping states -> actions) by running the current policy in a simulator and computing gradients of the log probabilities of the actions taken, weighted by the reward-to-go.
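
For concreteness, the estimator this describes is the causal REINFORCE gradient with a baseline $b$, over a batch of $M$ trajectories (my notation, not taken from the repo):

$$
\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=0}^{T_i - 1} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T_i - 1} r_{i,t'} - b \right)
$$

where trajectory $i$ comes from running $\pi_\theta$ in one of the simulators, and a simple choice of $b$ is the mean return across the batch.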

To speed up training, one can run multiple simulators in parallel under the current policy, collect the gradients, and update the model; this process is repeated for many iterations (see the sketch below).
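
A minimal sketch of the synchronous outer loop, assuming a hypothetical helper `collect_gradients(state_dict, seed)` that runs one simulator under the given policy snapshot and returns one gradient tensor per policy parameter (all names here are illustrative, not from the repo):

```python
import torch

# One synchronous iteration: every simulator works from the same policy
# snapshot; the learner averages the N gradient estimates and steps once.
# collect_gradients, policy, optimizer, N, num_iterations are hypothetical.
for iteration in range(num_iterations):
    snapshot = policy.state_dict()
    grads_per_worker = [collect_gradients(snapshot, seed=i) for i in range(N)]
    optimizer.zero_grad()
    for p, *gs in zip(policy.parameters(), *grads_per_worker):
        p.grad = torch.stack(gs).mean(dim=0)  # average over the N simulators
    optimizer.step()
```

In a real run the list comprehension is the part farmed out to N processes (e.g., via torch.multiprocessing or the RPC framework in the tutorial linked below).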

Task: Train a simple policy gradient agent (REINFORCE with causal reward-to-go and simple baselines) with N parallel simulators; a minimal single-process sketch follows the reference list. See:

  1. https://github.com/TreeinRandomForest/gccoptim/blob/master/policygradients/cartpole.py
  2. https://pytorch.org/tutorials/intermediate/rpc_async_execution.html
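
A self-contained single-process sketch of the REINFORCE core (causal reward-to-go, mean-return baseline), assuming the Gymnasium-style reset/step API rather than classic gym; it illustrates the loss referenced in the cartpole.py link but is not that code:

```python
import torch
import torch.nn as nn
import gymnasium as gym  # assumption: Gymnasium API; classic gym's reset/step differ

def reward_to_go(rewards):
    # Causal credit assignment: the log-prob at step t is weighted by the
    # sum of rewards from t to the end of the trajectory.
    rtg = torch.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

def reinforce_loss(policy, env, n_traj=10):
    """Surrogate loss over n_traj trajectories with a mean-return baseline."""
    per_traj, returns = [], []
    for _ in range(n_traj):
        obs, _ = env.reset()
        log_probs, rewards, done = [], [], False
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, r, terminated, truncated, _ = env.step(action.item())
            rewards.append(float(r))
            done = terminated or truncated
        per_traj.append((torch.stack(log_probs), reward_to_go(rewards)))
        returns.append(sum(rewards))
    baseline = sum(returns) / len(returns)  # any constant baseline is unbiased
    # -E[log pi(a_t|s_t) * (reward-to-go_t - baseline)]
    return torch.stack([-(lp * (rtg - baseline)).sum() for lp, rtg in per_traj]).mean()

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
env = gym.make("CartPole-v1")
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
loss = reinforce_loss(policy, env)
loss.backward()
optimizer.step()
```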

Make the following plot (a logging/plotting sketch follows):

  - x-axis: train time
  - y-axis: current reward (mean over the last K trajectories)
  - one curve per N = number of parallel simulators
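
One way to produce that plot, assuming a hypothetical `train_one_iteration(policy, n_sim)` that runs one parallel update and returns the episode returns it collected (again, illustrative names):

```python
import time
import collections
import matplotlib.pyplot as plt

K = 20
curves = {}  # n_sim -> (wall-clock times, running rewards)
for n_sim in [1, 2, 4, 8]:
    # re-initialize policy here for a fair comparison across N
    recent = collections.deque(maxlen=K)  # returns of the last K trajectories
    times, rewards = [], []
    start = time.time()
    for _ in range(num_iterations):  # num_iterations: hypothetical budget
        recent.extend(train_one_iteration(policy, n_sim))  # hypothetical helper
        times.append(time.time() - start)
        rewards.append(sum(recent) / len(recent))
    curves[n_sim] = (times, rewards)

for n_sim, (t, r) in curves.items():
    plt.plot(t, r, label=f"N = {n_sim} simulators")
plt.xlabel("train time (s)")
plt.ylabel(f"mean reward over last K={K} trajectories")
plt.legend()
plt.show()
```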

You can do this synchronously as well as asynchronously; an asynchronous sketch is included below.
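
A sketch of one asynchronous variant using the standard library instead of torch.distributed.rpc (the tutorial linked above shows the RPC route): apply each worker's gradients as they arrive rather than waiting for all N. `collect_gradients` is the same hypothetical helper as in the synchronous sketch.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

# Asynchronous variant: step as soon as any worker finishes. Gradients may be
# computed from a slightly stale policy snapshot; that staleness is the usual
# trade-off for removing the synchronization barrier.
with ProcessPoolExecutor(max_workers=N) as pool:
    futures = [pool.submit(collect_gradients, policy.state_dict(), seed=i)
               for i in range(N)]
    for fut in as_completed(futures):
        for p, g in zip(policy.parameters(), fut.result()):
            p.grad = g
        optimizer.step()
        optimizer.zero_grad()
```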