dfridovi / rl

A homebrewed C++ library for reinforcement learning.
https://dfridovi.github.io/rl

Implement GP Q-learning #6

Open dfridovi opened 7 years ago

dfridovi commented 7 years ago

The basic idea is to represent the joint state-action value function as a Gaussian process. The optimal policy can be approximated with a few steps of gradient ascent on the action subspace (equivalently, gradient descent on the negated value), holding the state fixed.
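A minimal sketch of that idea, under several assumptions that are not in this issue: scalar state and action, a squared-exponential kernel, and GP weights `alpha` (the entries of K^{-1} y) that have already been computed elsewhere. The names `TrainingPoint`, `QValue`, `QActionGradient`, and `GreedyAction` are hypothetical and are not the library's API. The gradient steps move the action in the direction that increases Q while the state stays fixed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical GP training point: a (state, action) pair plus its weight
// alpha_i, i.e. the i-th entry of K^{-1} y (assumed precomputed elsewhere).
struct TrainingPoint {
  double state;
  double action;
  double alpha;
};

// Squared-exponential kernel on the joint (state, action) space.
double Kernel(double s1, double a1, double s2, double a2, double length) {
  const double ds = s1 - s2;
  const double da = a1 - a2;
  return std::exp(-0.5 * (ds * ds + da * da) / (length * length));
}

// GP posterior mean: Q(s, a) = sum_i alpha_i * k((s, a), (s_i, a_i)).
double QValue(const std::vector<TrainingPoint>& points, double s, double a,
              double length) {
  double q = 0.0;
  for (const auto& p : points)
    q += p.alpha * Kernel(s, a, p.state, p.action, length);
  return q;
}

// Partial derivative of the posterior mean with respect to the action.
double QActionGradient(const std::vector<TrainingPoint>& points, double s,
                       double a, double length) {
  double grad = 0.0;
  for (const auto& p : points) {
    grad += p.alpha * Kernel(s, a, p.state, p.action, length) *
            (p.action - a) / (length * length);
  }
  return grad;
}

// A few fixed-size gradient steps on the action only; the state is held fixed.
double GreedyAction(const std::vector<TrainingPoint>& points, double s,
                    double a_init, double length, double step_size,
                    std::size_t num_steps) {
  assert(length > 0.0);
  double a = a_init;
  for (std::size_t i = 0; i < num_steps; ++i)
    a += step_size * QActionGradient(points, s, a, length);
  return a;
}
```

With a single training point the posterior mean has one bump, so the iteration walks toward that point's action. With many points the result is only a local maximum and depends on `a_init`, which is why the policy is described as approximate.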

A few ideas here:

Some extensions:

  1. Could improve accuracy by adding training points to the GP, for example if a radius search has fewer than N neighbors.
  2. Could update only some training means, e.g. those returned by a radius search around a single query point (i.e. batch size 1 only).
  3. Exploration could be done by, e.g., upper confidence bounding in the (sub)optimal policy computation.
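
Extension 1 could look roughly like the following. This is a sketch only: it uses a brute-force radius search over scalar (state, action) pairs, and `Sample`, `CountNeighbors`, and `MaybeAddPoint` are hypothetical names rather than anything in the library. A real implementation would probably use a spatial index and would also have to update the GP covariance after adding a point.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical (state, action) sample; both are scalars to keep this short.
struct Sample {
  double state;
  double action;
};

// Brute-force radius search: number of stored samples within `radius`
// (Euclidean distance in the joint space) of the query.
std::size_t CountNeighbors(const std::vector<Sample>& samples,
                           const Sample& query, double radius) {
  std::size_t count = 0;
  for (const auto& p : samples) {
    if (std::hypot(p.state - query.state, p.action - query.action) <= radius)
      ++count;
  }
  return count;
}

// Adds the query as a new training point only if it has fewer than
// `min_neighbors` neighbors. Returns true if the point was added.
bool MaybeAddPoint(std::vector<Sample>& samples, const Sample& query,
                   double radius, std::size_t min_neighbors) {
  assert(radius > 0.0);
  if (CountNeighbors(samples, query, radius) >= min_neighbors) return false;
  samples.push_back(query);
  return true;
}
```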

dfridovi commented 7 years ago

First few bullets are done. It still diverges, though, even with a finite action space.

dfridovi commented 7 years ago

Seems to work pretty well now. Added fixed Q targets and that fixed it. Would be cool to try on a harder environment.
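
For reference, a sketch of what fixed Q targets means, written with a tabular Q-function instead of a GP to keep it short (that substitution is mine, and `FixedTargetQ` is a hypothetical class, not the code in this repo). The TD target is computed from a frozen copy of Q, and the copy is refreshed only every `sync_period` updates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tabular Q-learning with fixed targets (hypothetical, for illustration).
class FixedTargetQ {
 public:
  FixedTargetQ(std::size_t num_states, std::size_t num_actions,
               std::size_t sync_period)
      : online_(num_states, std::vector<double>(num_actions, 0.0)),
        target_(online_),
        sync_period_(sync_period),
        num_updates_(0) {
    assert(num_actions > 0 && sync_period > 0);
  }

  // One Q-learning update. The bootstrap term uses the frozen target table,
  // not the table that is being updated.
  void Update(std::size_t s, std::size_t a, double reward, std::size_t s_next,
              double discount, double learning_rate) {
    const std::vector<double>& next = target_[s_next];
    const double best_next = *std::max_element(next.begin(), next.end());
    const double td_target = reward + discount * best_next;
    online_[s][a] += learning_rate * (td_target - online_[s][a]);

    // Refresh the frozen copy every `sync_period_` updates.
    if (++num_updates_ % sync_period_ == 0) target_ = online_;
  }

  double Online(std::size_t s, std::size_t a) const { return online_[s][a]; }
  double Target(std::size_t s, std::size_t a) const { return target_[s][a]; }

 private:
  std::vector<std::vector<double>> online_;  // Updated on every step.
  std::vector<std::vector<double>> target_;  // Frozen between syncs.
  std::size_t sync_period_;
  std::size_t num_updates_;
};
```

Because the target does not move between syncs, each batch of updates regresses toward a stationary quantity, which is the usual explanation for why this damps divergence.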

dfridovi commented 7 years ago

Another idea (sort of like Bayesian Optimization):

  1. Random initialization with a small number of points
  2. Every epoch add one or more points at maxima of the upper confidence bound, perhaps while fixing state to a few different values
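
Step 2 could be sketched as below. Assumptions: the GP posterior mean and standard deviation are available as callables (passed in here as `std::function`, since the actual GP interface is not shown in this issue), the maximization is a search over a finite list of candidate actions rather than a continuous optimization, and `beta` is a hypothetical exploration weight. `QueryPoint`, `MaxUcbAction`, and `SelectNewPoints` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical (state, action) pair to be added to the GP training set.
struct QueryPoint {
  double state;
  double action;
};

// Posterior statistic as a function of (state, action). Assumed interface.
using Posterior = std::function<double(double, double)>;

// For a fixed state, returns the candidate action that maximizes the upper
// confidence bound mean + beta * stddev.
QueryPoint MaxUcbAction(const Posterior& mean, const Posterior& stddev,
                        double state,
                        const std::vector<double>& candidate_actions,
                        double beta) {
  assert(!candidate_actions.empty());
  QueryPoint best{state, candidate_actions.front()};
  double best_ucb =
      mean(state, best.action) + beta * stddev(state, best.action);
  for (double a : candidate_actions) {
    const double ucb = mean(state, a) + beta * stddev(state, a);
    if (ucb > best_ucb) {
      best_ucb = ucb;
      best.action = a;
    }
  }
  return best;
}

// One epoch of point selection: one UCB maximizer per fixed state.
std::vector<QueryPoint> SelectNewPoints(
    const Posterior& mean, const Posterior& stddev,
    const std::vector<double>& states,
    const std::vector<double>& candidate_actions, double beta) {
  std::vector<QueryPoint> selected;
  for (double s : states)
    selected.push_back(
        MaxUcbAction(mean, stddev, s, candidate_actions, beta));
  return selected;
}
```

With `beta = 0` this reduces to picking the greedy action at each fixed state; larger `beta` pulls the new points toward regions where the GP is uncertain.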