Open dfridovi opened 7 years ago
First few bullets done. Still diverges, though, even with a finite action space.
Seems to work pretty well now. Adding fixed Q-targets fixed the divergence. Would be cool to try it on a harder environment.
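For reference, a minimal tabular sketch of the fixed Q-target idea (the chain MDP, hyperparameters, and exploration schedule below are all hypothetical, just to make the mechanism concrete): TD targets bootstrap from a frozen copy of Q that is only synced every `sync_every` updates, rather than from the live table.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
gamma, alpha, sync_every, eps = 0.9, 0.1, 50, 0.5

Q = np.zeros((n_states, n_actions))
Q_target = Q.copy()  # frozen copy used only for bootstrap targets

def step(s, a):
    # Hypothetical chain MDP: action 1 moves right, action 0 resets.
    s2 = min(s + 1, n_states - 1) if a == 1 else 0
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r

s = 0
for t in range(2000):
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s2, r = step(s, a)
    # The TD target uses the frozen copy, not the live Q.
    td_target = r + gamma * Q_target[s2].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    if t % sync_every == 0:
        Q_target = Q.copy()  # periodic sync: "fixed Q targets"
    s = s2

print(Q.round(2))
```

The only difference from vanilla Q-learning is the `Q_target` copy; holding it fixed between syncs keeps the bootstrap target from chasing its own updates, which is what tends to cause the divergence.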
Another idea (sort of like Bayesian Optimization):
The basic idea is to represent the joint state-action value function as a Gaussian process. The optimal action (and hence the policy) can then be approximated with a few steps of gradient ascent on the GP posterior mean over the action subspace, holding the state fixed.
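A small sketch of that idea, under stated assumptions: an RBF kernel, fake experience data whose true Q peaks at action 0.3 regardless of state, and a 1-D state and action so the GP posterior mean and its action-gradient have closed forms.

```python
import numpy as np

rng = np.random.default_rng(1)
ell, noise = 0.5, 1e-2  # kernel length-scale, observation-noise jitter

def rbf(X, Y):
    # Squared-exponential kernel on (state, action) pairs.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

# Hypothetical experience: x = (state, action), y = observed return.
X = rng.uniform(-1, 1, size=(40, 2))
y = -(X[:, 1] - 0.3) ** 2 + 0.1 * rng.standard_normal(40)

K = rbf(X, X) + noise * np.eye(len(X))
coef = np.linalg.solve(K, y)  # K^{-1} y, reused for mean and gradient

def grad_action(x):
    # d/da of the posterior mean sum_i coef_i * k(x, x_i), RBF kernel:
    # dk/da = (a_i - a) / ell^2 * k.
    k = rbf(x[None, :], X)[0]
    return ((X[:, 1] - x[1]) / ell**2 * k) @ coef

s, a = 0.2, -0.8  # fixed state, deliberately poor initial action
for _ in range(200):
    a += 0.05 * grad_action(np.array([s, a]))  # gradient ascent step

print(round(a, 2))  # should land near the peak around a = 0.3
```

Note this ascends the posterior mean only; a Bayesian-optimization flavor would ascend an acquisition function (e.g. mean plus a multiple of the posterior standard deviation) to trade off exploration against exploitation.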
A few ideas here:
Some extensions: