Open dfridovi opened 7 years ago
First few bullets done. Still diverges, though, even with a finite action space.
Seems to work pretty well now. Adding fixed Q-targets fixed the divergence. Would be cool to try it on a harder environment.
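For reference, a minimal tabular sketch of the fixed Q-target idea (the chain MDP, hyperparameters, and exploration schedule below are all hypothetical, just to make the mechanism concrete): TD targets bootstrap from a frozen copy of Q that is only synced every `sync_every` updates, rather than from the live table.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
gamma, alpha, sync_every, eps = 0.9, 0.1, 50, 0.5

Q = np.zeros((n_states, n_actions))
Q_target = Q.copy()  # frozen copy used only for bootstrap targets

def step(s, a):
    # Hypothetical chain MDP: action 1 moves right, action 0 resets.
    s2 = min(s + 1, n_states - 1) if a == 1 else 0
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r

s = 0
for t in range(2000):
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s2, r = step(s, a)
    # The TD target uses the frozen copy, not the live Q.
    td_target = r + gamma * Q_target[s2].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    if t % sync_every == 0:
        Q_target = Q.copy()  # periodic sync: "fixed Q targets"
    s = s2

print(Q.round(2))
```

The only difference from vanilla Q-learning is the `Q_target` copy; holding it fixed between syncs keeps the bootstrap target from chasing its own updates, which is what tends to cause the divergence.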
Another idea (sort of like Bayesian Optimization):
The basic idea is to represent the joint state-action value function as a Gaussian process. The optimal action (and hence the policy) can then be approximated with a few steps of gradient ascent on the GP posterior mean over the action subspace, holding the state fixed.
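A small sketch of that idea, under stated assumptions: an RBF kernel, fake experience data whose true Q peaks at action 0.3 regardless of state, and a 1-D state and action so the GP posterior mean and its action-gradient have closed forms.

```python
import numpy as np

rng = np.random.default_rng(1)
ell, noise = 0.5, 1e-2  # kernel length-scale, observation-noise jitter

def rbf(X, Y):
    # Squared-exponential kernel on (state, action) pairs.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

# Hypothetical experience: x = (state, action), y = observed return.
X = rng.uniform(-1, 1, size=(40, 2))
y = -(X[:, 1] - 0.3) ** 2 + 0.1 * rng.standard_normal(40)

K = rbf(X, X) + noise * np.eye(len(X))
coef = np.linalg.solve(K, y)  # K^{-1} y, reused for mean and gradient

def grad_action(x):
    # d/da of the posterior mean sum_i coef_i * k(x, x_i), RBF kernel:
    # dk/da = (a_i - a) / ell^2 * k.
    k = rbf(x[None, :], X)[0]
    return ((X[:, 1] - x[1]) / ell**2 * k) @ coef

s, a = 0.2, -0.8  # fixed state, deliberately poor initial action
for _ in range(200):
    a += 0.05 * grad_action(np.array([s, a]))  # gradient ascent step

print(round(a, 2))  # should land near the peak around a = 0.3
```

Note this ascends the posterior mean only; a Bayesian-optimization flavor would ascend an acquisition function (e.g. mean plus a multiple of the posterior standard deviation) to trade off exploration against exploitation.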
A few ideas here:
Some extensions: