policy network

I want the rollout to provide a better estimate, so that we don't have to expand the tree as deeply. One way to do this is to use a convnet to provide a policy, as in AlphaGo.
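As a rough sketch of what a policy-guided rollout could look like: instead of picking moves uniformly at random, each move is sampled in proportion to the weight the policy assigns it. Everything named here is hypothetical (`policy_rollout`, the callback-style game interface, and the toy Nim game used to exercise it); the hand-written `nim_policy` stands in for the convnet and is not this project's actual code.

```python
import random


def policy_rollout(state, policy, legal_moves, apply_move, is_terminal, outcome, rng=None):
    """Play one game to the end, sampling each move from `policy`.

    `policy(state, move)` returns a non-negative weight; in the real system
    this would come from the convnet (assumption).
    """
    rng = rng or random.Random()
    while not is_terminal(state):
        moves = legal_moves(state)
        weights = [policy(state, m) for m in moves]
        if sum(weights) <= 0:
            # The policy has no preference here: fall back to a uniform rollout.
            weights = [1.0] * len(moves)
        move = rng.choices(moves, weights=weights, k=1)[0]
        state = apply_move(state, move)
    return outcome(state)


# Toy game for illustration: Nim, take 1 or 2 stones, taking the last stone wins.
# A state is (stones_left, player_to_move).
def nim_legal_moves(state):
    stones, _ = state
    return [n for n in (1, 2) if n <= stones]


def nim_apply(state, move):
    stones, player = state
    return (stones - move, 1 - player)


def nim_terminal(state):
    return state[0] == 0


def nim_outcome(state):
    # The winner is the player who just moved, i.e. not the player to move.
    return 1 - state[1]


def nim_policy(state, move):
    # Stand-in for the network: prefer moves that leave a multiple of 3.
    return 1.0 if (state[0] - move) % 3 == 0 else 0.0


winner = policy_rollout((4, 0), nim_policy, nim_legal_moves, nim_apply,
                        nim_terminal, nim_outcome, rng=random.Random(0))
```

The point of the callback interface is only to keep the sketch independent of the game; in practice the rollout would call the game and the network directly.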

Evaluating the convnet at every rollout step is potentially slow, so we may also want to train a smaller net to approximate the larger one and use that at rollout time.
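One way to get the smaller net would be to train it on the larger net's output distribution (soft targets) rather than on game outcomes. The sketch below shows that idea in miniature, using only the standard library: `teacher_policy` is a made-up fixed function standing in for the large convnet, and the "student" is a linear softmax model over two hypothetical features. None of these names or sizes come from the project.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_policy(x):
    # Stand-in for the large, slow network (assumption): a fixed nonlinear policy
    # over 3 actions given 2 input features.
    return softmax([3 * math.tanh(x[0]), 3 * math.tanh(x[1]), 3 * math.tanh(-x[0] - x[1])])


def student_policy(weights, x):
    # Small, fast model: one linear layer followed by a softmax.
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def distill(steps=3000, lr=0.2, seed=0):
    """Fit the student to the teacher's probabilities with SGD on cross-entropy."""
    rng = random.Random(seed)
    weights = [[0.0, 0.0] for _ in range(3)]
    for _ in range(steps):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        target = teacher_policy(x)
        pred = student_policy(weights, x)
        for a in range(3):
            # Gradient of cross-entropy w.r.t. logit a is (pred - target).
            grad_logit = pred[a] - target[a]
            for j in range(2):
                weights[a][j] -= lr * grad_logit * x[j]
    return weights


def mean_kl(weights, n=500, seed=1):
    """Average KL(teacher || student) on fresh random inputs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        p, q = teacher_policy(x), student_policy(weights, x)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return total / n


untrained = [[0.0, 0.0] for _ in range(3)]
student = distill()
kl_before, kl_after = mean_kl(untrained), mean_kl(student)
```

Comparing `kl_before` with `kl_after` gives a quick check of how closely the small model tracks the large one; for the real nets the inputs would be board positions sampled from play.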

We can train the net using fictitious self-play (cf. a couple of Silver papers): play against an average of past strategies. Eventually this converges to some kind of equilibrium (possibly even the Nash equilibrium, which is what we want; I need to read the papers again).
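To make "play against an average of past strategies" concrete, here is classical fictitious play on rock-paper-scissors. This is the textbook matrix-game version, not the network-based method from the papers, and all names are made up for the example. Each player best-responds to the empirical frequencies of the opponent's past actions; for two-player zero-sum matrix games like this one, those frequencies are known to approach a Nash equilibrium (uniform play for rock-paper-scissors), though convergence can be slow.

```python
# PAYOFF[a][b] is the payoff to a player choosing a against an opponent choosing b.
# Actions: 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def best_response(opponent_counts):
    """Action with the highest total payoff against the opponent's past play."""
    values = [
        sum(PAYOFF[a][b] * count for b, count in enumerate(opponent_counts))
        for a in range(3)
    ]
    return max(range(3), key=values.__getitem__)


def fictitious_play(iterations=20000):
    # counts[i][a] = how many times player i has played action a so far.
    # The starting counts are an arbitrary initial belief.
    counts = [[1, 0, 0], [0, 1, 0]]
    for _ in range(iterations):
        a0 = best_response(counts[1])
        a1 = best_response(counts[0])
        counts[0][a0] += 1
        counts[1][a1] += 1
    # Return each player's average (empirical) strategy.
    return [[c / sum(row) for c in row] for row in counts]


average_strategies = fictitious_play()
```

Note that it is the average strategy that settles down; the best responses themselves keep cycling. Whether the same guarantee carries over to the net trained by self-play is exactly what needs checking in the papers.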

PFCM / 482-project

policy network #13