PFCM closed this issue 8 years ago
Also note that the tree policy does not currently reflect the fact that rewards are inverted at each step of the backup. I think this is the right behaviour: UCB can still take the max at each stage, and the search should eventually converge to some kind of minimax.
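A rough sketch of what I mean, in case it helps pin the idea down. This is not the code in the repo: `Node`, `backup` and `ucb_select` are hypothetical names, and it assumes a two-player zero-sum game where each node stores its reward from the point of view of the player who moved into it. The sign flip lives entirely in the backup, so the selection step is a plain max at every depth.

```python
import math


class Node:
    """Hypothetical tree node; not taken from the actual codebase."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.visits = 0
        # Sum of rewards, seen by the player who moved INTO this node.
        self.total_reward = 0.0


def backup(node, reward):
    """Propagate a reward up to the root, negating it at each level.

    `reward` is from the perspective of the player who moved into `node`.
    The parent belongs to the opponent, so the sign flips on every step.
    """
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        reward = -reward
        node = node.parent


def ucb_select(node, c=1.4):
    """Pick the child with the highest UCB1 score.

    No min/max switching is needed here: each child's statistics are
    already stored from the perspective of the player choosing among them.
    """
    log_n = math.log(node.visits)

    def score(child):
        if child.visits == 0:
            return math.inf  # always try unvisited children first
        mean = child.total_reward / child.visits
        return mean + c * math.sqrt(log_n / child.visits)

    return max(node.children, key=score)
```

If the per-node perspective is wrong anywhere (for example, the reward passed to `backup` is from the other player's point of view), the max in `ucb_select` ends up favouring moves that are good for the opponent, so that is the first thing worth checking.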
Actually, I'm starting to think that it might be in the reward calculation, and hence specific to Hex. Need to double-check by playing some Go.
Figure out when to stop using the tree policy and start the rollout.
Silver's 2011 paper does a rollout every time a node is added (and then just keeps searching). This seems like it should be better than what happens at the moment, since the estimates will keep getting more accurate (rollouts start closer to the bottom of the tree), and it may indeed be essential for convergence.
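Here is a minimal sketch of that scheme as I understand it, not a claim about what the paper's code does. The tree policy runs until it hits a node that still has untried moves, one child is added, and the rollout starts from that new child. To keep it self-contained it uses a toy subtraction game (take 1 or 2 stones, taking the last stone wins) in place of Hex or Go; every name here (`Node`, `tree_policy`, `expand`, `rollout`, `iterate`) is made up for illustration.

```python
import math
import random


class Node:
    """Toy-game node: `stones` left, with the opponent of the mover to play."""

    def __init__(self, stones, parent=None):
        self.stones = stones
        self.parent = parent
        self.children = []
        self.untried = [m for m in (1, 2) if m <= stones]
        self.visits = 0
        self.total_reward = 0.0  # for the player who moved into this node


def ucb(child, log_n, c):
    return child.total_reward / child.visits + c * math.sqrt(log_n / child.visits)


def tree_policy(node, c=1.4):
    """Descend by UCB while the node is fully expanded and not terminal."""
    while not node.untried and node.children:
        log_n = math.log(node.visits)
        node = max(node.children, key=lambda ch: ucb(ch, log_n, c))
    return node


def expand(node, rng):
    """Add exactly one new child for a randomly chosen untried move."""
    move = node.untried.pop(rng.randrange(len(node.untried)))
    child = Node(node.stones - move, parent=node)
    node.children.append(child)
    return child


def rollout(stones, rng):
    """Random playout; reward is for the player who just moved into this state."""
    reward = 1.0  # with 0 stones left, the previous mover took the last one
    while stones > 0:
        stones -= rng.choice([m for m in (1, 2) if m <= stones])
        reward = -reward  # the other player has now made the latest move
    return reward


def iterate(root, rng):
    """One search iteration: select, expand one node, roll out, back up."""
    node = tree_policy(root)
    if node.untried:  # stop the tree policy here and add a single node
        node = expand(node, rng)
    reward = rollout(node.stones, rng)
    while node is not None:  # backup with the sign flipped at each level
        node.visits += 1
        node.total_reward += reward
        reward = -reward
        node = node.parent
```

Usage is just `root = Node(4); rng = random.Random(0)` followed by repeated calls to `iterate(root, rng)`. The tree grows by at most one node per iteration, so later rollouts start deeper and get shorter, which is the "closer to the bottom" effect.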