andyljones / boardlaw

Scaling scaling laws with board games.
https://andyljones.com/boardlaw

MCTS upgrades #7

Closed (andyljones closed this 3 years ago)

andyljones commented 3 years ago

Refreshed MCTS

Many of the proposed upgrades below depend on having 'fresh' logits when you feed a batch through the learner. Intuitively, this kind of refresh should help AZ independently of any of the other upgrades. But my own casual experiment with it, *portly-fuel*, was a disaster. Why?
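For concreteness, a minimal sketch of what refreshing might look like in the learner. The `net`/`buffer` interfaces and field names here are hypothetical stand-ins, not the actual boardlaw API; the full 'refreshed MCTS' version would re-run the search itself with the current weights rather than just the forward pass.

```python
import torch

def refreshed_batch(net, buffer, batch_size=1024):
    # Hypothetical sketch: `buffer.sample` and the net's output fields
    # are assumptions, not the real boardlaw interfaces.
    batch = buffer.sample(batch_size)
    with torch.no_grad():
        fresh = net(batch['obs'])                # forward pass with the current weights
    batch['fresh_logits'] = fresh.logits         # logits under the current net
    batch['stale_logits'] = batch['logits']      # logits the actor saw at collection time
    # A full refresh would instead re-run the search here, something like
    # mcts(net, batch['obs'], n_sims), to get fresh visit counts as well.
    return batch
```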

Reduced behaviour rollouts

If I do end up doing a full MCTS run for each batch, can I get away with far fewer sims in the actor? Is lots of low-quality experience better than a little high-quality experience? Seems plausible.

This is really attractive because experience collection dominates the runtime. Being able to cut actor sims down from 64 to 16 would immediately quarter the runtime.
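Back-of-the-envelope version of that claim, with illustrative numbers only:

```python
# If collection time scales with actor sims per move, then for a fixed budget:
actor_sims_old, actor_sims_new = 64, 16
speedup = actor_sims_old / actor_sims_new   # 4x less collection time per game,
games_factor = speedup                      # or 4x the games for the same budget
print(f'{speedup:.0f}x faster collection, {games_factor:.0f}x more (noisier) experience')
```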

Importance sampling

In vanilla reinforcement learning, importance sampling techniques have seen a lot of success. In particular, PPO applies importance sampling to the policy while IMPALA/v-trace applies it to the value, both to adjust for stale samples. Can I adapt these to MCTS?

This requires refreshed MCTS logits.
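For reference, a minimal sketch of the two correction styles mentioned above, assuming the learner batch carries both the stale behaviour logits and the refreshed ones. How the MCTS-improved policy slots into these corrections is exactly the open question, and none of these names are the actual boardlaw code.

```python
import torch
import torch.nn.functional as F

def ppo_style_policy_loss(fresh_logits, stale_logits, actions, advantages, eps=0.2):
    # PPO-style clipped importance ratio between the current and behaviour policies.
    logp_new = F.log_softmax(fresh_logits, -1).gather(-1, actions[..., None]).squeeze(-1)
    logp_old = F.log_softmax(stale_logits, -1).gather(-1, actions[..., None]).squeeze(-1)
    ratio = (logp_new - logp_old).exp()
    clipped = ratio.clamp(1 - eps, 1 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

def vtrace_style_weights(fresh_logits, stale_logits, actions, rho_bar=1.0):
    # V-trace-style truncated importance weights, for correcting stale value targets.
    logp_new = F.log_softmax(fresh_logits, -1).gather(-1, actions[..., None]).squeeze(-1)
    logp_old = F.log_softmax(stale_logits, -1).gather(-1, actions[..., None]).squeeze(-1)
    return (logp_new - logp_old).exp().clamp(max=rho_bar)
```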

Value bootstrapping

It's not clear to me why the MCTS should 'boost' the policy but not the value.

Right now values are learned from the played games. It feels like a waste to have done 64x as many simulations as real steps, and not use those sims for value learning at all.
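One possible shape for this, sketched with made-up names and a made-up blend: target a mix of the final game outcome and the search's own root value estimate, so the sims contribute something to value learning. The `mix` parameter and the blend itself are assumptions, not anything the repo implements.

```python
import torch
import torch.nn.functional as F

def value_loss(v_pred, game_outcome, mcts_root_value, mix=0.5):
    # Blend the Monte-Carlo outcome with the search's bootstrapped root value.
    target = mix * game_outcome + (1 - mix) * mcts_root_value
    return F.mse_loss(v_pred, target.detach())
```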

Noise

Right now exploration is done by the Dirichlet noise injected at the root of the MCTS. One reason Dirichlet noise is necessary is that in vanilla AZ, you need to visit a subtree a lot before any shiny new values you find deep down can influence the root. In the regularized MCTS I'm using, though, that's a lot less important. So can I get away with a less-weird noise injection scheme?

The first step would be to replace the Dirichlet noise with uniform noise. The second would be to replace the noise injection entirely with an entropy bonus, like every other on-policy RL scheme in existence.

One disadvantage of the entropy bonus is that it forces the net to learn a noisy distribution. Is that something we want?
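A sketch of the options side by side, using the usual AZ defaults (ε=0.25, α=0.3) as placeholder hyperparameters rather than anything boardlaw-specific:

```python
import torch
import torch.distributions as D

def noisy_root_priors(priors, eps=0.25, alpha=0.3, kind='uniform'):
    # Mix noise into the root priors: either the standard AZ Dirichlet noise,
    # or a flat uniform over actions (legal-move masking omitted for brevity).
    if kind == 'dirichlet':
        noise = D.Dirichlet(torch.full_like(priors, alpha)).sample()
    else:
        noise = torch.full_like(priors, 1 / priors.shape[-1])
    return (1 - eps) * priors + eps * noise

def entropy_bonus(logits, beta=0.01):
    # The alternative: drop root noise and add -beta*entropy to the loss,
    # as in most on-policy RL setups. This is what forces the net itself
    # to learn a noisier distribution.
    probs = logits.softmax(-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    return -beta * entropy.mean()
```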

andyljones commented 3 years ago

Been working on this as a fun distraction for the past week, and I think I've exhausted my patience with the various attacks on it now. The only success has been the uniform noise, which was impressive enough that I should write it up as part of #10.