andyljones / boardlaw

Scaling scaling laws with board games.
https://andyljones.com/boardlaw
MIT License

Policy/value split #8

Closed · andyljones closed this 3 years ago

andyljones commented 3 years ago

There's a lot of analysis that'd be easier if I could toy with the value and policy networks independently. And the PPG paper shows that splitting them can give a serious performance bump too - you just need to retain some way for the policy net to piggyback on the value net's features.
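Roughly the kind of split I have in mind, as a minimal PyTorch sketch (the module and layer names here are made up for illustration, not boardlaw's actual code):

```python
import torch
from torch import nn

class SplitNet(nn.Module):
    """Sketch of a PPG-style split: separate policy and value trunks, plus an
    auxiliary value head on the policy trunk so the policy features can still
    benefit from value-learning signal."""

    def __init__(self, obs_dim, n_actions, width=256):
        super().__init__()
        # Independent trunks, so the two nets can be analysed and trained separately
        self.policy_trunk = nn.Sequential(nn.Linear(obs_dim, width), nn.ReLU())
        self.value_trunk = nn.Sequential(nn.Linear(obs_dim, width), nn.ReLU())

        self.policy_head = nn.Linear(width, n_actions)  # policy logits
        self.value_head = nn.Linear(width, 1)           # 'real' value estimate
        self.aux_value_head = nn.Linear(width, 1)       # aux value head on the policy trunk

    def forward(self, obs):
        p = self.policy_trunk(obs)
        v = self.value_trunk(obs)
        return self.policy_head(p), self.value_head(v), self.aux_value_head(p)
```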

PPG can't be adapted directly though because we don't have fresh logits in the learner, and frankly I don't like all the hyperparams PPG adds either. So I'll need to do some exploration to figure out what works and is simple enough for my tastes.

To rephrase: is the important part of PPG the multitask learning? If it is, can I sub their KL-based distillation out for something else? If not, what is the important part of PPG?
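For reference, PPG's aux objective and the kind of substitution I'm wondering about, as a sketch (none of these names exist in boardlaw; `mcts_probs` is an assumed stand-in for stored search posteriors):

```python
import torch.nn.functional as F

def ppg_aux_loss(new_logits, old_logits, aux_value, value_target, beta_clone=1.0):
    # Fit the auxiliary value head on the policy trunk...
    value_loss = F.mse_loss(aux_value, value_target)
    # ...while a KL(old || new) term keeps the policy close to its pre-aux-phase self.
    # This is the bit that needs fresh logits from the behaviour policy.
    kl = F.kl_div(F.log_softmax(new_logits, -1), F.log_softmax(old_logits, -1),
                  log_target=True, reduction='batchmean')
    return value_loss + beta_clone * kl

def mcts_distill_loss(new_logits, mcts_probs):
    # One possible substitute when fresh logits aren't available: cross-entropy
    # against the stored MCTS visit distribution instead of a KL to old logits.
    return -(mcts_probs * F.log_softmax(new_logits, -1)).sum(-1).mean()
```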

andyljones commented 3 years ago

Did some experimentation with this and couldn't get anything promising at all. Came away thinking that AZ is actually pretty close to PPG already: you can think of PPG's policy phase as 'learning a better policy' and the aux phase as 'training against that better policy', which is exactly what the search and the network update already do in AZ.
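Spelling out the analogy, the standard AZ update already looks like PPG's two phases rolled into one step (a sketch with made-up names, not the actual learner code):

```python
import torch.nn.functional as F

def az_update_loss(logits, value, mcts_probs, outcome):
    # The search is the 'policy phase': mcts_probs is the improved policy.
    # This update is the 'aux phase': train the net against that better policy,
    # and against the game outcome for the value head.
    policy_loss = -(mcts_probs * F.log_softmax(logits, -1)).sum(-1).mean()
    value_loss = F.mse_loss(value, outcome)
    return policy_loss + value_loss
```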

So policy/aux phases aside, there's still the question of whether the optimal sample staleness and sample repetition differ between the policy and value nets. Got some ongoing experiments around that, but a proper exploration'll have to wait for the experiment runner.
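Concretely, the knobs those experiments are poking at look something like this (purely illustrative defaults; none of these names exist in the codebase):

```python
from dataclasses import dataclass

@dataclass
class HeadSamplingConfig:
    # How far back into the self-play buffer each head is allowed to sample
    policy_window: int = 8    # the policy head might only want fresh self-play
    value_window: int = 64    # the value head might tolerate much staler samples
    # How many times each sample gets reused before being dropped
    policy_repeats: int = 1
    value_repeats: int = 4
```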