andyljones / boardlaw

Scaling scaling laws with board games.
https://andyljones.com/boardlaw
MIT License

\lambda constant wrong? #15

Open fabricerosay opened 3 years ago

fabricerosay commented 3 years ago

Hello, I believe the formula for \lambda in the policy calculation is wrong (line 95 in mcts.cpp and the CUDA mcts):

p.lambda_n = m.c_puct[b]*float(N)/float(N +A);

It should be:

p.lambda_n = m.c_puct[b]*sqrt(float(N))/float(N +A);

I think that's why you found it working with c_puct as small as 1/16: it compensates for the float(N) that is too big. It should also change the whole dynamic of the PUCT algorithm.

PS: great work anyway :), I made a replica using Julia.
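For reference, a minimal sketch of the corrected schedule from Grill et al., \lambda_N = c_puct * sqrt(N) / (N + |A|), written with the same symbols as the quoted line (c_puct, N for total visits, A for the number of actions). The struct and field names in boardlaw's mcts.cpp are not reproduced here:

```cpp
#include <cmath>

// Sketch of the lambda_N schedule, lambda_N = c_puct * sqrt(N) / (N + A).
// Symbols follow the quoted line: N = total root visits, A = number of actions.
float lambda_n(float c_puct, int N, int A) {
    // Buggy version quoted above, for comparison:
    //   c_puct * float(N) / float(N + A);   // grows monotonically toward c_puct
    // Corrected version:
    return c_puct * std::sqrt(float(N)) / float(N + A);  // peaks at N = A, then decays like c/sqrt(N)
}
```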

andyljones commented 3 years ago

Well heck, that's embarrassing. Thanks so much for spotting it! I'll add a comment to the code later, and a note in the next version of the paper crediting you. 'Fabrice Rosay' is the name I should use, I presume?

I agree it'll alter the behaviour of the algorithm. My intuition is that it'll speed up exploration early in each step and likely make training even faster. I think many of the exact numbers I reported are likely to change, but I don't expect it to change the overall conclusions of the paper - what do you think?

Ideally I'd be able to find a few days to re-run the experiments, but it's unlikely right now as I'm extremely busy for the foreseeable 🙁

fabricerosay commented 3 years ago

I don't know the effect; I think it is hard to predict, so I've launched a run to test your configuration. The lower lambda is, the less you explore, so for large N your formula explores more. But if you do the maths, the classical formula with c=2 gives a higher lambda than yours with c=1/64 at 64 rollouts. The dynamic of lambda is also completely different: if you take into account A (the number of actions), then with the classical formula lambda increases until N=A and then drops towards 0, whereas with your formula lambda keeps increasing (though for A big enough compared to N, both formulas behave the same). So the classical formula's dynamic is closer to the usual Monte Carlo pattern: first explore, then exploit. Yet I realized that Grill et al.'s version of PUCT is rather different from the usual PUCT (the "local" visit counts do not enter into the formula); it's strange, but it seems to work. I did many kinds of implementations, and one strange thing is that many of them were flawed in some way, yet it was hard to spot, as they still somehow worked.
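To make the comparison concrete, a small standalone sketch; the choice of N = 64 rollouts and A = 7 actions (Connect 4) is illustrative only, not taken from the runs discussed above:

```cpp
#include <cmath>
#include <cstdio>

// Corrected schedule: lambda = c * sqrt(N) / (N + A)
float lambda_sqrt(float c, int N, int A) { return c * std::sqrt(float(N)) / float(N + A); }

// Buggy schedule from the issue: lambda = c * N / (N + A)
float lambda_linear(float c, int N, int A) { return c * float(N) / float(N + A); }

int main() {
    const int N = 64, A = 7;  // 64 rollouts, 7 actions (Connect 4); illustrative values
    std::printf("classical, c=2    : %.4f\n", lambda_sqrt(2.0f, N, A));         // ~0.225
    std::printf("buggy,     c=1/64 : %.4f\n", lambda_linear(1.0f / 64, N, A));  // ~0.014
    std::printf("buggy,     c=1/16 : %.4f\n", lambda_linear(1.0f / 16, N, A));  // ~0.056
}
```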

andyljones commented 3 years ago

My instinct is that a bad lambda_n schedule is something that manifests as 'drag' rather than as stalled performance. I suspect you could pick most any 'reasonable' schedule, throw enough compute at it and have it work.

fabricerosay commented 3 years ago

Well, I did some experiments on Connect 4:

- Baseline: standard run, cpuct=2, the good formula, 100 iterations of 32k games. The baseline is able to win vs the perfect player when it starts; at the end of the run, new policy vs old policy is always around 100% victory for the one that starts (that's expected, as the first player can force a win at Connect 4). It also achieves 6-7% errors on solved positions.
- Same run but cpuct=1 (and in general "too low cpuct"): it usually ends with many draws; I didn't test vs the perfect player; around 10-11% errors on the test sets.
- "Your formula" with cpuct=1/16 or cpuct=1/4 ends the same way as too-low cpuct.

If you follow the best run, the dynamic is always the same:

It's possible that the bad formula only drags things and you see only phase 2 after 100 iterations; maybe if the run were longer, it would evolve to phase 3. But you also usually see the loss getting very low, probably lowering exploration too much, hence it gets harder and harder to escape phase 2 (with the same parameters). On the contrary, with cpuct=2 the loss is around 1 at the end and stays around that value (whereas with low cpuct it goes down to 0.7, and it also drops much faster than with cpuct=2, as expected), so maybe you can counter this by exploring more. I wonder if it could work to use epsilon-greedy selection instead of sampling the policy, as done in https://arxiv.org/pdf/2012.11045.pdf. After all, what makes the policy good is only how good your Q-values are, whatever the way you achieve them. Somehow you can imagine completely decoupling things: one algorithm drives the search, producing another policy that is then used with a different algorithm like standard PUCT. I don't know if that's clear ;)
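For what it's worth, a minimal sketch of what epsilon-greedy root selection could look like in place of sampling from the visit-count policy; the function name and arguments are hypothetical, not taken from boardlaw or from the linked paper:

```cpp
#include <random>
#include <vector>

// Hypothetical sketch: pick the root action epsilon-greedily over the root
// visit counts instead of sampling proportionally to them. Assumes the vector
// covers only legal actions. Not boardlaw's API.
int select_root_action(const std::vector<int>& visit_counts, float epsilon, std::mt19937& rng) {
    std::uniform_real_distribution<float> coin(0.0f, 1.0f);
    if (coin(rng) < epsilon) {
        // Explore: uniform over all (legal) actions.
        std::uniform_int_distribution<int> uniform(0, int(visit_counts.size()) - 1);
        return uniform(rng);
    }
    // Exploit: the most-visited action.
    int best = 0;
    for (int a = 1; a < int(visit_counts.size()); ++a)
        if (visit_counts[a] > visit_counts[best]) best = a;
    return best;
}
```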