You missed the early game and endgame temperature parameters. Since the number of visits per move affects the RL dynamics, I would include it as an option as well. Other than that I think you included all affected parameters. I will mark my answers as either "conclusion from theoretical work" or "gut feeling".
- CPuct: at 800 visits, without noise, the equilibrium policies at the start position seem acceptable in [0.3, 0.5]. Since A) noise will flatten them a bit and B) a lower CPuct should mean a more accurate eval in the tree, I would use the lower end of the interval and propose CPuct = 0.3. [conclusion from theoretical work]
- CPuctBase and CPuctFactor: these don't play a role for RL at all, except for raising the effective CPuct at 800 visits by about 0.08; we could either lower CPuct to 0.25 or set CPuctFactor to 0 (see the cpuct-scaling sketch below). [conclusion from theoretical work]
- DirichletNoise: unlike with the current formula, moves with approx. 0 policy can still exist despite not being among the expected policies of acceptable moves. I personally wouldn't change it, since it is most probably not harmful. [gut feeling]
- PolicyTemperature: in RL, having this > 1 produces equilibrium policies with the AZ PUCT formula. However, the new UCB formula effectively behaves as if PolicyTemperature = 2.0, so leaving it at 1.0 is fine. In theory, any value in [0.55, 2.0] might be feasible; a higher value basically acts similarly to a lower number of visits or a higher cpuct, as they all produce wider equilibrium policies (see the temperature sketch below). [conclusion from theoretical work]
- FPUValue: this is the only one for which I can't really recommend anything. Compared to the current formula, N*cpuct is even with sqrt(N)*cpuct at 64 nodes when CPuct is smaller by a factor of 8, but we expect slightly higher policies for reasonable moves. Leaving it at the current value is at least not off by an order of magnitude; the expected effect would be a slightly wider search near the root and a slightly narrower search deeper in the tree. [mixed]

And one more remark concerning the loss function: since we expect somewhat wider policies, the policy loss value will probably be higher in general. However, we don't expect the problem of "overfitting the policy head too fast", since the equilibrium policies aren't one-hot, so it might even be feasible to increase the policy loss weight to enforce faster policy convergence if the statistics/value estimates of moves shift.
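As a quick check on the 0.08 figure mentioned for CPuctFactor: a minimal sketch, assuming the usual lc0 cpuct scaling cpuct_eff(N) = CPuct + CPuctFactor * ln((N + CPuctBase) / CPuctBase) with the defaults CPuctFactor = 2.0 and CPuctBase = 19652 (the defaults are my assumption, not stated in this thread):

```python
# Minimal sketch of the cpuct scaling term at 800 visits, assuming the form
# cpuct_eff(N) = cpuct + cpuct_factor * ln((N + cpuct_base) / cpuct_base)
# with the (assumed) defaults cpuct_factor = 2.0 and cpuct_base = 19652.
import math

def cpuct_eff(n, cpuct=0.3, cpuct_factor=2.0, cpuct_base=19652.0):
    return cpuct + cpuct_factor * math.log((n + cpuct_base) / cpuct_base)

print(round(cpuct_eff(800) - cpuct_eff(0), 3))   # ~0.08 extra cpuct at 800 visits
print(cpuct_eff(800, cpuct_factor=0.0))          # 0.3: scaling disabled via CPuctFactor = 0
```

Either option (CPuctFactor = 0, or lowering CPuct to ~0.25 with scaling left on) keeps the effective cpuct at 800 visits near the proposed 0.3.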
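And for the PolicyTemperature point, a toy sketch of how a temperature T widens a policy, assuming it is applied as p^(1/T) followed by renormalization (my reading of how a policy softmax temperature acts, not something specified in this thread):

```python
# Toy example: a policy temperature T applied as p**(1/T), then renormalized,
# flattens the policy for T > 1 and sharpens it for T < 1 (assumed behavior).
import numpy as np

def apply_temperature(policy, temp):
    scaled = np.asarray(policy, dtype=float) ** (1.0 / temp)
    return scaled / scaled.sum()

raw = [0.60, 0.25, 0.10, 0.05]     # toy raw policy over four moves
for temp in (0.55, 1.0, 2.0):      # endpoints of the feasible interval above
    print(temp, apply_temperature(raw, temp).round(3))
```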
I agree with @Naphthalin's values:
Standard lc0 with training defaults (cpuct=2.5):
New-u, cpuct 0.3 with training defaults:
New-u, cpuct 0.45 with training defaults:
An interactive version is available (temporarily) at the link below. One can use the slider to compare cpuct values; the standard lc0 is the leftmost. https://script.google.com/macros/s/AKfycbzA7OyZd_j_ZZ98HWDpkbTRmFMiivnbDrPIvJEKk47SgMQPI574/exec?file_name=New-u_cpucts.html
New-U cpuct tuning results indicate that the strongest play is in the range [0.4, 0.5] for 800 nodes and standard training params. I didn't get as many samples as I wished and don't have time to run it for longer.
Used method: the below plot shows the score difference as a function of cpuct. Red dots are the observations and the green line is the prediction of the surrogate model.
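A rough sketch of such a Bayesian tuning loop (the reply below mentions skopt was used), assuming a hypothetical play_match(cpuct) helper that plays a fixed match at the given cpuct against the baseline settings and returns the score difference:

```python
# Sketch of tuning cpuct with scikit-optimize; play_match is a placeholder
# for running an actual lc0 match at 800 nodes and returning the score
# difference vs. the baseline configuration.
from skopt import gp_minimize
from skopt.space import Real

def play_match(cpuct):
    # Placeholder: the real tuning would launch lc0 games with --cpuct=<cpuct>
    # and count the score. This dummy curve only makes the sketch runnable.
    return -(cpuct - 0.45) ** 2

def objective(params):
    return -play_match(params[0])  # gp_minimize minimizes, so negate the score

result = gp_minimize(
    objective,
    [Real(0.2, 0.8, name="cpuct")],  # assumed search range for the new-U cpuct
    n_calls=20,                      # each evaluation is one red dot in the plot
    random_state=0,
)
print("best cpuct:", result.x[0])
```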
Thank you @jkormu for your plots and tuning. I like the way you use skopt, which I prefer over CLOP.
Just from looking at the tree plots (ignoring for now that the policies are more concentrated than we expect them to be in our run), higher cpuct values around 0.45 appear to result in a tree resembling the default cpuct while being wider at low depth. The tuning also seems to indicate that this region gives the strongest play. Since our goal for the reinforcement learning run is anyway to test the hypothesis that new_u results in better opening diversity, I would propose going for cpuct ∈ [0.35, 0.45]. Thoughts?
@jkormu thanks for your tests, especially concerning FPU and cpuct! If I understand it right, this also includes the 0.08 increase of cpuct at the root due to scaling? This detail is important because it mostly affects the expected equilibrium policies of opening moves.
Since we expect a bit flatter policies, I would suggest going to the lower end of the sensible interval, which should then effectively result in comparable exploration.
I took a closer look at the FPU and Dirichlet noise parameters.
If we want to be as close as possible to the current behavior, we actually need to reduce the amount of Dirichlet noise, since the new-U formula behaves differently at small child_visits:
One can calculate the policy a move needs in order to get its 1st visit (assuming FPU 1.0), its 3rd visit, or its 6th visit, assuming its current Q estimate is 0.2 below the best Q:
| formula | cpuct | 1st visit | 3rd visit | 6th visit |
|---------|-------|-----------|-----------|-----------|
| old-U   | 2.5   | 1.4%      | 0.85%     | 1.7%      |
| new-U   | 0.3   | 0.4%      | 0.43%     | 1.2%      |
| new-U   | 0.5   | 0.25%     | 0.26%     | 0.7%      |
The current Dirichlet noise distributes 25% policy between the ~20-25 possible moves, so about 1% per move. As the needed policies in the above scenarios are smaller by a factor of [2, 3.5] for cpuct 0.3 and [2.5, 5] for cpuct 0.5, it would be sensible to reduce the 25% Dirichlet noise to 10% to make this run as comparable to the recent runs as possible.
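As a small illustration of that reduction (not lc0 code), assuming the usual root-noise mixing P' = (1 - eps) * P + eps * Dirichlet(alpha) with alpha = 0.3 (the alpha value is an assumption, not stated here):

```python
# Toy illustration of root Dirichlet noise mixing, assuming the scheme
# P' = (1 - eps) * P + eps * Dirichlet(alpha) with alpha = 0.3 (assumed).
import numpy as np

rng = np.random.default_rng(0)
num_moves = 20                                # roughly the number of legal root moves
prior = np.full(num_moves, 1.0 / num_moves)   # placeholder network policy

for eps in (0.25, 0.10):                      # current vs. proposed noise share
    noise = rng.dirichlet([0.3] * num_moves)
    mixed = (1.0 - eps) * prior + eps * noise
    # On average the noise hands each move eps / num_moves extra policy,
    # i.e. ~1.25% at eps = 0.25 and ~0.5% at eps = 0.10.
    print(f"eps={eps}: avg noise per move = {eps / num_moves:.2%}, "
          f"min mixed policy = {mixed.min():.2%}")
```

Dropping eps from 0.25 to 0.10 scales the typical extra policy per move down by roughly the same factor of ~2.5 that the table above suggests for the needed policies.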
FPU should stay at 1.0 since the expected behavior falls in line with the behavior at 2 and 5 visits.
@Naphthalin, in all these visualizations and in the tuning run, cpuct-factor=0.0, which disables the scaling. This is the default in training.
For some reason I missed that. That simplifies things, thank you very much!
Since cpuct scaling is apparently disabled in training, I think we can cross that off the list as well. For deciding cpuct, it depends on whether we want to explore openings as diversely as possible or whether we still want to see favorite lines and ~50% of training games starting with e4.
After discussions and elaboration we arrived at a sensible configuration of the parameters for a test run. I updated the issue using the current consensus.
The alternative solution of using --policy-softmax-temp=1.2 has been used successfully since T58/T60, and T70 showed the possibility of reducing --noise-epsilon as well without seeing policies collapse.
Also, #918 was never merged (and doesn't need to be, see #913), so this issue has served its purpose and can be closed.
Continuing the discussion started in #913 and #918, I would like to start planning a reinforcement learning test run. Specifically, what parameters need to be changed, and to which values? I will mark as finished all parameters for which we have a good value or a reason to leave them at the default.
Parameters
- CPuct = 0.45 or [0.3, 0.45]
- CPuctBase = (default)
- ~~CPuctFactor = 0~~
- DirichletNoise = 0.1
- PolicyTemperature = 1.0 or [0.55, 2.0]
- FPUValue = default
- Temperature = 1.0
- TempEndgame ∈ [0.2, 0.45]
- TempVisitOffset = 0
- policy_loss_weight = 1.0 (default)
- MinimumKLDGainPerNode = 0.0
Let me know if I forgot another parameter impacted by the U formula change.
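For reference, a rough sketch of how these values might map onto selfplay flags; the flag spellings are my assumptions based on the parameter names above plus the --policy-softmax-temp and --noise-epsilon flags mentioned earlier, so check them against the actual lc0 client before use.

```python
# Hypothetical mapping of the proposed parameters to lc0 selfplay flags.
# Flag names are assumptions inferred from the parameter names in this issue;
# verify against `lc0 --help` before launching anything.
proposed = {
    "--cpuct": 0.45,                       # or something in [0.3, 0.45]
    "--cpuct-factor": 0.0,                 # scaling disabled, as in training
    "--noise-epsilon": 0.1,                # reduced Dirichlet noise share
    "--policy-softmax-temp": 1.0,
    "--temperature": 1.0,
    "--temp-endgame": 0.45,                # from the proposed [0.2, 0.45] range
    "--temp-visit-offset": 0,
    "--minimum-kld-gain-per-node": 0.0,
}

cmd = ["lc0", "selfplay"] + [f"{flag}={value}" for flag, value in proposed.items()]
print(" ".join(cmd))
```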