You missed the early game and endgame temperature parameters. Since the number of visits per move affects the RL dynamics, I would include it as an option as well. Other than that I think you included all affected parameters. I will mark my answers as either "conclusion from theoretical work" or "gut feeling".
- CPuct: at 800 visits, without noise, the equilibrium policies at the start position seem acceptable in [0.3, 0.5]. Since A) noise will flatten them a bit and B) a lower CPuct should mean a more accurate eval in the tree, I would use the lower end of the interval and propose CPuct = 0.3. [conclusion from theoretical work]
- CPuctBase and CPuctFactor: these don't play a role for RL at all, except for raising the effective CPuct at 800 visits by about 0.08; we could either lower CPuct to 0.25 or set CPuctFactor to 0 (see the cpuct-scaling sketch below). [conclusion from theoretical work]
- DirichletNoise: unlike with the current formula, moves with approx. 0 policy can still exist despite not being among the expected policies of acceptable moves. I personally wouldn't change it, since it is most probably not harmful. [gut feeling]
- PolicyTemperature: in RL, having this > 1 produces equilibrium policies with the AZ PUCT formula. However, the new UCB formula effectively behaves as if PolicyTemperature = 2.0, so leaving it at 1.0 is fine. In theory, any value in [0.55, 2.0] might be feasible; a higher value basically acts similarly to a lower number of visits or a higher cpuct, as they all produce wider equilibrium policies (see the temperature sketch below). [conclusion from theoretical work]
- FPUValue: this is the only one for which I can't really recommend anything. Compared to the current formula, N*cpuct is even with sqrt(N)*cpuct at 64 nodes when CPuct is smaller by a factor of 8, but we expect slightly higher policies for reasonable moves. Leaving it at the current value is at least not off by an order of magnitude; the expected effect would be a slightly wider search near the root and a slightly narrower search deeper in the tree. [mixed]

And one more remark concerning the loss function: since we expect somewhat wider policies, the policy loss value will probably be higher in general. However, we don't expect the problem of "overfitting the policy head too fast", since the equilibrium policies aren't one-hot, so it might even be feasible to increase the policy loss weight to enforce faster policy convergence if the statistics/value estimates of moves shift.
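As a quick check on the 0.08 figure mentioned for CPuctFactor: a minimal sketch, assuming the usual lc0 cpuct scaling cpuct_eff(N) = CPuct + CPuctFactor * ln((N + CPuctBase) / CPuctBase) with the defaults CPuctFactor = 2.0 and CPuctBase = 19652 (the defaults are my assumption, not stated in this thread):

```python
# Minimal sketch of the cpuct scaling term at 800 visits, assuming the form
# cpuct_eff(N) = cpuct + cpuct_factor * ln((N + cpuct_base) / cpuct_base)
# with the (assumed) defaults cpuct_factor = 2.0 and cpuct_base = 19652.
import math

def cpuct_eff(n, cpuct=0.3, cpuct_factor=2.0, cpuct_base=19652.0):
    return cpuct + cpuct_factor * math.log((n + cpuct_base) / cpuct_base)

print(round(cpuct_eff(800) - cpuct_eff(0), 3))   # ~0.08 extra cpuct at 800 visits
print(cpuct_eff(800, cpuct_factor=0.0))          # 0.3: scaling disabled via CPuctFactor = 0
```

Either option (CPuctFactor = 0, or lowering CPuct to ~0.25 with scaling left on) keeps the effective cpuct at 800 visits near the proposed 0.3.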
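And for the PolicyTemperature point, a toy sketch of how a temperature T widens a policy, assuming it is applied as p^(1/T) followed by renormalization (my reading of how a policy softmax temperature acts, not something specified in this thread):

```python
# Toy example: a policy temperature T applied as p**(1/T), then renormalized,
# flattens the policy for T > 1 and sharpens it for T < 1 (assumed behavior).
import numpy as np

def apply_temperature(policy, temp):
    scaled = np.asarray(policy, dtype=float) ** (1.0 / temp)
    return scaled / scaled.sum()

raw = [0.60, 0.25, 0.10, 0.05]     # toy raw policy over four moves
for temp in (0.55, 1.0, 2.0):      # endpoints of the feasible interval above
    print(temp, apply_temperature(raw, temp).round(3))
```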
I agree with @Naphthalin's values:
Standard lc0 with training defaults (cpuct=2.5):
New-u, cpuct 0.3 with training defaults:
New-u, cpuct 0.45 with training defaults:
An interactive version is available (temporarily) at the link below. One can use the slider to compare cpuct values; the standard lc0 is the leftmost. https://script.google.com/macros/s/AKfycbzA7OyZd_j_ZZ98HWDpkbTRmFMiivnbDrPIvJEKk47SgMQPI574/exec?file_name=New-u_cpucts.html
New-U cpuct tuning results indicate that the strongest play is in the range [0.4, 0.5] for 800 nodes and standard training params. I didn't get as many samples as I wished and don't have time to run it for longer.
Used method: the below plot shows the score difference as a function of cpuct. Red dots are the observations and the green line is the prediction of the surrogate model.
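A rough sketch of such a Bayesian tuning loop (the reply below mentions skopt was used), assuming a hypothetical play_match(cpuct) helper that plays a fixed match at the given cpuct against the baseline settings and returns the score difference:

```python
# Sketch of tuning cpuct with scikit-optimize; play_match is a placeholder
# for running an actual lc0 match at 800 nodes and returning the score
# difference vs. the baseline configuration.
from skopt import gp_minimize
from skopt.space import Real

def play_match(cpuct):
    # Placeholder: the real tuning would launch lc0 games with --cpuct=<cpuct>
    # and count the score. This dummy curve only makes the sketch runnable.
    return -(cpuct - 0.45) ** 2

def objective(params):
    return -play_match(params[0])  # gp_minimize minimizes, so negate the score

result = gp_minimize(
    objective,
    [Real(0.2, 0.8, name="cpuct")],  # assumed search range for the new-U cpuct
    n_calls=20,                      # each evaluation is one red dot in the plot
    random_state=0,
)
print("best cpuct:", result.x[0])
```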
Thank you @jkormu for your plots and tuning. I like the way you use skopt, which I prefer over CLOP.
Just from looking at the tree plots (ignoring for now that the policies are more concentrated than we expect them to be in our run), higher cpuct values around 0.45 appear to result in a tree resembling the default cpuct while being wider at low depth. The tuning also seems to indicate that this region gives the strongest play. Since our goal for the reinforcement learning run is anyway to test the hypothesis that new_u results in better opening diversity, I would propose going for cpuct ∈ [0.35, 0.45]. Thoughts?
@jkormu thanks for your tests, especially concerning FPU and cpuct! If I understand it right, this also includes the 0.08 increase of cpuct at the root due to scaling? This detail is important because it mostly affects the expected equilibrium policies of opening moves.
Since we expect a bit flatter policies, I would suggest going to the lower end of the sensible interval, which should then effectively result in comparable exploration.
I took a closer look at the FPU and Dirichlet noise parameters.
If we want to be as close as possible to the current behavior, we actually need to reduce the amount of Dirichlet noise, since the new-U formula behaves differently at small child_visits:
One can calculate the policy a move needs in order to get its 1st visit (assuming FPU 1.0), its 3rd visit, or its 6th visit, assuming its current Q estimate is 0.2 below the best Q:
| formula | cpuct | 1st visit | 3rd visit | 6th visit |
|---------|-------|-----------|-----------|-----------|
| old-U   | 2.5   | 1.4%      | 0.85%     | 1.7%      |
| new-U   | 0.3   | 0.4%      | 0.43%     | 1.2%      |
| new-U   | 0.5   | 0.25%     | 0.26%     | 0.7%      |
The current Dirichlet noise distributes 25% policy between the ~20-25 possible moves, so about 1% per move. As the needed policies in the above scenarios are smaller by a factor of [2, 3.5] for cpuct 0.3 and [2.5, 5] for cpuct 0.5, it would be sensible to reduce the 25% Dirichlet noise to 10% to make this run as comparable to the recent runs as possible.
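As a small illustration of that reduction (not lc0 code), assuming the usual root-noise mixing P' = (1 - eps) * P + eps * Dirichlet(alpha) with alpha = 0.3 (the alpha value is an assumption, not stated here):

```python
# Toy illustration of root Dirichlet noise mixing, assuming the scheme
# P' = (1 - eps) * P + eps * Dirichlet(alpha) with alpha = 0.3 (assumed).
import numpy as np

rng = np.random.default_rng(0)
num_moves = 20                                # roughly the number of legal root moves
prior = np.full(num_moves, 1.0 / num_moves)   # placeholder network policy

for eps in (0.25, 0.10):                      # current vs. proposed noise share
    noise = rng.dirichlet([0.3] * num_moves)
    mixed = (1.0 - eps) * prior + eps * noise
    # On average the noise hands each move eps / num_moves extra policy,
    # i.e. ~1.25% at eps = 0.25 and ~0.5% at eps = 0.10.
    print(f"eps={eps}: avg noise per move = {eps / num_moves:.2%}, "
          f"min mixed policy = {mixed.min():.2%}")
```

Dropping eps from 0.25 to 0.10 scales the typical extra policy per move down by roughly the same factor of ~2.5 that the table above suggests for the needed policies.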
FPU should stay at 1.0 since the expected behavior falls in line with the behavior at 2 and 5 visits.
@Naphthalin, in all these visualizations and in the tuning run, cpuct-factor=0.0, which disables the scaling. This is the default in training.
For some reason I missed that. That simplifies things, thank you very much!
Since cpuct scaling is apparently disabled in training, I think we can cross that off the list as well. For deciding cpuct, it depends on whether we want to explore openings as diversely as possible or whether we still want to see favorite lines and ~50% of training games starting with e4.
After discussions and elaboration we arrived at a sensible configuration of the parameters for a test run. I updated the issue using the current consensus.
The alternative solution of using --policy-softmax-temp=1.2 has been used successfully since T58/T60, and T70 showed the possibility of reducing --noise-epsilon as well without seeing policies collapse.
Also, #918 was never merged (and doesn't need to be, see #913), so this issue has served its purpose and can be closed.
Continuing the discussion started in #913 and #918, I would like to start planning a reinforcement learning test run. Specifically, what parameters need to be changed, and to which values? I will mark as finished all parameters for which we have a good value or a reason to leave them at the default.
Parameters
- CPuct = 0.45 or [0.3, 0.45]
- CPuctBase = (default)
- ~~CPuctFactor = 0~~
- DirichletNoise = 0.1
- PolicyTemperature = 1.0 or [0.55, 2.0]
- FPUValue = default
- Temperature = 1.0
- TempEndgame ∈ [0.2, 0.45]
- TempVisitOffset = 0
- policy_loss_weight = 1.0 (default)
- MinimumKLDGainPerNode = 0.0
Let me know if I forgot another parameter impacted by the U formula change.
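For reference, a rough sketch of how these values might map onto selfplay flags; the flag spellings are my assumptions based on the parameter names above plus the --policy-softmax-temp and --noise-epsilon flags mentioned earlier, so check them against the actual lc0 client before use.

```python
# Hypothetical mapping of the proposed parameters to lc0 selfplay flags.
# Flag names are assumptions inferred from the parameter names in this issue;
# verify against `lc0 --help` before launching anything.
proposed = {
    "--cpuct": 0.45,                       # or something in [0.3, 0.45]
    "--cpuct-factor": 0.0,                 # scaling disabled, as in training
    "--noise-epsilon": 0.1,                # reduced Dirichlet noise share
    "--policy-softmax-temp": 1.0,
    "--temperature": 1.0,
    "--temp-endgame": 0.45,                # from the proposed [0.2, 0.45] range
    "--temp-visit-offset": 0,
    "--minimum-kld-gain-per-node": 0.0,
}

cmd = ["lc0", "selfplay"] + [f"{flag}={value}" for flag, value in proposed.items()]
print(" ".join(cmd))
```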