LeelaChessZero / lc0

The rewritten engine, originally for TensorFlow. Now all other backends have been ported here.

Discussion about reinforcement learning test run of new U formula #924

Closed: kiudee closed this issue 4 years ago

kiudee commented 5 years ago

Continuing the discussion started in #913 and #918, I would like to start planning a reinforcement learning test run. Specifically, what parameters need to be changed, and to which values? I will mark as finished all parameters for which we have a good value or a reason to leave them at the default.

Parameters

Let me know if I forgot another parameter impacted by the U formula change.
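
For context, here is a minimal sketch (not part of the original discussion) of the standard AlphaZero/lc0 PUCT exploration term that the "new U" formula from #913/#918 would replace; the new formula itself is deliberately not reproduced here, and the example numbers are purely illustrative.

```python
# Minimal sketch of the standard (old) PUCT selection score, assuming the
# usual lc0/AlphaZero form U = cpuct * P * sqrt(N_parent) / (1 + N_child).
# The "new U" variant discussed in #913/#918 is not reproduced here.
import math

def puct_score(q, prior, parent_visits, child_visits, cpuct=2.5):
    """Selection score of a child: its value estimate Q plus the exploration term U."""
    u = cpuct * prior * math.sqrt(parent_visits) / (1.0 + child_visits)
    return q + u

# Illustrative numbers: an unvisited child with a 5% prior at an 800-visit root.
print(puct_score(q=-0.1, prior=0.05, parent_visits=800, child_visits=0))
```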

Naphthalin commented 5 years ago

You missed the early-game and endgame temperature parameters. Since the number of visits per move affects the RL dynamics, I would include it as an option as well. Other than that, I think you included all affected parameters. I will mark my answers as either "conclusion from theoretical work" or "gut feeling".

Naphthalin commented 5 years ago

And one more remark concerning the loss function: since we expect somewhat wider policies, the policy loss value will probably be higher in general. However, we don't expect the problem of "overfitting the policy head too fast", since the equilibrium policies aren't one-hot. It might even be feasible to increase the policy loss weight to enforce faster policy convergence when the statistics/value estimates of moves shift.

jkormu commented 5 years ago

I agree with @Naphthalin's values:

Standard lc0 with training defaults (cpuct=2.5): [tree visualization image]

New-u, cpuct 0.3 with training defaults: [tree visualization image]

New-u, cpuct 0.45 with training defaults: [tree visualization image]

An interactive version is available (temporarily) at the link below. One can use the slider to compare cpuct values; the standard lc0 is the leftmost. https://script.google.com/macros/s/AKfycbzA7OyZd_j_ZZ98HWDpkbTRmFMiivnbDrPIvJEKk47SgMQPI574/exec?file_name=New-u_cpucts.html

jkormu commented 5 years ago

New-u cpuct tuning results indicate that the strongest play lies roughly in the range [0.4, 0.5] for 800 nodes and standard training parameters. I didn't get as many samples as I wished and don't have time to run it for longer.

Used method:

The plot below shows the score difference as a function of cpuct. Red dots are the observations and the green line is the prediction of the surrogate model.

[plot: score difference vs. cpuct, with observations and surrogate-model prediction]
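
For readers unfamiliar with the approach, here is a minimal sketch of such a surrogate-model tuning loop with scikit-optimize (skopt, which is mentioned in the next comment). The play_match helper, game count, and search range are hypothetical placeholders, not the actual tuning script.

```python
# Hedged sketch of Bayesian optimization of cpuct with scikit-optimize.
# play_match() is a hypothetical helper that plays a batch of 800-node games
# at the given cpuct and returns the score difference against a reference.
from skopt import gp_minimize
from skopt.space import Real

def objective(params):
    cpuct = params[0]
    score_diff = play_match(cpuct, nodes=800, games=100)  # hypothetical helper
    return -score_diff  # gp_minimize minimizes, so negate the score

result = gp_minimize(
    objective,
    dimensions=[Real(0.1, 1.0, name="cpuct")],  # illustrative search range
    n_calls=30,                                 # tuning iterations
    random_state=0,
)
print("best cpuct:", result.x[0], "predicted score diff:", -result.fun)
```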

kiudee commented 5 years ago

Thank you @jkormu for your plots and tuning. I like the way you use skopt, which I prefer over CLOP.

Just from looking at the tree plots (ignoring for now that the policies are more concentrated than we expect them to be in our run), higher cpuct values around 0.45 appear to result in a tree resembling the default cpuct while being wider at low depth. The tuning also seems to indicate that this region results in the strongest play. Since our goal for the reinforcement learning run is in any case to test the hypothesis that new_u results in better opening diversity, I would propose going for cpuct ∈ [0.35, 0.45]. Thoughts?

Naphthalin commented 5 years ago

@jkormu thanks for your tests, especially concerning FPU and cpuct! If I understand it right, does this also include the ~0.08 increase of cpuct at the root due to scaling? This detail is important because it mostly affects the expected equilibrium policies of opening moves.
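
As a hedged check of the ~0.08 figure, assuming the usual lc0 cpuct scaling formula and the engine defaults of the time (cpuct-factor=2.0, cpuct-base=19652; both values are assumptions here, not confirmed by this thread):

```python
# Assumed lc0 cpuct scaling:
# cpuct(N) = cpuct + cpuct_factor * log((N + cpuct_base) / cpuct_base)
import math

def scaled_cpuct(n_visits, cpuct=2.5, cpuct_factor=2.0, cpuct_base=19652):
    return cpuct + cpuct_factor * math.log((n_visits + cpuct_base) / cpuct_base)

# At an 800-visit root the exploration constant rises by roughly 0.08.
print(scaled_cpuct(800) - scaled_cpuct(0))  # ~0.08
```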

Since we expect a bit flatter policies, I would suggest going to the lower end of the sensible interval, which should then effectively result in comparable exploration.

Naphthalin commented 5 years ago

I took a closer look at the FPU and Dirichlet noise parameters.

If we want to be as close as possible to the current behavior, we actually need to reduce the amount of Dirichlet noise, since the new-U formula behaves differently at small child visit counts:

One can calculate the policy a move needs to get its 1st visit (assuming FPU 1.0), its 3rd visit, or its 6th visit, assuming its current Q estimate is 0.2 below the best Q:

| formula | cpuct | 1st   | 3rd   | 6th  |
|---------|-------|-------|-------|------|
| old-U   | 2.5   | 1.4%  | 0.85% | 1.7% |
| new-U   | 0.3   | 0.4%  | 0.43% | 1.2% |
| new-U   | 0.5   | 0.25% | 0.26% | 0.7% |

The current Dirichlet noise distributes 25% of the policy among the ~20-25 possible moves, so about 1% per move. As the needed policies in the different scenarios are smaller by a factor of [2, 3.5] for cpuct 0.3 and [2.5, 5] for cpuct 0.5, it would be sensible to reduce the 25% Dirichlet noise to 10% to make this run as comparable to the recent runs as possible.
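
A quick back-of-the-envelope check of the per-move numbers above (the noise mixing P' = (1 - eps) * P + eps * Dir(alpha) is the standard AlphaZero scheme; the move counts are the ones quoted above):

```python
# With noise-epsilon = 0.25 spread over roughly 20-25 legal moves, the
# expected noise added per move is about 1%; dropping epsilon to 0.10
# brings it down to roughly 0.4-0.5%, in line with the factor-of-2-to-2.5
# reduction in the policies needed under new-U.
for eps in (0.25, 0.10):
    for n_moves in (20, 25):
        print(f"eps={eps:.2f}, {n_moves} moves -> ~{eps / n_moves:.2%} noise per move")
```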

FPU should stay at 1.0 since the expected behavior falls in line with the behavior at 2 and 5 visits.
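
For context, a minimal sketch of FPU "reduction" as it is usually described for lc0 (assumption: fpu-strategy=reduction, i.e. an unvisited child is scored from the parent's Q minus the FPU value times the square root of the policy mass already visited; this exact form is not confirmed by the thread):

```python
import math

def fpu_q(parent_q, visited_policy_mass, fpu_value=1.0):
    # Q assigned to a not-yet-visited child under FPU reduction (assumed form).
    return parent_q - fpu_value * math.sqrt(visited_policy_mass)

print(fpu_q(parent_q=0.2, visited_policy_mass=0.6))  # illustrative numbers
```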

jkormu commented 5 years ago

@Naphthalin, in all these visualizations and in the tuning run, cpuct-factor=0.0, which disables the scaling. This is the default in training.

Naphthalin commented 5 years ago

For some reason I missed that. That simplifies things, thank you very much!

Naphthalin commented 5 years ago

Since cpuct scaling is apparently disabled in training, I think we can cross that off the list as well. For deciding on cpuct, it depends on whether we want to explore openings as diversely as possible or whether we still want to see favorite lines and ~50% of training games opening with e4.

kiudee commented 5 years ago

After discussion and elaboration we arrived at a sensible configuration of the parameters for a test run. I updated the issue to reflect the current consensus.

Naphthalin commented 4 years ago

The alternative solution of using --policy-softmax-temp=1.2 has been used successfully since T58/T60, and T70 showed that --noise-epsilon can be reduced as well without policies collapsing.
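
For illustration only, a small numpy sketch (not lc0 internals) of how a policy softmax temperature above 1 flattens the priors; dividing the policy logits by T = 1.2 before the softmax is, up to normalization, what such a temperature does:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                      # for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.5, -1.0]        # illustrative policy-head logits
print(softmax_with_temperature(logits, 1.0))   # sharper policy
print(softmax_with_temperature(logits, 1.2))   # flatter policy
```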

Also, #918 was never merged (and doesn't need to be; see #913), so this issue has served its purpose and can be closed.