LeelaChessZero / lc0

The rewritten engine, originally for TensorFlow. Now all other backends have been ported here.
GNU General Public License v3.0

Initialize Q = 0 instead of parent Q for self-play to match AGZ paper #344

Closed: Mardak closed this issue 5 years ago

Mardak commented 6 years ago

From the AGZ paper: [screenshot of the expansion step, which initializes each new edge to N = 0, W = 0, Q = 0, P = p_a]

I realize that initializing to the parent Q is common, that additionally reducing it with first play urgency is what Leela Zero does and what lczero/lc0 copied, and that doing so with tuning can improve match strength.

However, self-play training data quality can be reduced by using "match settings" for self-play, which is why there are already different default values for cpuct, softmax, fpu reduction, etc., as well as code that is turned on/off specifically for self-play, e.g., https://github.com/gcp/leela-zero/pull/1083

Initializing Q = 0 instead of parent Q makes it so that in winning positions, the first "good enough" move found will likely have a dominating Q > 0 relative to the unvisited Q = 0 moves. Similarly, from a losing position, search naturally goes wider, as the usual "best" move probably still has a Q < 0, leading to visits of the seemingly Q = 0 moves.
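
To make the difference concrete, here is a minimal sketch of PUCT child selection (names like `parent_q`, `fpu_reduction`, and `q_init_zero` are illustrative; this is not the actual lc0/lczero code), with the single line this issue is about marked in a comment:

```cpp
// Minimal sketch of PUCT child selection; illustrative names only,
// not the actual lc0/lczero implementation.
#include <cmath>
#include <limits>
#include <vector>

struct Child {
  float q = 0.0f;      // mean value of the visited subtree, from side to move
  float prior = 0.0f;  // policy prior P(s, a)
  int visits = 0;
};

int SelectChild(const std::vector<Child>& children, float parent_q,
                int parent_visits, float cpuct, float fpu_reduction,
                bool q_init_zero) {
  int best = -1;
  float best_score = -std::numeric_limits<float>::infinity();
  for (int i = 0; i < static_cast<int>(children.size()); ++i) {
    const Child& c = children[i];
    float q = c.q;
    if (c.visits == 0) {
      // The line this issue is about: AGZ-style Q = 0 versus
      // parent Q minus an FPU reduction for unvisited moves.
      q = q_init_zero ? 0.0f : parent_q - fpu_reduction;
    }
    const float u = cpuct * c.prior *
                    std::sqrt(static_cast<float>(parent_visits)) /
                    (1 + c.visits);
    if (q + u > best_score) {
      best_score = q + u;
      best = i;
    }
  }
  return best;
}
```

With `q_init_zero = true`, a parent with Q > 0 makes the unvisited moves look comparatively unattractive so search stays narrow, while a parent with Q < 0 makes them look comparatively attractive so search goes wide.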

Almost all of the positions in #8 happen to be losing positions with one good move, so Q = 0 ends up finding all of the expected tactical moves without even needing 800 visits: once search starts going wide, it realizes the "hidden" very-low-prior tactical move is actually the best of all possible moves.

But even setting aside the fact that Q = 0 happens to improve tactical training in those select positions, it sounds like a main project goal is to "reproduce AZ as closely as possible." The AZ paper itself gives no details about Q initialization, so falling back to the earlier AGZ paper would seem to imply that unvisited moves should have Q = 0 for self-play.

leedavid commented 6 years ago

Yes, I think your idea is very interesting.

Videodr0me commented 6 years ago

I tried that already for playing outside of training; there it's worse (at least under my testing conditions): https://github.com/Videodr0me/leela-chess-experimental/wiki/Sanity-Tests

It's not clear what DM used for A0, but based on my tests, for chess at least and non-training, parent Q seems strongest.

As for what works best in training (as opposed to playing outside of training), that's another question, but the strength difference is rather large, so there are pros and cons. It could be tried in training, but I would not expect miracles...

mooskagh commented 6 years ago

That's also what lc0 used back when lczero was still the official engine. It was also shown to be weaker.

Mardak commented 6 years ago

It was also shown to be weaker.

Shown how? Which networks used training data that were generated with Q=0?

Mardak commented 6 years ago

non-training, parent Q seems strongest.

I agree that having something for FPU can be very important for playing matches, but using that to try to infer the quality of training data is very misguided. For example, the randomness added by noise and temperature purposely gives up playing at "full strength" so that the network can learn from better training data.
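
For reference, here is a rough sketch of how that deliberate randomness is injected in AGZ-style self-play; the constants (eps = 0.25, Dirichlet alpha = 0.03, tau = 1 for the opening moves) are the ones reported in the AGZ paper for Go, and the code itself is an illustration, not lc0's implementation:

```cpp
// Illustrative AGZ-style self-play randomness; not lc0's actual code.
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Mix Dirichlet noise into the root priors: P'(a) = (1 - eps) * P(a) + eps * n_a.
// AGZ used eps = 0.25 and alpha = 0.03 for Go (AZ used a larger alpha for chess).
void AddRootNoise(std::vector<float>& priors, float eps, float alpha,
                  std::mt19937& rng) {
  std::gamma_distribution<float> gamma(alpha, 1.0f);
  std::vector<float> noise(priors.size());
  float sum = 0.0f;
  for (auto& n : noise) {
    n = gamma(rng);
    sum += n;
  }
  for (std::size_t i = 0; i < priors.size(); ++i) {
    priors[i] = (1.0f - eps) * priors[i] + eps * (noise[i] / sum);
  }
}

// Pick the move to play proportionally to visit_count^(1/tau); AGZ used
// tau = 1 for the first 30 moves and an infinitesimal tau (argmax) afterwards.
int SampleMove(const std::vector<int>& visits, float tau, std::mt19937& rng) {
  std::vector<double> weights;
  for (int v : visits) {
    weights.push_back(std::pow(static_cast<double>(v), 1.0 / tau));
  }
  std::discrete_distribution<int> dist(weights.begin(), weights.end());
  return dist(rng);
}
```

Both of these deliberately give up Elo during self-play in exchange for more diverse training data, which is the same trade-off being argued for Q = 0 here.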

Similarly, Q=0 means there's a clearer difference in how training works for the losing vs. the winning side, instead of self-play games trying to "do the usual thing." With Q=0, the losing side's purpose is to search wide and find good moves that are hidden, while the winning side's purpose is to reinforce and validate whether a move is actually good.

Mardak commented 6 years ago

Looking through the commit history, I don't see when lczero ever used FPU = 0.

Add recursive search depth, remove FPU VL bug (jkiliani, April 29)
https://github.com/LeelaChessZero/lczero/blame/5e74337a05f3ed5d2baa1d7db2ff23f96306f1b4/src/UCTNode.cpp#L338

    auto fpu_eval = (cfg_fpu_dynamic_eval ? get_raw_eval(color) : net_eval) - fpu_reduction;

Add fpu_dynamic_eval option and enable it (Tilps, April 11)
https://github.com/LeelaChessZero/lczero/blame/cd8b1c299630b9dae37353547eeb103255815aa1/src/UCTNode.cpp#L338

    auto fpu_eval = (cfg_fpu_dynamic_eval ? get_eval(color) : net_eval) - fpu_reduction;

Return fpu eval to use the static net eval as starting point (Tilps, April 11)
https://github.com/LeelaChessZero/lczero/blame/0e32b223c83fbc50d23e0e4c14e201cca8fa68a2/src/UCTNode.cpp#L316

    auto fpu_eval = net_eval - fpu_reduction;

Reduce first play urgency (jkiliani, March 28)
https://github.com/LeelaChessZero/lczero/blame/3f7b6c64c0bcc477916c881800496e757186c7a8/src/UCTNode.cpp#L311

    auto fpu_eval = get_eval(color) - fpu_reduction;

Port UCTNode simplifications from Leela Zero (glinscott, January 12)
https://github.com/LeelaChessZero/lczero/blame/eeb6ea6eff781f66f4fa0f43b0420afb30cf571a/src/UCTNode.cpp#L288

    // If a node has not been visited yet, the eval is that of the parent.
    auto winrate = child->get_eval(color);

Add files via upload (benediamond, December 21)
https://github.com/LeelaChessZero/lczero/blame/e9b2c71050b8da543d205263972070d69eedbc71/src/UCTNode.cpp#L351

    // get_eval() will automatically set first-play-urgency
    float winrate = child->get_eval(color);
    // If a node has not been visited yet, the eval is that of the parent.
    auto eval = m_init_eval;

Videodr0me commented 6 years ago

Similarly, Q=0 means there's a clearer difference in how training works for the losing vs. the winning side, instead of self-play games trying to "do the usual thing." With Q=0, the losing side's purpose is to search wide and find good moves that are hidden, while the winning side's purpose is to reinforce and validate whether a move is actually good.

I am a little sceptical about this argument, especially if you make it about training and believe it does not hold for normal play. Why shouldn't, in normal play (outside training), the losing side's "purpose" be to search wide and the winning side's "purpose" be to "validate" whether a move is actually good? Thus, if this argument held, initializing Q to 0 should also yield Elo in normal play - but it does not.

This is not to say that there is no interaction between FPU, learning, and the ceiling of the fully trained NN. By all means, just try it - maybe you can train a smaller net locally and see what happens? I am curious.

Mardak commented 6 years ago

It's because searching wider is not visit-efficient and leads to suboptimal visit usage, assuming the priors are accurate. When playing a match, the network shouldn't second-guess itself; it should trust that the policy is good.

As mentioned earlier, self-play settings should be different from match settings. Self-play should maximize learning, while match play maximizes rating.
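
A minimal sketch of that separation, assuming a hypothetical parameter struct; every name and value below is a placeholder for illustration, not one of lc0's real options or defaults:

```cpp
// Illustration only: placeholder names and values, not lc0's real parameters.
struct SearchParams {
  float cpuct;
  float fpu_reduction;    // only used when q_init_zero is false
  bool q_init_zero;       // the Q = 0 initialization proposed in this issue
  bool dirichlet_noise;   // exploration noise at the root
  int temperature_moves;  // opening moves sampled proportionally to visits
};

// Self-play: settings chosen to maximize the quality of training data.
constexpr SearchParams kSelfPlay{/*cpuct=*/3.0f, /*fpu_reduction=*/0.0f,
                                 /*q_init_zero=*/true,
                                 /*dirichlet_noise=*/true,
                                 /*temperature_moves=*/30};

// Match play: settings chosen to maximize rating with a fixed network.
constexpr SearchParams kMatch{/*cpuct=*/3.0f, /*fpu_reduction=*/0.2f,
                              /*q_init_zero=*/false,
                              /*dirichlet_noise=*/false,
                              /*temperature_moves=*/0};
```

The point is only that the two parameter sets are tuned for different objectives, so strength measured under match settings says little on its own about training-data quality.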

oscardssmith commented 6 years ago

Close this as evidence suggests this is weaker?

mooskagh commented 5 years ago

Now we have fpu-strategy=absolute. (It turned out that A0 used -1.)
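
In terms of the selection sketch earlier in the thread, a hedged illustration of what the absolute strategy means (again, illustrative names rather than lc0's implementation):

```cpp
// Illustrative: fpu-strategy=absolute gives unvisited children a fixed Q
// instead of one derived from the parent. A value of -1 treats every
// unexplored move as a loss until its prior-driven U term wins out;
// a value of 0 corresponds to the AGZ reading discussed in this issue.
float UnvisitedQ(bool absolute_fpu, float fpu_value, float parent_q,
                 float fpu_reduction) {
  return absolute_fpu ? fpu_value : parent_q - fpu_reduction;
}
```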