LeelaChessZero / lc0

The rewritten engine, originally for tensorflow. Now all other backends have been ported here.
GNU General Public License v3.0

improve training by allowing mixed net training games #785

Open jhorthos opened 5 years ago

jhorthos commented 5 years ago

Tactical blind spots are one of the most common causes of game losses for lc0, yet different training runs appear to have largely distinct blind spots. A prime example is the capture-promotion blind spot developed by test40, which arises commonly in real games and results in sudden and devastating losses. This blind spot has not, to my knowledge, developed in any previous full training run.

I propose to use some play against networks from previous training runs to prevent a current training run from developing new blind spots. Alternatives include 1) allowing the client to play some fraction of matches with the current training net against other nets (probably nets from other training runs), 2) allowing the client to switch from one net to another during a training game, and 3) running normal training matches and using server-side assessment of incoming training games with a different network to find conflicts and resolve them favorably.
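As a rough illustration of option 1 (not actual lc0 client code; the constants and function names below are made up), the change amounts to occasionally assigning a reference net from an older run as the opponent for a training game:

```python
import random

# Hypothetical sketch of option 1: mix in training games against reference
# nets from earlier runs. None of these names exist in the real lc0 client.
MIXED_GAME_FRACTION = 0.1                      # fraction of games against an old net
REFERENCE_NETS = ["11248", "21850", "32930"]   # example IDs of strong nets from prior runs

def pick_opponent(current_net: str) -> str:
    """Return the net the current training net should play against."""
    if random.random() < MIXED_GAME_FRACTION:
        return random.choice(REFERENCE_NETS)    # cross-run game
    return current_net                          # ordinary self-play game

def assign_colors(current_net: str) -> tuple[str, str]:
    """Randomize colors so the training net does not always get white."""
    opponent = pick_opponent(current_net)
    return (current_net, opponent) if random.random() < 0.5 else (opponent, current_net)
```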

Thanks to aart for discussion and suggestions on this issue.

jhorthos commented 5 years ago

Results on tactical complementarity relevant to above.

Formal assessment of complementary tactical skill among training runs

Background: As exemplified by the well-known multiple-queen issue in test30 and the promotion-capture issue in test40, I hypothesize that the best networks from different training runs have significantly different tactical skills and blind spots. I tested this as follows: 1) extracted problems from the jhorthos problem suite where the best move(s) exceeded the next-best move by at least 1.2 pawns according to Stockfish 10 run to 20 billion nodes, to raise confidence that the correct answer is objectively correct; 2) tested problem solving with one selected high-Elo net from each main training run: IDs 11248, 21850, 32930, 41511; 3) tested how often different networks solved different tactical problems.
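For reference, the filtering in step 1) could be reproduced roughly as follows with python-chess and a local Stockfish binary (a sketch under assumed file names and a far smaller node limit than the 20 billion used above, not the actual tooling behind the suite):

```python
import chess
import chess.engine

MARGIN_CP = 120  # best move must beat the runner-up by at least 1.2 pawns

def has_clear_best_move(fen, engine, nodes=10_000_000):
    """True if one move scores at least MARGIN_CP better than every alternative."""
    board = chess.Board(fen)
    # MultiPV=2 returns the two best lines; compare their scores for the side to move.
    infos = engine.analyse(board, chess.engine.Limit(nodes=nodes), multipv=2)
    if len(infos) < 2:
        return False  # only one legal move; not an informative test position
    best = infos[0]["score"].relative.score(mate_score=100_000)
    second = infos[1]["score"].relative.score(mate_score=100_000)
    return best - second >= MARGIN_CP

candidate_fens = open("problem_suite.fen").read().splitlines()  # one FEN per line
engine = chess.engine.SimpleEngine.popen_uci("stockfish")
problems = [fen for fen in candidate_fens if has_clear_best_move(fen, engine)]
engine.quit()
```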

Summary of results: under all test conditions, the number of problems solved by at least one of the 4 nets far exceeded the number solved by any single net. Little of this excess was due to run-to-run variability of a single net, because repeated runs with one net solved nearly the same set of problems each time. The effect grew stronger as the node budget per problem went down.

Conclusion: Networks from different training runs have substantially different tactical skills and blind spots. We can remain zero and take advantage of this in future training runs by blending in some training games played against high Elo networks from previous runs, hopefully preventing tactical blind spots not shared by all of them.

Specific result summaries (RTX 2080):

All tests used the same 669 problems, one test per net (4 tests total). "Delta" is the number of problems solved by at least one net minus the number solved by the best single net.

| Node budget per problem | Correct in at least 1 test | Correct in all 4 | Wrong in at least 1 test | Wrong in all 4 | Best single net | Delta (any minus best) | Wrong in exactly 1 / 2 / 3 / 4 tests |
|---|---|---|---|---|---|---|---|
| 8 sec (about 175K nodes) | 582 | 419 | 250 | 87 | 516 | 66 (9.9%) | 66 / 52 / 45 / 87 |
| 2 sec (about 35K nodes) | 565 | 375 | 294 | 104 | 487 | 78 (11.7%) | 79 / 54 / 57 / 104 |
| 1600 nodes (training parameters) | 487 | 255 | 414 | 182 | 380 | 107 (16.0%) | 74 / 79 / 79 / 182 |
| 800 nodes (training parameters) | 481 | 238 | 431 | 188 | 359 | 122 (18.2%) | 82 / 62 / 99 / 188 |

Notes: The 1600- and 800-node tests always produce the same answers across repeated runs because of the training parameters used (--threads=1 --no-out-of-order-eval etc.). The 2 sec and 8 sec tests showed slight run-to-run differences with a single net. For the 2 sec per problem test, running ID 11248 four times independently, the number of problems solved in at least one run exceeded the mean of the individual runs by 17 (delta 17, versus the cross-net delta of 78 above). For the 8 sec per problem test the corresponding figure for ID 11248 was 13 (delta 13, versus the cross-net delta of 66 above).
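The figures above (correct in at least one test, correct in all, delta versus the best single net, and the wrong-in-exactly-k distribution) reduce to set arithmetic over per-net result lists; a small sketch, assuming each net's results are available as a set of solved problem IDs:

```python
from collections import Counter

def complementarity(solved_by_net, all_problems):
    """Summarize how much a group of nets complement each other on a problem set.

    solved_by_net: dict mapping net ID -> set of problem IDs it solved
    all_problems:  set of all problem IDs in the suite
    """
    union = set().union(*solved_by_net.values())               # solved by at least one net
    intersection = set.intersection(*solved_by_net.values())   # solved by every net
    best_single = max(len(s) for s in solved_by_net.values())
    delta = len(union) - best_single                           # gain from pooling the nets

    # For each problem, count how many nets missed it; keep only problems missed at least once.
    wrong_counts = Counter(
        sum(p not in s for s in solved_by_net.values()) for p in all_problems
    )
    wrong_counts.pop(0, None)

    return {
        "correct_in_at_least_one": len(union),
        "correct_in_all": len(intersection),
        "best_single_net": best_single,
        "delta": delta,
        "delta_pct": 100.0 * delta / len(all_problems),
        "wrong_in_exactly_k_tests": dict(sorted(wrong_counts.items())),
    }
```

For the 8-second run above, this would report delta = 582 - 516 = 66, i.e. 9.9% of the 669 problems.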

jhorthos commented 5 years ago

I would love to hear from someone who can help with code changes to client.exe that implement option 1 above (the simplest and probably best option). The main thing that needs to be done is to allow a training game to be played between 2 different nets. I will try to work on it, but my C++ skills are weak at this time.

webreh commented 5 years ago

jhorthos, can you run the same tactics tests for networks within the same run? Something like the best test40 net at the end of each week. It would be interesting to see which tactical abilities stay intact and which change rapidly.

jhorthos commented 5 years ago

webreh - already done. see !sheet3 from discord, test40 tactics tab

mooskagh commented 5 years ago

Weren't the capture-promotion problems due to the gamma normalization bug, rather than to training positions being too sparse?

oscardssmith commented 5 years ago

We don't know what caused the capture-promotion issues. There's no reason to suspect gamma.

mooskagh commented 5 years ago

I think it was confirmed to be the gamma problem (there was no regularization or something like that), and that was already fixed; see #784.

tjr1 commented 5 years ago

A similar idea is to simultaneously train a small "pool" of Leelas (maybe 3 or 4). Each would be "zero," just starting from different initial random weights. They would develop strategies against the blind spots of the other nets. They would play each other as well as do self-play. The intuition is that an agent can't easily exploit a blind spot during self-play if it has the same blind spot, so a diverse pool would give a stronger learning signal due to the competition.

At the end, a meta net could be bootstrapped from the pool. There is ongoing research on how to combine networks to improve generalization, so having a pool of 3 or 4 Leelas wouldn't be a waste in the end.

Multiple on-policy learners might learn faster than a single learner. It would require more training and more games, but it would be an interesting way to add robustness and generalization.

A minor detail: each net would receive as input the identity of the net it is playing against, so that it can develop strategies specific to each competitor and continue to exploit the other's blind spots until they are fixed.
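A rough sketch of how the pool pairing and the opponent-identity input could look (hypothetical pseudocode, not tied to the actual training pipeline):

```python
import random

POOL = ["net_A", "net_B", "net_C", "net_D"]  # hypothetical pool of independently seeded nets
SELF_PLAY_FRACTION = 0.5                     # keep a share of ordinary self-play games

def schedule_game():
    """Pick the two nets for the next training game."""
    learner = random.choice(POOL)
    if random.random() < SELF_PLAY_FRACTION:
        opponent = learner                                           # standard self-play
    else:
        opponent = random.choice([n for n in POOL if n != learner])  # cross-play
    return learner, opponent

def opponent_id_plane(opponent):
    """One-hot encoding of the opponent's identity, fed to the net as an extra input."""
    return [1.0 if n == opponent else 0.0 for n in POOL]
```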

In the end, ideas are cheap, dev time is expensive, and I know that adding to the pile of ideas is just that.

Naphthalin commented 4 years ago

In the meantime we used multinet as an approximation of this idea in T58, which didn't produce any gains, only a 1.5x slowdown from not being able to reuse the NN cache as efficiently (and it would have been a 3x slowdown if we had only created training data for the new net). As the "full" feature would be training against an old net, we would incur at least this 3x resource cost, or more.

@jhorthos does this issue address anything which isn't answered by having tried multi-net in training, or can it be closed?

webreh commented 4 years ago

@Naphthalin has anyone ever tried to train a NN to predict the match result between two players? Would Elo be accurate enough, or can strength in different aspects of the game be detected? I don't think one can simply take a small random pool of strategies playing against each other and expect any small progress to persist (evolutionary biology suggests you need something like at least 1000 individuals and gene mixing), but something like 'artificial selection' and 'playing against your weaknesses' may work.

Naphthalin commented 1 year ago

@jhorthos Does borg's action replay PR #1682 address everything you mention in your issue? It obviously allows using games between two different nets as it allows arbitrary PGNs for replay.