Mardak closed this issue 5 years ago.
> Averaging this training data for the move across 50 games should cause P to move towards 16.3%
I think this is probably good enough. As it moves towards 16.3% it will accelerate and move up even faster.
Also, generally I think we should not make any of these sorts of changes that try to improve on the paper until after we fix things that are probably wrong, such as rule50.
> As it moves towards 16.3% it will accelerate and move up even faster.
Yes, in fact, from all the runs, once the prior for this particular move reaches P: 1.28%, it'll start driving at least 100 of 800 visits towards it. Similarly, once it gets to P: 2.25%, over 700 visits will go to it, for nearly 90% average tactic training. I.e., the networks are not yet in a virtuous cycle of self-learning for this tactic.
However, if you look at the data showing the priors for this move across the various network ids, the prior has stayed around 0.3%. That same data shows that nearly all those networks would have put over 700 visits into the move if search had initially given it just 2 visits.
That means even with the current "16.3% average tactic training," there is far more training data driving the prior towards 0.3% than towards anything higher. In other words, across 250 network generations, the existing "16.3%" noise has been unable to get the network to learn this tactic, when 2 initial visits would have.
The new network prior approaches `((16.3% * number of similar board states) + (0% * other board states)) / total board states`. If someone has any suggestions for how to measure the number of these learnable board states, that would be great. (Although I doubt we would do anything proactive to increase the number of similar board states, we might do something to increase the 16.3%.)
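That dilution can be sketched with a toy calculation (not lc0 code; the 1-in-1000 similar-state ratio below is a made-up illustration):

```python
# Toy model of how the policy target for this move dilutes across training data.
# Assumptions (from the analysis above): games reaching the tactic average a
# 16.3% policy target for the move; all other positions target ~0% for it.

def diluted_target(tactic_target, similar_states, total_states):
    """Average policy target over all training positions for this move:
    ((tactic_target * similar) + (0 * others)) / total."""
    return tactic_target * similar_states / total_states

# If only 1 in 1000 training positions resembles this board state (made-up
# ratio), the overall pull on the prior is tiny:
print(diluted_target(0.163, 1, 1000))
```

So even a healthy per-game training signal can be swamped unless either the 16.3% goes up or similar positions show up more often in training.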
@ASilver requested running with 3.1 PUCT, and I see that the latest lc0 master https://github.com/LeelaChessZero/lc0/commit/2321011913d8a7914d8177e82ceaf34fbe2d6ee8 uses that and gets 24.9%. The earlier runs were against then-next https://github.com/LeelaChessZero/lc0/commit/50542694af7c5e50d8c4d5a60f57a54d9247cf88 with 16.3% average tactic training.
Here's a graph of testing various PUCT at noised root:
```diff
diff --git a/src/mcts/search.cc b/src/mcts/search.cc
--- a/src/mcts/search.cc
+++ b/src/mcts/search.cc
@@ -677 +677 @@ std::pair<Node*, bool> Search::PickNodeToExtend(Node* node,
- float factor = kCpuct * std::sqrt(std::max(node->GetChildrenVisits(), 1u));
+ float factor = (is_root_node && kNoise ? 3.5f : kCpuct) * std::sqrt(std::max(node->GetChildrenVisits(), 1u));
```
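For intuition, here's a sketch (not the actual lc0 code) of the AlphaZero-style exploration term this patch scales, assuming the selection bonus is the patched `factor` times P / (1 + child visits):

```python
import math

def u_term(cpuct, prior, parent_visits, child_visits):
    # Exploration bonus: cpuct * P * sqrt(parent visits) / (1 + child visits),
    # i.e. the patched `factor` multiplied by P / (1 + n).
    return cpuct * prior * math.sqrt(max(parent_visits, 1)) / (1 + child_visits)

# A 0.3% prior move with no visits yet, partway into an 800-visit search:
low = u_term(1.2, 0.003, 800, 0)    # old root cpuct
high = u_term(3.5, 0.003, 800, 0)   # patched root cpuct under noise
print(low, high)
```

The bonus scales linearly with cpuct, so the 3.5 root value gives roughly 3x the exploration pressure of 1.2 for the same low-prior move.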
Here's some more analysis on other board states from https://github.com/glinscott/leela-chess/issues/698#issuecomment-393666516:
Here's the "average tactic training" for various engine configurations and board states:
config | SCTR/359 | 359/Wasp | 351/EXch | 351/Hakk | iCE/351 | 351/Bobc |
---|---|---|---|---|---|---|
root PUCT 1.2 | 19.9% | 12.7% | 38.9% | 57.6% | 32.8% | 40.4% |
default | 24.9% | 18.8% | 28.6% | 63.3% | 42.7% | 44.3% |
ε 0.5 | 31.5% | 38.1% | 49.1% | 56.9% | 46.7% | 51.0% |
α 3.0 | 52.8% | 35.4% | 77.2% | 75.1% | 71.7% | 78.3% |
ε 0.5, α 3.0 | 69.9% | 65.0% | 88.8% | 77.4% | 79.4% | 74.6% |
twice | 86.0% | 13.0% | 87.3% | 74.2% | 83.5% | 81.5% |
These are all tested with current master https://github.com/LeelaChessZero/lc0/commit/2321011913d8a7914d8177e82ceaf34fbe2d6ee8 where default is PUCT 3.1, α 0.3, ε 0.25, no twice visits. The patches for root PUCT and twice visits are in earlier comments, and the noise changes are just changes to the `ApplyDirichletNoise(node, 0.25, 0.3)` call.
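For reference, a minimal sketch of what that Dirichlet-noise mixing does conceptually (the gamma-sampling trick here stands in for however lc0 draws the Dirichlet sample internally):

```python
import random

def apply_dirichlet_noise(priors, eps=0.25, alpha=0.3):
    """Mix Dirichlet(alpha) noise into root priors: p' = (1 - eps) * p + eps * d.
    Conceptual stand-in for the ApplyDirichletNoise(node, 0.25, 0.3) call."""
    gammas = [random.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(gammas) or 1.0
    return [(1 - eps) * p + eps * g / total for p, g in zip(priors, gammas)]

random.seed(0)
priors = [0.50, 0.30, 0.17, 0.003, 0.027]   # a 0.3% tactic move hidden among others
noised = apply_dirichlet_noise(priors)
print(noised)
```

With α as low as 0.3 the noise is spiky: occasionally most of the ε = 0.25 mass lands on a single move, lifting even a 0.3% prior enough to draw visits. Raising ε mixes in more noise mass overall, while raising α spreads it more evenly across moves, which is consistent with the ε 0.5, α 3.0 rows reliably producing more tactic training in the table above.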
I used the networks listed (id351 or id359) because, notably, the latest id369 has already learned the tactic from the Hakkapeliitta game, where with PUCT 1.2 (lczero 0.6) an average tactic training of 57.6% was enough to learn it.
Here are the priors for each of the expected moves:
game | old prior | id369 prior |
---|---|---|
SCTR | 0.33% | 0.43% |
Wasp | 0.23% | 0.19% |
EXchess | 0.03% | 0.02% |
Hakkapeliitta | 0.32% | 50.61% |
iCE | 0.08% | 0.12% |
Bobcat | 0.14% | 0.32% |
So it looks like noise is indeed working, and with the PUCT change to 3.1, there will be less need for additional changes to improve tactics training.
I analyzed all the CCLS id359 games to find low-prior moves that were played, to see if the same network would have found them with noise. Here's a first set of moves the other engines played that the network didn't really consider at all. Not all have major swings in win rate or even change the outcome, but at least in this first one against Houdini, lczero thought it had a 63% win rate, yet after the 0.15%-prior move, it's actually the opponent with a 75% win rate.
For each unexpectedly played move: a screenshot, the UCI position, what id359 thought of it, and the top alternative moves when forced to explore it with at least 10 of 800 visits:
And the same analysis as before, with 50 noised games from the above board states to calculate the average training:
config | Houdini | Naum | Scorpio | Protector | Vajolet | Cheng |
---|---|---|---|---|---|---|
root PUCT 1.2 | 6.8% | 2.2% | 4.5% | 1.3% | 0.3% | 19.9% |
default | 8.2% | 21.3% | 20.1% | 0.6% | 4.7% | 15.8% |
ε 0.5 | 19.6% | 36.5% | 23.2% | 4.8% | 9.5% | 33.5% |
α 3.0 | 1.3% | 17.2% | 26.5% | 0.3% | 0.3% | 38.0% |
ε 0.5, α 3.0 | 10.0% | 53.4% | 57.6% | 1.0% | 5.2% | 55.5% |
twice | 7.6% | 10.5% | 57.0% | 2.8% | 1.9% | 32.2% |
I rebased @ASilver's params from #46 onto https://github.com/LeelaChessZero/lc0/commit/2321011913d8a7914d8177e82ceaf34fbe2d6ee8 where I did the earlier tests. Analyzing the same 12 games from earlier with the same networks, each with 50 runs of noise, the average training goes up quite a bit. This is most likely from softmax, as it boosts the policy substantially in each of these cases, where the usual priors are much lower. I've included the training numbers and move prior for default and with the adjusted params:
game | default training | adjusted training | default prior | adjusted prior |
---|---|---|---|---|
SCTR | 24.9% | 66.9% | 0.33% | 1.81% (5.5x) |
Wasp | 18.8% | 77.2% | 0.23% | 1.68% (7.3x) |
EXchess | 28.6% | 81.0% | 0.03% | 0.60% (20x) |
Hakkapeliitta | 78.9% | 78.7% | 0.32% | 2.14% (6.7x) |
iCE | 42.7% | 77.7% | 0.08% | 0.97% (12x) |
Bobcat | 44.3% | 75.5% | 0.14% | 3.03% (22x) |
Houdini | 8.2% | 19.5% | 0.15% | 1.02% (6.8x) |
Naum | 21.3% | 39.0% | 0.05% | 0.87% (17x) |
Scorpio | 20.1% | 50.3% | 0.07% | 1.09% (16x) |
Protector | 0.6% | 5.0% | 0.10% | 0.86% (8.6x) |
Vajolet | 4.7% | 14.8% | 0.17% | 1.32% (7.8x) |
Cheng | 15.8% | 53.0% | 0.01% | 0.38% (38x) |
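The softmax effect on low priors can be sketched as follows (policy-softmax-temperature flattening; the priors here are made up, with one tiny prior standing in for the tactic moves):

```python
def softmax_temp(priors, t):
    """Flatten (t > 1) or sharpen (t < 1) a policy: p_i' ∝ p_i ** (1/t).
    Flattening compresses the range, so tiny priors gain much larger
    multipliers than large ones -- as in the table, where the sub-0.1%
    priors gained the biggest multiples (20x-38x)."""
    powered = [p ** (1.0 / t) for p in priors]
    total = sum(powered)
    return [p / total for p in powered]

priors = [0.60, 0.25, 0.10, 0.0497, 0.0003]
flattened = softmax_temp(priors, 2.0)
print([round(p, 4) for p in flattened])
# the 0.03% move gains a far larger multiple than the 60% move loses
```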
As @killerducky pointed out earlier, we probably shouldn't touch the training until other things are fixed, so these numbers are at least reassuring: if the network had never become so biased against these moves to begin with, it would find the correct moves fairly naturally with the default noise settings.
In terms of network progression, even with a clean start, priors could end up very low like these 0.0x% numbers because the value head hasn't yet learned to favor a position, so training search would give it fewer visits. But I suppose that could be revisited later if training seems stuck again and failing to generate useful data.
There was a request to check the latest lc0 test id14 4ce96dba to see what it thought of each of these games. It looks like the network already avoids most of these moves except in a couple of games, and has trouble generating training data to increase the prior for the move. Below are the average training using default noise as well as the priors for the move:
game | training | prior |
---|---|---|
SCTR | 10.8% | 0.06% |
Wasp | 8.2% | 0.14% |
EXchess | 2.0% | 0.04% |
Hakkapeliitta | 98.8% | 94.69% |
iCE | 23.9% | 0.27% |
Bobcat | 92.6% | 2.89% |
Houdini | 5.5% | 0.20% |
Naum | 16.4% | 0.07% |
Scorpio | 1.2% | 0.07% |
Protector | 0.2% | 0.07% |
Vajolet | 1.6% | 0.16% |
Cheng | 0.8% | 0.45% |
Here's a check to see if the network would have found the move if forced to explore at least 20 visits out of 2000:
sctr Rxh4 -> 1529 ( 76.45%) (V: 74.10%) (N: 0.06%) PV: Rxh4 Nf4 Rg4 Ng6 Rc4 Rxc4 Bxc4 Kd7 Bf7 Ne5 Bd5 Kd6 Bb3 Nc6 Bf7 Ne5 Bh5 Ke6 Bg7
wasp Rxe3 -> 531 ( 26.54%) (V: 57.76%) (N: 0.14%) PV: Rxe3 Qxc3 Rxc3 hxg7 Nxc4 bxc4 Kxg7 Be3 Rxc4 Bd5 Ra4 Rc1 c5 Bb3 Rab4 Bd5 Ra4 Bb3
exchess Nf6 -> 1169 ( 58.45%) (V: 68.48%) (N: 0.04%) PV: Nf6 Rxd6 Qxf3 Bxf6 Kxf6 axb4 Nf5 Rd7 Bxb4 h5 Rc2 Qg6+ Ke5 Rf1 Bc5 g4 Nxe3 Qxg7+ Ke4
hakkapeliitta Bxd1 -> 1456 ( 72.76%) (V: 61.50%) (N: 94.69%) PV: Bxd1 Rxd1 Kf3 Rd3 Kg2 Ke2 Kxh2 Kf2 Kh1 Rd1+ Kh2 Kf3 Rbb8 g4 fxg4+ Kxg4 Kg2 Rd2+ Kf1 Kxh3 Ke1 Kg4 Rb6 Kg5 Rg8 Kf6 Rf8+ Ke7 Rc8
ice Bc6+ -> 1392 ( 69.60%) (V: 68.57%) (N: 0.27%) PV: Bc6+ Kxc6 Rxf4 Rb8 Nc2 e5 Rxh4 Rb3 Rh3 Kd5 g4 Rb1+ Re1 Rxe1+ Nxe1 c4 dxc4+ Kxc4 g5
bobcat Bxh6 -> 1594 ( 79.70%) (V: 65.11%) (N: 2.89%) PV: Bxh6 c4 bxc4 Bb4 Be3 Rd7+ Ke4 Rxd3 Kxd3 Bxa5 h6 Kf7 c5 Kg8 Kc4 Kh7 Kd5 Bd8 Kd6
houdini Qxe7+ -> 1254 ( 62.67%) (V: 76.36%) (N: 0.20%) PV: Qxe7+ Qxe7 Rxe7 Kxe7 Nxd5+ Kd6 Rxc4 Kxd5 Rf4 Ke5 g3 g5 Rf7 Ke6 Bxg4+ Kxf7 Bxd7 Ke7 Bxh3 Kf6 Bc8 Bd4 Bxb7
naum Qxd5 -> 1176 ( 58.80%) (V: 54.99%) (N: 0.07%) PV: Qxd5 dxe8=Q Nxf3+ Kg2 Nxd2+ cxd5 Rxe8 Rd1 Nc4 Kf3 Nd6 Kxf4 Re2 a4 c4 Ne3 Rxf2+ Ke5 Rf6 Kd4
scorpio Rexe7 -> 1226 ( 61.21%) (V: 64.01%) (N: 0.07%) PV: Rexe7 Rxe7 Rxd8+ Kg7 d6 Re1+ Kh2 Qc1 Rd7 Rh1+ Kg3 Qg5+ Qg4 Qe5+ Qf4 Qxc3+ Kh4 Qf6+ Qxf6+ Kxf6 Ra7 Rd1 d7 Rd4+ Kg3 a4 Kf3 Ke7 Ke3 Rxd7
protector Nf5+ -> 782 ( 39.08%) (V: 54.66%) (N: 0.07%) PV: Nf5+ Kf8 Nd6 Nc6 f4 Bc5+ Kh1 Bxd6 exd6 Bd7 Rad1 Rd8 Bb5 a6 Bxc6 Bxc6 f5
vajolet Nxg4 -> 40 ( 1.58%) (V: 40.04%) (N: 0.16%) PV: Nxg4 Bxg4 Qxb5 f6 Bxf6 Bxc8 Rxc8 Qg2 Nd7 Qh3 Rc2 Qe6+ Kh8 Qe8+ Nf8
cheng Bf4+ -> 708 ( 35.40%) (V: 48.03%) (N: 0.45%) PV: Bf4+ Kb1 Qg4 Rhg1 Qf5 Ne2 Nd5 Nd4 Qe5 Nf3 Qf5 Nh4 Qe6 Nc5 Qd6 Ne4 Qc6
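The forced-exploration probe used above can be approximated with a toy one-ply PUCT search (not lc0 code; the fixed `values` below stand in for each move's average backed-up evaluation):

```python
import math

def search_with_floor(priors, values, budget=800, floor=10, cpuct=3.1):
    """Toy one-ply PUCT search: every move is first forced up to `floor`
    visits, then selection maximizes Q + U as usual. `values` are fixed
    stand-ins for the average backed-up evaluation of each subtree."""
    visits = [0] * len(priors)
    for _ in range(budget):
        pending = [i for i, n in enumerate(visits) if n < floor]
        if pending:
            i = pending[0]                      # pay the visit floor first
        else:
            parent = sum(visits)
            i = max(range(len(priors)),
                    key=lambda j: values[j] + cpuct * priors[j]
                    * math.sqrt(parent) / (1 + visits[j]))
        visits[i] += 1
    return visits

# A 0.15%-prior tactic (last move) whose subtree actually evaluates well:
# once the floor pays for the first visits, its Q keeps attracting search.
print(search_with_floor([0.60, 0.25, 0.1485, 0.0015], [0.05, 0.0, -0.1, 0.6]))
```

Once the floor has paid for the first few visits, a move with a clearly better evaluation keeps winning the Q+U comparison and soaks up most of the remaining budget, which is what the forced 10-of-800 and 20-of-2000 probes show.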
So the network, with its value and policy, could generate valuable training data in almost all the cases if search were biased the right way. It's unclear if this is a limitation of the size of the network.
I'm not sure yet whether it's concerning that the network generates quite a bit less average training compared to the default training in the previous comment.
Here's a look at the prior progression for moves from Scorpio vs id359 Round 2 from https://github.com/LeelaChessZero/lc0/issues/8#issuecomment-394565160 using the lc0 test networks:
For reference, here's the board and moves id359 would have found when forced:
position startpos moves g1f3 d7d5 e2e3 c7c5 d2d4 g8f6 c2c4 c5d4 e3d4 b8c6 c4d5 f6d5 b1c3 g7g6 f1c4 d5b6 c4b3 f8g7 e1g1 e8g8 d4d5 c6a5 f1e1 a5b3 a2b3 c8g4 h2h3 g4f3 d1f3 f8e8 c1e3 g7c3 e3b6 d8b6 b2c3 b6b3 a1b1 b3a3 b1b7 a7a5 b7d7 a8d8
info string Rb7 -> 11 ( 1.37%) (V: 46.50%) (N: 10.25%) PV: Rb7 Qc5 Rd1 Rb8 Rdb1 Rxb7
info string Rxd8 -> 13 ( 1.62%) (V: 41.82%) (N: 15.61%) PV: Rxd8 Rxd8 Qe3 Rxd5 c4 Qxe3 Rxe3
info string Rc7 -> 26 ( 3.25%) (V: 47.32%) (N: 21.27%) PV: Rc7 a4 Qe4 Ra8 c4 Qc3
info string Ra7 -> 44 ( 5.49%) (V: 41.85%) (N: 48.93%) PV: Ra7 Qc5 Qe3 Rxd5 Qxc5 Rxc5 Rexe7 Rxe7 Rxe7 Rxc3 Ra7 Ra3 g3 a4 Kg2 Kg7
info string Rexe7 -> 360 ( 44.94%) (V: 61.59%) (N: 0.07%) PV: Rexe7 Qxe7 Rxe7 Rxe7 c4 Rc7 Qf6 Rcc8 d6 a4 Qd4 a3 c5
I analyzed id396 from @scs-ben compared to id395 for these positions, and some have significantly better average tactics training even though the same training data resulted in similar priors for the correct move for both networks. The main difference is that the value evaluation for the expected move is generally favorable, which drives more visits during search.
The Scorpio game in the previous comment shows the new network believes the move is winning (V: 3.05%) whereas the previous network thinks it's losing (V: -38.69%), so even though the priors are very low at around 0.2% for both, id396's value results in 60.3% training, up from 18.0%!
@killerducky is this change in value expected from 50-normalization? It should definitely help generate better training data in some cases! 👍
game | 395 train | 395 V | 395 P | 396 train | 396 V | 396 P |
---|---|---|---|---|---|---|
sctr | 19.2% | -46.53% | 0.33% | 43.2% | -4.32% | 0.91% |
wasp | 37.9% | 38.19% | 0.16% | 69.5% | 13.13% | 2.53% |
exchess | 33.8% | -10.42% | 0.58% | 39.4% | 22.87% | 0.22% |
hakka | 56.0% | -25.54% | 6.09% | 100.0% | -3.53% | 69.51% |
ice | 34.9% | -34.94% | 0.20% | 50.0% | -17.78% | 0.45% |
bobcat | 52.3% | 43.05% | 0.75% | 56.2% | 39.69% | 0.26% |
houdini | 6.9% | -53.02% | 1.02% | 3.2% | -51.08% | 0.29% |
naum | 17.1% | -84.85% | 0.11% | 12.1% | -79.72% | 0.63% |
scorpio | 18.0% | -38.69% | 0.15% | 60.3% | 3.05% | 0.26% |
protector | 4.3% | -44.56% | 1.89% | 3.0% | -43.12% | 0.55% |
vajolet2 | 1.3% | -67.63% | 2.56% | 9.7% | -55.49% | 1.69% |
cheng | 20.8% | -44.04% | 1.78% | 4.2% | -59.30% | 8.54% |
For reference in the Scorpio game, here's id396 V for each move showing two moves have positive V:
info string f3f5 (589 ) N: 2 (+ 0) (V: -89.32%) (P: 1.01%) (Q: -0.91407) (U: 0.30353) (Q+U: -0.61053)
info string e1e3 (111 ) N: 2 (+ 0) (V: -64.44%) (P: 0.90%) (Q: -0.72748) (U: 0.27000) (Q+U: -0.45748)
info string e1e4 (115 ) N: 2 (+ 0) (V: -66.91%) (P: 0.37%) (Q: -0.73500) (U: 0.11179) (Q+U: -0.62322)
info string d7e7 (1480) N: 2 (+ 0) (V: -14.48%) (P: 1.97%) (Q: -0.42458) (U: 0.59003) (Q+U: 0.16545)
info string d7d6 (1474) N: 2 (+ 0) (V: -83.71%) (P: 0.11%) (Q: -0.83561) (U: 0.03143) (Q+U: -0.80418)
info string e1e5 (118 ) N: 2 (+ 0) (V: -67.00%) (P: 0.48%) (Q: -0.73867) (U: 0.14478) (Q+U: -0.59389)
info string d5d6 (1007) N: 2 (+ 0) (V: -46.27%) (P: 0.80%) (Q: -0.59301) (U: 0.23789) (Q+U: -0.35512)
info string e1e6 (119 ) N: 2 (+ 0) (V: -71.98%) (P: 1.23%) (Q: -0.77920) (U: 0.36941) (Q+U: -0.40979)
info string f3d1 (565 ) N: 2 (+ 0) (V: -71.51%) (P: 0.25%) (Q: -0.74729) (U: 0.07373) (Q+U: -0.67356)
info string f3e2 (571 ) N: 2 (+ 0) (V: -62.67%) (P: 2.14%) (Q: -0.69184) (U: 0.63923) (Q+U: -0.05261)
info string f3e4 (583 ) N: 2 (+ 0) (V: -70.92%) (P: 0.21%) (Q: -0.75161) (U: 0.06385) (Q+U: -0.68776)
info string f3h5 (591 ) N: 2 (+ 0) (V: -90.01%) (P: 0.20%) (Q: -0.91669) (U: 0.06100) (Q+U: -0.85569)
info string e1f1 (101 ) N: 2 (+ 0) (V: -68.73%) (P: 0.14%) (Q: -0.74611) (U: 0.04294) (Q+U: -0.70317)
info string f3d3 (578 ) N: 2 (+ 0) (V: -71.21%) (P: 0.08%) (Q: -0.75146) (U: 0.02537) (Q+U: -0.72608)
info string f3e3 (579 ) N: 2 (+ 0) (V: -56.49%) (P: 0.21%) (Q: -0.66993) (U: 0.06261) (Q+U: -0.60732)
info string f3g3 (580 ) N: 2 (+ 0) (V: -67.80%) (P: 0.20%) (Q: -0.73818) (U: 0.05993) (Q+U: -0.67825)
info string f3f7 (595 ) N: 2 (+ 0) (V: -59.11%) (P: 0.13%) (Q: -0.75464) (U: 0.03975) (Q+U: -0.71488)
info string f3f6 (593 ) N: 2 (+ 0) (V: -79.17%) (P: 0.60%) (Q: -0.86619) (U: 0.17863) (Q+U: -0.68756)
info string e1d1 (100 ) N: 2 (+ 0) (V: -66.71%) (P: 0.14%) (Q: -0.73520) (U: 0.04126) (Q+U: -0.69394)
info string f3f4 (584 ) N: 2 (+ 0) (V: -68.33%) (P: 0.42%) (Q: -0.74719) (U: 0.12697) (Q+U: -0.62022)
info string c3c4 (485 ) N: 2 (+ 0) (V: -38.13%) (P: 2.11%) (Q: -0.54246) (U: 0.63084) (Q+U: 0.08838)
info string g2g4 (378 ) N: 2 (+ 0) (V: -70.85%) (P: 0.25%) (Q: -0.76396) (U: 0.07510) (Q+U: -0.68886)
info string g2g3 (374 ) N: 2 (+ 0) (V: -73.14%) (P: 0.45%) (Q: -0.77798) (U: 0.13459) (Q+U: -0.64339)
info string g1h2 (157 ) N: 2 (+ 0) (V: -70.19%) (P: 0.08%) (Q: -0.76420) (U: 0.02368) (Q+U: -0.74052)
info string g1h1 (153 ) N: 2 (+ 0) (V: -70.44%) (P: 1.54%) (Q: -0.76120) (U: 0.45974) (Q+U: -0.30146)
info string g1f1 (152 ) N: 2 (+ 0) (V: -72.05%) (P: 1.11%) (Q: -0.77256) (U: 0.33322) (Q+U: -0.43934)
info string e1a1 (97 ) N: 2 (+ 0) (V: -74.85%) (P: 0.07%) (Q: -0.82456) (U: 0.02079) (Q+U: -0.80377)
info string e1b1 (98 ) N: 2 (+ 0) (V: -65.31%) (P: 0.09%) (Q: -0.72224) (U: 0.02801) (Q+U: -0.69423)
info string e1c1 (99 ) N: 2 (+ 0) (V: -80.15%) (P: 0.12%) (Q: -0.85486) (U: 0.03508) (Q+U: -0.81978)
info string e1e2 (106 ) N: 3 (+ 0) (V: -71.59%) (P: 4.33%) (Q: -0.76985) (U: 0.97152) (Q+U: 0.20167)
info string h3h4 (642 ) N: 4 (+ 0) (V: -68.01%) (P: 5.55%) (Q: -0.75801) (U: 0.99557) (Q+U: 0.23756)
info string f3g4 (585 ) N: 9 (+ 1) (V: -14.25%) (P: 3.15%) (Q: 0.00519) (U: 0.25699) (Q+U: 0.26218)
info string d7d8 (1486) N: 22 (+ 1) (V: 13.70%) (P: 9.31%) (Q: -0.09348) (U: 0.34820) (Q+U: 0.25472)
info string d7c7 (1479) N: 36 (+ 2) (V: -8.47%) (P: 16.93%) (Q: -0.14014) (U: 0.38951) (Q+U: 0.24937)
info string d7b7 (1478) N: 51 (+ 0) (V: -4.70%) (P: 13.46%) (Q: -0.04031) (U: 0.23233) (Q+U: 0.19203)
info string d7a7 (1477) N: 63 (+ 4) (V: -10.19%) (P: 29.58%) (Q: -0.12718) (U: 0.39032) (Q+U: 0.26314)
info string e1e7 (120 ) N: 592 (+118) (V: 3.05%) (P: 0.26%) (Q: 0.23203) (U: 0.00033) (Q+U: 0.23236)
Whereas with id395, only one move has positive V:
info string f3f5 (589 ) N: 2 (+ 0) (V: -94.07%) (P: 0.09%) (Q: -0.94612) (U: 0.02797) (Q+U: -0.91815)
info string e1e3 (111 ) N: 2 (+ 0) (V: -68.33%) (P: 0.12%) (Q: -0.74048) (U: 0.03665) (Q+U: -0.70383)
info string e1e4 (115 ) N: 2 (+ 0) (V: -72.54%) (P: 0.03%) (Q: -0.76590) (U: 0.00885) (Q+U: -0.75705)
info string d7e7 (1480) N: 2 (+ 0) (V: -46.42%) (P: 0.31%) (Q: -0.61692) (U: 0.09210) (Q+U: -0.52482)
info string d7d6 (1474) N: 2 (+ 0) (V: -87.56%) (P: 0.06%) (Q: -0.86254) (U: 0.01934) (Q+U: -0.84320)
info string e1e5 (118 ) N: 2 (+ 0) (V: -72.77%) (P: 0.21%) (Q: -0.77803) (U: 0.06221) (Q+U: -0.71582)
info string d5d6 (1007) N: 2 (+ 0) (V: -41.78%) (P: 0.30%) (Q: -0.59649) (U: 0.09057) (Q+U: -0.50592)
info string h3h4 (642 ) N: 2 (+ 0) (V: -67.54%) (P: 1.28%) (Q: -0.72365) (U: 0.38191) (Q+U: -0.34174)
info string f3d1 (565 ) N: 2 (+ 0) (V: -76.02%) (P: 0.53%) (Q: -0.79143) (U: 0.15995) (Q+U: -0.63148)
info string f3e2 (571 ) N: 2 (+ 0) (V: -70.15%) (P: 0.11%) (Q: -0.75128) (U: 0.03408) (Q+U: -0.71720)
info string f3e4 (583 ) N: 2 (+ 0) (V: -67.89%) (P: 0.27%) (Q: -0.74062) (U: 0.07949) (Q+U: -0.66113)
info string f3h5 (591 ) N: 2 (+ 0) (V: -88.92%) (P: 0.19%) (Q: -0.91880) (U: 0.05676) (Q+U: -0.86204)
info string e1e6 (119 ) N: 2 (+ 0) (V: -72.22%) (P: 0.23%) (Q: -0.77462) (U: 0.06997) (Q+U: -0.70466)
info string f3d3 (578 ) N: 2 (+ 0) (V: -74.92%) (P: 1.75%) (Q: -0.78349) (U: 0.52235) (Q+U: -0.26115)
info string e1d1 (100 ) N: 2 (+ 0) (V: -69.40%) (P: 0.10%) (Q: -0.75805) (U: 0.02927) (Q+U: -0.72878)
info string f3g3 (580 ) N: 2 (+ 0) (V: -76.35%) (P: 0.16%) (Q: -0.78368) (U: 0.04775) (Q+U: -0.73594)
info string f3f7 (595 ) N: 2 (+ 0) (V: -73.75%) (P: 0.60%) (Q: -0.83331) (U: 0.17828) (Q+U: -0.65503)
info string f3f6 (593 ) N: 2 (+ 0) (V: -79.65%) (P: 0.07%) (Q: -0.87256) (U: 0.02183) (Q+U: -0.85074)
info string e1e2 (106 ) N: 2 (+ 0) (V: -72.63%) (P: 0.45%) (Q: -0.76863) (U: 0.13469) (Q+U: -0.63394)
info string f3f4 (584 ) N: 2 (+ 0) (V: -67.04%) (P: 0.16%) (Q: -0.72483) (U: 0.04779) (Q+U: -0.67704)
info string e1c1 (99 ) N: 2 (+ 0) (V: -89.23%) (P: 0.78%) (Q: -0.91270) (U: 0.23458) (Q+U: -0.67812)
info string g2g4 (378 ) N: 2 (+ 0) (V: -69.55%) (P: 0.09%) (Q: -0.76088) (U: 0.02782) (Q+U: -0.73306)
info string g2g3 (374 ) N: 2 (+ 0) (V: -74.27%) (P: 0.38%) (Q: -0.78336) (U: 0.11372) (Q+U: -0.66964)
info string g1h2 (157 ) N: 2 (+ 0) (V: -75.21%) (P: 0.06%) (Q: -0.78616) (U: 0.01766) (Q+U: -0.76850)
info string g1h1 (153 ) N: 2 (+ 0) (V: -75.01%) (P: 0.60%) (Q: -0.78633) (U: 0.17805) (Q+U: -0.60828)
info string g1f1 (152 ) N: 2 (+ 0) (V: -75.11%) (P: 1.99%) (Q: -0.79003) (U: 0.59606) (Q+U: -0.19397)
info string e1a1 (97 ) N: 2 (+ 0) (V: -81.70%) (P: 3.31%) (Q: -0.87187) (U: 0.98900) (Q+U: 0.11714)
info string e1b1 (98 ) N: 2 (+ 0) (V: -69.44%) (P: 0.52%) (Q: -0.75769) (U: 0.15536) (Q+U: -0.60232)
info string c3c4 (485 ) N: 3 (+ 0) (V: -27.98%) (P: 0.11%) (Q: -0.40043) (U: 0.02501) (Q+U: -0.37541)
info string f3e3 (579 ) N: 3 (+ 0) (V: -62.50%) (P: 4.24%) (Q: -0.74483) (U: 0.95081) (Q+U: 0.20598)
info string f3g4 (585 ) N: 3 (+ 0) (V: -24.01%) (P: 0.92%) (Q: -0.10577) (U: 0.20585) (Q+U: 0.10008)
info string e1f1 (101 ) N: 4 (+ 0) (V: -74.66%) (P: 6.08%) (Q: -0.81843) (U: 1.09180) (Q+U: 0.27337)
info string d7d8 (1486) N: 17 (+ 0) (V: -5.86%) (P: 11.04%) (Q: -0.26225) (U: 0.55060) (Q+U: 0.28835)
info string d7b7 (1478) N: 43 (+ 0) (V: -3.24%) (P: 8.22%) (Q: -0.05280) (U: 0.16761) (Q+U: 0.11482)
info string d7c7 (1479) N: 61 (+ 0) (V: 5.63%) (P: 19.26%) (Q: -0.09869) (U: 0.27884) (Q+U: 0.18016)
info string d7a7 (1477) N: 87 (+ 0) (V: -4.53%) (P: 35.05%) (Q: -0.08299) (U: 0.35741) (Q+U: 0.27442)
info string e1e7 (120 ) N: 561 (+55) (V: -38.69%) (P: 0.32%) (Q: 0.28310) (U: 0.00046) (Q+U: 0.28356)
I wouldn't say it's directly expected. But r50 was breaking the net, so with it fixed hopefully the net will fix other things too, or we will find the next problem.
There looks to be quite a bit of difference between id401 and id402. I'm surprised at how much the value can change in just one network.
game | 401 train | 401 V | 401 P | 402 train | 402 V | 402 P |
---|---|---|---|---|---|---|
sctr | 23.3% | -39.33% | 0.81% | 50.2% | 9.49% | 1.50% |
wasp | 55.5% | 23.36% | 0.55% | 70.7% | 46.39% | 1.31% |
exchess | 62.4% | -7.87% | 1.09% | 40.3% | 52.42% | 0.21% |
hakka | 100.0% | -17.70% | 71.05% | 100.0% | -11.81% | 68.55% |
ice | 30.2% | -19.84% | 0.29% | 47.2% | 19.62% | 0.81% |
bobcat | 74.5% | 21.60% | 0.36% | 69.8% | 54.93% | 6.07% |
houdini | 0.4% | -42.82% | 0.68% | 1.6% | -41.09% | 0.22% |
naum | 19.8% | -72.85% | 2.93% | 37.9% | -37.60% | 0.89% |
scorpio | 29.1% | -13.97% | 1.21% | 41.7% | -19.40% | 0.30% |
protector | 1.2% | -57.93% | 1.65% | 3.1% | -37.20% | 1.44% |
vajolet2 | 1.7% | -74.91% | 1.87% | 7.6% | -45.64% | 3.04% |
cheng | 2.6% | -72.67% | 0.97% | 5.3% | -50.46% | 0.26% |
Here's the progression of priors for the 3 lc0 tests so far. It looks like after the learning rate change to 0.01 for test 3, the max change in prior was reduced, with id250 averaging 7%, but then after id304 the changes jumped up to an average of 23%.
Here's the behavior of test 3 searching for the best move in each of the board positions from this issue (i.e., load the position, then `go nodes 800` and see how many visits out of 800 it gets without smart pruning):
It looks like it pretty solidly learned two (hakkapeliitta and bobcat) and has conflicted learning on two others (houdini and ice). There were brief blips of learning and then forgetting sctr, exchess, and scorpio. It never really considered wasp, naum, protector, vajolet, and cheng.
I would guess the conflicted learning happens because the NN sees the position as similar to other positions, so it's constantly training towards two or more different "correct" moves. I'm not sure whether a larger network that could differentiate the positions better would address this.
Here's the same analysis for Test 1 also learning hakkapeliitta and bobcat; briefly exploring houdini and ice; and none of the others after the initial noise:
And Test 2 also learned hakkapeliitta and bobcat; conflicted for sctr (and maybe scorpio at the end?); and none of the others after the initial noise:
For reference, id395 and later main networks after 50-normalization have learned only hakkapeliitta and none of the others.
Edit: Test 4 (?? normally numbered 1-57, but I added 500):
Edit: Test 8 including value-only/policy-less search as dotted lines:
Rerunning the original "SCTR" position with 11089 with varying visits (no noise, no softmax, no aversion):
800: info string a4h4 (666 ) N: 0 (+ 0) (P: 0.62%) (Q: -1.11990) (U: 0.59557) (Q+U: -0.52433) (V: -.----)
1600: info string a4h4 (666 ) N: 0 (+ 0) (P: 0.62%) (Q: -1.12639) (U: 0.84253) (Q+U: -0.28386) (V: -.----)
3200: info string a4h4 (666 ) N: 903 (+ 1) (P: 0.62%) (Q: 0.68225) (U: 0.00132) (Q+U: 0.68357) (V: 0.1151)
6400: info string a4h4 (666 ) N: 4102 (+ 1) (P: 0.62%) (Q: 0.65462) (U: 0.00041) (Q+U: 0.65503) (V: 0.1151)
Those would estimate average policy training from the existing 0.62% to: 0%, 0%, 28%, 64%
And with "Wasp" position:
800: info string e6e3 (560 ) N: 146 (+ 1) (P: 1.15%) (Q: 0.61700) (U: 0.00748) (Q+U: 0.62448) (V: 0.3566)
1600: info string e6e3 (560 ) N: 941 (+ 1) (P: 1.15%) (Q: 0.57949) (U: 0.00166) (Q+U: 0.58115) (V: 0.3566)
3200: info string e6e3 (560 ) N: 2528 (+ 1) (P: 1.15%) (Q: 0.58814) (U: 0.00088) (Q+U: 0.58901) (V: 0.3566)
6400: info string e6e3 (560 ) N: 5695 (+ 1) (P: 1.15%) (Q: 0.61024) (U: 0.00055) (Q+U: 0.61079) (V: 0.3566)
Similarly increasing 1.15% prior towards: 18%, 59%, 79%, 89%.
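These estimates are just the move's share of root visits, since a move's policy training target is the fraction of visits it receives:

```python
def visit_share(move_visits, total_visits):
    """Policy training target contributed by a search: the move's visit share."""
    return move_visits / total_visits

# SCTR a4h4 at 800 / 1600 / 3200 / 6400 total visits (N = 0, 0, 903, 4102):
print([round(visit_share(n, t), 2)
       for n, t in [(0, 800), (0, 1600), (903, 3200), (4102, 6400)]])
# → [0.0, 0.0, 0.28, 0.64], matching the 0%, 0%, 28%, 64% estimates
```

The Wasp numbers (N = 146, 941, 2528, 5695) reproduce the 18%, 59%, 79%, 89% progression the same way.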
For reference, here's the other top visited moves at 6400:
SCTR
info string a4c4 (661 ) N: 333 (+ 0) (P: 32.39%) (Q: -0.32429) (U: 0.26372) (Q+U: -0.06057) (V: 0.0365)
info string a6c4 (1148) N: 341 (+ 0) (P: 13.26%) (Q: -0.17237) (U: 0.10547) (Q+U: -0.06690) (V: 0.0656)
info string a6d3 (1145) N: 1161 (+ 0) (P: 17.40%) (Q: -0.19576) (U: 0.04073) (Q+U: -0.15503) (V: 0.0415)
info string a4h4 (666 ) N: 4102 (+ 1) (P: 0.62%) (Q: 0.65462) (U: 0.00041) (Q+U: 0.65503) (V: 0.1151)
Wasp
info string g7f6 (373 ) N: 5 (+ 0) (P: 1.01%) (Q: 0.12200) (U: 0.45651) (Q+U: 0.57851) (V: 0.1824)
info string g7h8 (364 ) N: 125 (+ 0) (P: 21.82%) (Q: 0.13783) (U: 0.47100) (Q+U: 0.60883) (V: 0.2042)
info string c3c2 (1219) N: 569 (+ 0) (P: 72.29%) (Q: 0.26552) (U: 0.34494) (Q+U: 0.61046) (V: 0.3406)
info string e6e3 (560 ) N: 5695 (+ 1) (P: 1.15%) (Q: 0.61024) (U: 0.00055) (Q+U: 0.61079) (V: 0.3566)
At least for these tactical positions, where the other moves are significantly worse than the one correct play, increasing visits allows MCTS to eventually spend enough visits on the higher-prior moves to rule them out, and then find the hidden tactics.
So instead of adjusting noise in various ways, simply doubling visits should lead to significantly more visits to the correct move, and consequently rapidly increase the prior training above the noise threshold.
(Increasing visits improves the policy head while keeping the existing noise settings, and it also improves the value head while keeping the existing temperature, without needing #237.)
I reran the positions with 11089, and things definitely seem better than before, finding 6 of the 12 correct tactical moves with self-play settings and 800 visits.
```
./lc0 -w idlc0-11089 --verbose-move-stats --policy-softmax-temp=1 --cpuct=1.2 --minibatch-size=1 --futile-search-aversion=0
```
SCTR
position startpos moves d2d4 d7d5 c1f4 g7g6 e2e3 g8f6 c2c4 c7c5 d4c5 f8g7 b1c3 d8a5 c4d5 f6d5 d1d5 g7c3 b2c3 a5c3 e1e2 c3a1 f4e5 a1b1 e5h8 c8e6 d5d3 b1a2 e2f3 f7f6 h8g7 b8d7 f3g3 a8c8 c5c6 c8c6 d3d4 c6d6 d4b4 d6b6 b4h4 d7c5 h2h3 b6b2 g1e2 a2d5 g3h2 d5e5 e2g3 h7h5 h4d4 e5d4 e3d4 c5b3 g7h6 h5h4 g3e4 g6g5 f1d3 b3d4 h1a1 a7a6 e4c5 b2f2 d3e4 e6f5 e4b7 f2c2 a1a4 d4e2 c5e4 f5e4 b7e4 c2c1 e4d3 e2f4 d3a6 f4h5
info string a4h4 (666 ) N: 0 (+ 0) (P: 0.62%) (Q: -0.82955) (U: 0.21020) (Q+U: -0.61935) (V: -.----)
Wasp
position startpos moves e2e4 g8f6 e4e5 f6d5 c2c4 d5b6 d2d4 d7d6 e5d6 e7d6 g1f3 c8g4 f1e2 f8e7 h2h3 g4f3 e2f3 b8c6 b1a3 e8g8 e1g1 f8e8 b2b3 c6d4 f3b7 a8b8 d1d4 e7f6 d4d1 f6a1 b7c6 e8e6 c6f3 a7a6 a3c2 a1c3 d1d3 d8f6 c2e3 f6e5 f1d1 g7g6 h3h4 e5a5 d3c2 c3g7 h4h5 a5c3 h5h6
info string e6e3 (560 ) N: 0 (+ 0) (P: 1.15%) (Q: -0.65881) (U: 0.39077) (Q+U: -0.26804) (V: -.----)
EXchess
position startpos moves g1f3 g8f6 g2g3 e7e6 f1g2 f8e7 c2c4 d7d5 e1g1 e8g8 d2d4 d5c4 f3e5 c7c5 d4c5 d8c7 e5c4 c7c5 b2b3 f8d8 b1d2 c5c7 c1b2 b8c6 a1c1 a8b8 a2a3 f6d5 b3b4 b7b5 c4a5 c8b7 c1c2 e7f8 d1b1 b8c8 f1c1 c7d7 a5b7 d7b7 d2b3 a7a6 e2e3 b7d7 c2d2 d7e7 h2h4 e7b7 b3c5 b7a8 c5e4 h7h6 e4c5 a6a5 c5e6 f7e6 b1g6 d8d6 g2e4 c6e7 g6h7 g8f7 c1d1 a5b4 e4f3
info string d5f6 (751 ) N: 154 (+ 1) (P: 2.55%) (Q: 0.16138) (U: 0.00555) (Q+U: 0.16693) (V: 0.3750)
Hakkapeliitta
position startpos moves e2e4 c7c5 g1f3 e7e6 d2d4 c5d4 f3d4 b8c6 b1c3 g8f6 d4c6 b7c6 e4e5 f6d5 c3e4 d8c7 f2f4 c7b6 a2a3 f8e7 c2c4 d5e3 d1d3 e3f1 h1f1 c6c5 f1f2 f7f5 e4d6 e7d6 d3d6 b6d6 e5d6 e8f7 b2b4 c8a6 b4b5 a6b7 a3a4 h7h5 a4a5 h5h4 a5a6 b7e4 c1e3 h4h3 g2g3 h8c8 f2a2 a8b8 a1c1 f7g6 e1f1 g6h5 f1f2 h5g4 a2e2 b8b6 e2d2 e4f3 c1c3 g7g6 c3c1 f3g2 c1c3 g2f3 c3a3 f3e4 a3a1 e4f3 a1c1 f3g2 c1a1 g2e4 a1a3 e4f3 a3c3 f3e4 f2g1 e4f3 c3d3 b6b8 d3c3 f3e4 g1f2 b8b6 f2e2 e4g2 e3g1 g2e4 e2e1 e4f3 d2d3 f3g2 e1e2 g2e4 d3d2 b6b8 d2a2 b8b6 a2d2 b6b8 g1e3 b8b6 c3c1 e4f3 e2f1 b6b8 f1f2 b8b6 c1e1 f3e4 e1d1 e4f3 d1a1 b6b8 f2g1 f3e4 g1f2 b8b6 a1a2 e4f3 a2b2 b6b8 b2b3 b8b6 b3d3 f3e4 d3b3 e4f3 b3b1 f3e4 b1e1 e4f3 f2g1 f3e4 e1f1 e4g2 f1e1 g2f3 g1f2 f3e4 e1f1 e4f3 f2g1 f3e4 g1f2 e4f3 f1g1 f3e4 g1d1 e4f3 f2e1
info string f3d1 (1321) N: 799 (+ 1) (P: 92.58%) (Q: -0.02764) (U: 0.03920) (Q+U: 0.01156) (V: 0.1661)
iCE
position startpos moves e2e4 c7c6 g1f3 d7d5 e4e5 c6c5 f1e2 b8c6 e1g1 c8g4 c2c4 d5c4 b1a3 e7e6 a3c4 f8e7 d2d3 g8h6 c1h6 g7h6 d1d2 h6h5 d2f4 h8g8 f1e1 d8d7 a1d1 e8c8 f4f7 h7h6 f7h7 h5h4 h7h6 d8f8 h6h7 c8b8 c4e3 g4f3 e2f3 c6e5 f3e4 e5f7 f2f4 d7c7 e1f1 e7f6 d1e1 f7d6 h7c7 b8c7 b2b3 f6d4 g1h1 b7b5 e1e2 a7a5 e3c2 d4b2 e4f3 c7d7 c2e3 b2d4 a2a4 b5a4 b3a4 f8f4
info string f3c6 (592 ) N: 746 (+ 1) (P: 10.44%) (Q: 0.13812) (U: 0.00474) (Q+U: 0.14285) (V: 0.1866)
Bobcat
position startpos moves d2d4 d7d5 g1f3 c7c6 c2c4 g8f6 b1c3 d5c4 a2a4 c8f5 e2e3 e7e6 f1c4 b8d7 d1b3 d8b6 a4a5 b6b3 c4b3 f5d3 b3d1 f8d6 d1e2 d3g6 e1g1 e8g8 c1d2 h7h6 f1c1 a7a6 c3a4 f6e4 d2e1 f8e8 g1f1 a8d8 f3d2 e4d2 e1d2 e6e5 d4e5 d7e5 d2c3 e5d7 c1d1 d6e7 a1c1 d7f6 c3d4 f6d7 d4c3 d7f6 c3d4 f6d7 h2h3 g6f5 e2d3 f5e6 d3c4 e6f5 f2f3 c6c5 d4c3 e7g5 g2g4 f5e6 c4e6 e8e6 f3f4 g5e7 f1e2 e6c6 d1d5 c6d6 d5d6 e7d6 c1d1 d6e7 b2b3 f7f6 h3h4 g8f7 h4h5 f7e8 e3e4 d8c8 e2d3 c8c6 d3c4 c6e6 d1e1 e6c6 e4e5 f6e5 f4e5 d7f8 a4b6 f8e6 c4d5 e6c7 d5e4 e8f7 e4f5 g7g6 f5e4 c7b5 c3d2 b5d4 e1b1 d4e2 b1f1 f7e8 f1f3 g6h5 g4h5 e2d4 f3g3 e7f8 b6c4 e8f7 e4d5 d4b5 g3d3 f7e8 d2e3 c6c7 c4d6 b5d6 e5d6 f8d6
info string e3h6 (561 ) N: 563 (+ 0) (P: 11.17%) (Q: 0.33399) (U: 0.00672) (Q+U: 0.34071) (V: 0.6047)
Houdini
position startpos moves d2d4 e7e6 c2c4 f8b4 c1d2 b4e7 e2e4 d7d5 e4e5 c7c5 d1g4 e7f8 d4c5 h7h5 g4g3 h5h4 g3a3 b8d7 g1f3 f8c5 b2b4 c5b6 d2g5 g8e7 a3b2 h8h5 c4d5 e6d5 f1b5 e8f8 e1g1 d7e5 b2e5 f7f6 e5f4 b6c7 f4e3 f6g5 b1c3 d8d6 b5d3 c7b6 e3e2 h4h3 f1e1 g5g4 f3e5 h5g5 e5g6 g5g6 d3g6 c8d7 g6h5 a8c8 a1c1 c8c4
info string e2e7 (330 ) N: 492 (+ 1) (P: 4.66%) (Q: 0.76301) (U: 0.00320) (Q+U: 0.76621) (V: -0.6224)
Naum
position startpos moves d2d4 f7f5 g1f3 e7e6 g2g3 b8c6 f1g2 g8f6 e1g1 d7d5 c2c4 d5c4 d1a4 c8d7 a4c4 f8d6 b1c3 e8g8 c1g5 h7h6 g5f6 d8f6 e2e4 c6a5 c4e2 f6g6 a1d1 g6h5 e4e5 d6b4 d4d5 b4c3 b2c3 a8d8 f1e1 c7c5 c3c4 h5g4 d1c1 f5f4 h2h3 g4g6 g3g4 h6h5 f3h2 h5g4 h2g4 g6g5 g1h2 d8e8 g2f3 g8h8 e1g1 g5h4 e2d2 b7b6 c1c3 e6d5 f3d5 h4h5 c3f3 a5c6 e5e6 c6d4 e6d7
info string h5d5 (882 ) N: 435 (+ 1) (P: 4.34%) (Q: -0.28057) (U: 0.00337) (Q+U: -0.27720) (V: -0.6316)
Scorpio
position startpos moves g1f3 d7d5 e2e3 c7c5 d2d4 g8f6 c2c4 c5d4 e3d4 b8c6 c4d5 f6d5 b1c3 g7g6 f1c4 d5b6 c4b3 f8g7 e1g1 e8g8 d4d5 c6a5 f1e1 a5b3 a2b3 c8g4 h2h3 g4f3 d1f3 f8e8 c1e3 g7c3 e3b6 d8b6 b2c3 b6b3 a1b1 b3a3 b1b7 a7a5 b7d7 a8d8
info string e1e7 (120 ) N: 726 (+ 1) (P: 17.35%) (Q: 0.19556) (U: 0.00808) (Q+U: 0.20365) (V: 0.1929)
Protector
position startpos moves d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 b1c3 d5c4 e2e4 f8b4 c1g5 h7h6 g5f6 d8f6 f1c4 c7c5 e1g1 c5d4 e4e5 f6d8 d1d4 d8d4 f3d4 e8e7
info string d4f5 (763 ) N: 0 (+ 0) (P: 0.65%) (Q: -0.87489) (U: 0.22101) (Q+U: -0.65388) (V: -.----)
Vajolet
position startpos moves e2e4 c7c5 g1e2 b8c6 d2d4 c5d4 e2d4 d7d6 c2c4 e7e5 d4c2 f8e7 b1c3 g8f6 f1e2 c8e6 e1g1 e8g8 b2b3 a8c8 c1e3 f6d7 d1d2 f7f5 e4f5 e6f5 a1d1 d8e8 e2d3 f5e6 f2f4 d7f6 f4f5 e6f7 d3e2 c8d8 c3d5 b7b6 g2g4 f7d5 c4d5 c6b8 c2a3 d8c8 a3b5
info string f6g4 (590 ) N: 109 (+ 0) (P: 11.63%) (Q: -0.40228) (U: 0.03585) (Q+U: -0.36642) (V: -0.2968)
Cheng
position startpos moves e2e4 e7e6 c2c4 d7d5 c4d5 e6d5 e4d5 g8f6 f1b5 c8d7 b5c4 d8e7 g1e2 e7e4 d2d3 e4g2 h1g1 g2h2 c1f4 h2h5 d1b3 b7b5 c4b5 f6d5 g1g5 h5h1 g5g1 h1h5 f4g5 d5b6 b1c3 f8d6 e1c1 a7a6 b5d7 b8d7 c3e4 e8g8 e2c3 d6h2 g1h1 d7e5 f2f4 e5f3 d1f1 f3g5 f4g5
info string h2f4 (1582) N: 0 (+ 0) (P: 0.07%) (Q: -1.50398) (U: 0.02522) (Q+U: -1.47876) (V: -.----)
If using the default match settings for cpuct and softmax, 11089 finds all except one:
11089 sctr info string a4h4 (666 ) N: 624 (+ 1) (P: 2.41%) (Q: 0.61425) (U: 0.00369) (Q+U: 0.61794) (V: 0.1151)
11089 wasp info string e6e3 (560 ) N: 660 (+ 1) (P: 4.07%) (Q: 0.61644) (U: 0.00591) (Q+U: 0.62235) (V: 0.3566)
11089 exchess info string d5f6 (751 ) N: 669 (+ 1) (P: 3.70%) (Q: 0.15048) (U: 0.00530) (Q+U: 0.15578) (V: 0.3750)
11089 hakkapeliitta info string f3d1 (1321) N: 699 (+ 1) (P: 36.16%) (Q: 0.00541) (U: 0.04957) (Q+U: 0.05498) (V: 0.1661)
11089 ice info string f3c6 (592 ) N: 693 (+ 1) (P: 8.62%) (Q: 0.25334) (U: 0.01193) (Q+U: 0.26526) (V: 0.1866)
11089 bobcat info string e3h6 (561 ) N: 550 (+ 0) (P: 11.58%) (Q: 0.45141) (U: 0.02020) (Q+U: 0.47161) (V: 0.6047)
11089 houdini info string e2e7 (330 ) N: 715 (+ 1) (P: 5.11%) (Q: 0.66406) (U: 0.00686) (Q+U: 0.67092) (V: -0.6224)
11089 naum info string h5d5 (882 ) N: 717 (+ 1) (P: 5.22%) (Q: 0.00369) (U: 0.00697) (Q+U: 0.01067) (V: -0.6316)
11089 scorpio info string e1e7 (120 ) N: 655 (+ 1) (P: 12.00%) (Q: 0.30091) (U: 0.01755) (Q+U: 0.31846) (V: 0.1929)
11089 protector info string d4f5 (763 ) N: 484 (+ 0) (P: 2.01%) (Q: 0.11955) (U: 0.00397) (Q+U: 0.12353) (V: -0.0408)
11089 vajolet info string f6g4 (590 ) N: 262 (+ 1) (P: 9.53%) (Q: -0.30616) (U: 0.03468) (Q+U: -0.27148) (V: -0.2968)
11089 cheng info string h2f4 (1582) N: 0 (+ 0) (P: 0.91%) (Q: -1.46775) (U: 0.87181) (Q+U: -0.59594) (V: -.----)
And here's the result with latest test20:
self-play settings
20633 sctr info string a4h4 (666 ) N: 0 (+ 0) (P: 0.46%) (Q: -1.26248) (U: 0.15508) (Q+U: -1.10740) (V: -.----)
20633 wasp info string e6e3 (560 ) N: 0 (+ 0) (P: 1.31%) (Q: -0.71427) (U: 0.44550) (Q+U: -0.26876) (V: -.----)
20633 exchess info string d5f6 (751 ) N: 0 (+ 0) (P: 1.63%) (Q: -1.29837) (U: 0.55381) (Q+U: -0.74456) (V: -.----)
20633 hakkapeliitta info string f3d1 (1321) N: 796 (+ 2) (P: 86.87%) (Q: 0.09656) (U: 0.03688) (Q+U: 0.13343) (V: 0.0620)
20633 ice info string f3c6 (592 ) N: 751 (+ 1) (P: 22.72%) (Q: 0.21161) (U: 0.01024) (Q+U: 0.22185) (V: 0.2895)
20633 bobcat info string e3h6 (561 ) N: 0 (+ 0) (P: 0.57%) (Q: -1.55809) (U: 0.19403) (Q+U: -1.36406) (V: -.----)
20633 houdini info string e2e7 (330 ) N: 0 (+ 0) (P: 1.46%) (Q: -0.88607) (U: 0.49545) (Q+U: -0.39062) (V: -.----)
20633 naum info string h5d5 (882 ) N: 0 (+ 0) (P: 0.39%) (Q: -1.54053) (U: 0.13308) (Q+U: -1.40745) (V: -.----)
20633 scorpio info string e1e7 (120 ) N: 0 (+ 0) (P: 1.11%) (Q: -0.96234) (U: 0.37770) (Q+U: -0.58464) (V: -.----)
20633 protector info string d4f5 (763 ) N: 0 (+ 0) (P: 0.51%) (Q: -0.84197) (U: 0.17235) (Q+U: -0.66962) (V: -.----)
20633 vajolet info string f6g4 (590 ) N: 0 (+ 0) (P: 1.11%) (Q: -0.97088) (U: 0.37576) (Q+U: -0.59512) (V: -.----)
20633 cheng info string h2f4 (1582) N: 0 (+ 0) (P: 0.39%) (Q: -1.20504) (U: 0.13392) (Q+U: -1.07112) (V: -.----)
match settings
20633 sctr info string a4h4 (666 ) N: 540 (+ 1) (P: 1.94%) (Q: 0.39817) (U: 0.00345) (Q+U: 0.40162) (V: -0.2688)
20633 wasp info string e6e3 (560 ) N: 394 (+ 0) (P: 3.25%) (Q: 0.29576) (U: 0.00791) (Q+U: 0.30368) (V: 0.4163)
20633 exchess info string d5f6 (751 ) N: 567 (+ 1) (P: 3.48%) (Q: 0.02776) (U: 0.00588) (Q+U: 0.03363) (V: 0.3233)
20633 hakkapeliitta info string f3d1 (1321) N: 687 (+ 1) (P: 30.92%) (Q: 0.08999) (U: 0.04313) (Q+U: 0.13312) (V: 0.0620)
20633 ice info string f3c6 (592 ) N: 618 (+ 1) (P: 10.49%) (Q: 0.13358) (U: 0.01626) (Q+U: 0.14984) (V: 0.2895)
20633 bobcat info string e3h6 (561 ) N: 476 (+ 1) (P: 3.95%) (Q: 0.36131) (U: 0.00795) (Q+U: 0.36925) (V: 0.4301)
20633 houdini info string e2e7 (330 ) N: 621 (+ 1) (P: 3.09%) (Q: 0.38665) (U: 0.00476) (Q+U: 0.39141) (V: -0.2205)
20633 naum info string h5d5 (882 ) N: 509 (+ 1) (P: 1.66%) (Q: -0.08083) (U: 0.00313) (Q+U: -0.07770) (V: -0.4912)
20633 scorpio info string e1e7 (120 ) N: 555 (+ 1) (P: 3.31%) (Q: 0.28537) (U: 0.00571) (Q+U: 0.29108) (V: -0.2923)
20633 protector info string d4f5 (763 ) N: 317 (+ 1) (P: 1.43%) (Q: 0.02249) (U: 0.00431) (Q+U: 0.02679) (V: -0.2807)
20633 vajolet info string f6g4 (590 ) N: 52 (+ 0) (P: 2.70%) (Q: -0.21902) (U: 0.04899) (Q+U: -0.17003) (V: -0.1830)
20633 cheng info string h2f4 (1582) N: 68 (+ 0) (P: 1.60%) (Q: -0.29140) (U: 0.02222) (Q+U: -0.26918) (V: -0.3185)
Interesting to see how different the initial network V can be from the searched Q in these positions.
Porting to lc0 of lczero issues https://github.com/glinscott/leela-chess/issues/698 and https://github.com/glinscott/leela-chess/issues/699 using the same game for analysis:
CCLS SCTR vs id359 game 1
Trying to find Rxh4 https://clips.twitch.tv/NimbleLazyNewtPRChase:

```
position startpos moves d2d4 d7d5 c1f4 g7g6 e2e3 g8f6 c2c4 c7c5 d4c5 f8g7 b1c3 d8a5 c4d5 f6d5 d1d5 g7c3 b2c3 a5c3 e1e2 c3a1 f4e5 a1b1 e5h8 c8e6 d5d3 b1a2 e2f3 f7f6 h8g7 b8d7 f3g3 a8c8 c5c6 c8c6 d3d4 c6d6 d4b4 d6b6 b4h4 d7c5 h2h3 b6b2 g1e2 a2d5 g3h2 d5e5 e2g3 h7h5 h4d4 e5d4 e3d4 c5b3 g7h6 h5h4 g3e4 g6g5 f1d3 b3d4 h1a1 a7a6 e4c5 b2f2 d3e4 e6f5 e4b7 f2c2 a1a4 d4e2 c5e4 f5e4 b7e4 c2c1 e4d3 e2f4 d3a6 f4h5
```

![screen shot 2018-05-31 at 10 38 50 am](https://user-images.githubusercontent.com/438537/40798095-1038c912-64bf-11e8-9e35-b9bc479b0b22.png)

Here's the history of networks from 364 going back 10 at a time and what they thought of the winning move Rxh4 / a4h4 (focus on V and P for now):
Generally, the prior for this winning move is very low at under 1%, and the value is also unfavorable for white, so search will normally avoid it. This makes it hard to learn tactics where playing an initially bad-looking move opens up a better outcome.
That's where noise comes in to trick search into visiting it more, and here's 50 runs of

```
./lc0 --weights=id359 --verbose-move-stats --noise --no-smart-pruning
```

with `go nodes 800` from the above `position startpos …`:

Here, 13 of 50 games would have produced valuable training data, so noise is indeed working, but the majority is training to avoid the correct move. Averaging this training data for the move across 50 games should cause P to move towards 16.3% (= 6523 / ~800 / 50). But then, combined with training data from other games, the networks have learned to keep avoiding this move.
As from the other issue: The premise is that when a self-play game does end up in a learnable board state, it seems unfortunate that it misses the opportunity to generate valuable training data for the correct move more often than not. Clearly, AZ's numbers are good enough to eventually generate strong networks, but perhaps training search could be better optimized?
I've rerun the analysis with lc0, 50 games per configuration, from the above board state to measure the average training data for the expected tactic:
Testing patches for visit twice and negative fpu
```diff
diff --git a/src/mcts/search.cc b/src/mcts/search.cc
--- a/src/mcts/search.cc
+++ b/src/mcts/search.cc
@@ -650,4 +650,9 @@ Node* Search::PickNodeToExtend(Node* node, PositionHistory* history) {
   for (Node* iter : node->Children()) {
     if (is_root_node) {
+      if (kNoise && iter->GetN() < 2) {
+        node = iter;
+        possible_moves = 2;  // avoid "only one possible move" short circuit
+        break;
+      }
       // If there's no chance to catch up the currently best node with
       // remaining playouts, not consider it.
```

```diff
diff --git a/src/mcts/search.cc b/src/mcts/search.cc
--- a/src/mcts/search.cc
+++ b/src/mcts/search.cc
@@ -645,5 +645,5 @@ Node* Search::PickNodeToExtend(Node* node, PositionHistory* history) {
   float parent_q =
       (is_root_node && kNoise)
-          ? -node->GetQ(0, kExtraVirtualLoss)
+          ? -node->GetQ(0, kExtraVirtualLoss) + kFpuReduction
           : -node->GetQ(0, kExtraVirtualLoss) -
                 kFpuReduction * std::sqrt(node->GetVisitedPolicy());
```

I only ran one "visit each root move twice" configuration, as even with the default search parameters it generally searches much deeper after being nudged over by the forced breadth exploration. This is true across all the previously listed networks from id364 to id124 above, and the outputs with high Ns are with "visit twice."
Is there an appropriate level of average tactic training? It looks like the current 16.3% is too low to outweigh the other training data. A related question is how often self-play games get into learnable states, but I don't have a good way to answer that.