Mardak closed this issue 5 years ago.
> Averaging this training data for the move across 50 games should cause P to move towards 16.3%
I think this is probably good enough. As it moves towards 16.3% it will accelerate and move up even faster.
Also, generally I think we should not make any of these sorts of changes that try to improve on the paper until after we fix things that are probably wrong, such as rule50.
> As it moves towards 16.3% it will accelerate and move up even faster.
Yes, in fact, from all the runs, once the prior for this particular move reaches P: 1.28%, it'll start driving at least 100 of 800 visits towards it. Similarly, once it gets to P: 2.25%, over 700 visits will go to it, for nearly 90% average tactic training. I.e., the networks are not yet in a virtuous cycle of self-learning for this tactic.
However, if you look at the data showing the priors for this move across the various network ids, the prior has stayed around 0.3%. That same data shows that nearly all those networks would have put over 700 visits into the move if search had initially given it just 2 visits.
That means even with the current "16.3% average tactic training," there is far more training data driving the prior towards 0.3% than towards anything higher. In other words, across 250 network generations, the existing "16.3%" noise has been unable to get the network to learn this tactic, when 2 initial visits would have.
The new network prior approaches `((16.3% * number of similar board states) + (0% * other board states)) / total board states`. If someone has any suggestions for how to measure the number of these learnable board states, that would be great. (Although I doubt we would do anything proactive to increase the number of similar board states, we might do something to increase the 16.3%.)
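That dilution can be sketched with a toy calculation (not lc0 code; the 1-in-1000 similar-state ratio below is a made-up illustration):

```python
# Toy model of how the policy target for this move dilutes across training data.
# Assumptions (from the analysis above): games reaching the tactic average a
# 16.3% policy target for the move; all other positions target ~0% for it.

def diluted_target(tactic_target, similar_states, total_states):
    """Average policy target over all training positions for this move:
    ((tactic_target * similar) + (0 * others)) / total."""
    return tactic_target * similar_states / total_states

# If only 1 in 1000 training positions resembles this board state (made-up
# ratio), the overall pull on the prior is tiny:
print(diluted_target(0.163, 1, 1000))
```

So even a healthy per-game training signal can be swamped unless either the 16.3% goes up or similar positions show up more often in training.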
@ASilver requested running with 3.1 PUCT, and I see that the latest lc0 master https://github.com/LeelaChessZero/lc0/commit/2321011913d8a7914d8177e82ceaf34fbe2d6ee8 uses that and gets 24.9%. The earlier runs were against then-next https://github.com/LeelaChessZero/lc0/commit/50542694af7c5e50d8c4d5a60f57a54d9247cf88 with 16.3% average tactic training.
Here's a graph of testing various PUCT at noised root:
```diff
diff --git a/src/mcts/search.cc b/src/mcts/search.cc
--- a/src/mcts/search.cc
+++ b/src/mcts/search.cc
@@ -677 +677 @@ std::pair<Node*, bool> Search::PickNodeToExtend(Node* node,
- float factor = kCpuct * std::sqrt(std::max(node->GetChildrenVisits(), 1u));
+ float factor = (is_root_node && kNoise ? 3.5f : kCpuct) * std::sqrt(std::max(node->GetChildrenVisits(), 1u));
```
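For intuition, here's a sketch (not the actual lc0 code) of the AlphaZero-style exploration term this patch scales, assuming the selection bonus is the patched `factor` times P / (1 + child visits):

```python
import math

def u_term(cpuct, prior, parent_visits, child_visits):
    # Exploration bonus: cpuct * P * sqrt(parent visits) / (1 + child visits),
    # i.e. the patched `factor` multiplied by P / (1 + n).
    return cpuct * prior * math.sqrt(max(parent_visits, 1)) / (1 + child_visits)

# A 0.3% prior move with no visits yet, partway into an 800-visit search:
low = u_term(1.2, 0.003, 800, 0)    # old root cpuct
high = u_term(3.5, 0.003, 800, 0)   # patched root cpuct under noise
print(low, high)
```

The bonus scales linearly with cpuct, so the 3.5 root value gives roughly 3x the exploration pressure of 1.2 for the same low-prior move.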
Here's some more analysis on other board states from https://github.com/glinscott/leela-chess/issues/698#issuecomment-393666516:
Here's the "average tactic training" for various engine configurations and board states:
config | SCTR/359 | 359/Wasp | 351/EXch | 351/Hakk | iCE/351 | 351/Bobc |
---|---|---|---|---|---|---|
root PUCT 1.2 | 19.9% | 12.7% | 38.9% | 57.6% | 32.8% | 40.4% |
default | 24.9% | 18.8% | 28.6% | 63.3% | 42.7% | 44.3% |
ε 0.5 | 31.5% | 38.1% | 49.1% | 56.9% | 46.7% | 51.0% |
α 3.0 | 52.8% | 35.4% | 77.2% | 75.1% | 71.7% | 78.3% |
ε 0.5, α 3.0 | 69.9% | 65.0% | 88.8% | 77.4% | 79.4% | 74.6% |
twice | 86.0% | 13.0% | 87.3% | 74.2% | 83.5% | 81.5% |
These are all tested with current master https://github.com/LeelaChessZero/lc0/commit/2321011913d8a7914d8177e82ceaf34fbe2d6ee8 where default is PUCT 3.1, α 0.3, ε 0.25, no twice visits. The patches for root PUCT and twice visits are in earlier comments, and the noise changes are just changes to the `ApplyDirichletNoise(node, 0.25, 0.3)` call.
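For reference, a minimal sketch of what that Dirichlet-noise mixing does conceptually (the gamma-sampling trick here stands in for however lc0 draws the Dirichlet sample internally):

```python
import random

def apply_dirichlet_noise(priors, eps=0.25, alpha=0.3):
    """Mix Dirichlet(alpha) noise into root priors: p' = (1 - eps) * p + eps * d.
    Conceptual stand-in for the ApplyDirichletNoise(node, 0.25, 0.3) call."""
    gammas = [random.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(gammas) or 1.0
    return [(1 - eps) * p + eps * g / total for p, g in zip(priors, gammas)]

random.seed(0)
priors = [0.50, 0.30, 0.17, 0.003, 0.027]   # a 0.3% tactic move hidden among others
noised = apply_dirichlet_noise(priors)
print(noised)
```

With α as low as 0.3 the noise is spiky: occasionally most of the ε = 0.25 mass lands on a single move, lifting even a 0.3% prior enough to draw visits. Raising ε mixes in more noise mass overall, while raising α spreads it more evenly across moves, which is consistent with the ε 0.5, α 3.0 rows reliably producing more tactic training in the table above.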
I used the networks listed (id351 or id359) because, notably, the latest id369 has already learned the tactic from the Hakkapeliitta game, where with PUCT 1.2 (lczero 0.6) an average tactic training of 57.6% was enough to learn it.
Here are the priors for each of the expected moves:
game | old prior | id369 prior |
---|---|---|
SCTR | 0.33% | 0.43% |
Wasp | 0.23% | 0.19% |
EXchess | 0.03% | 0.02% |
Hakkapeliitta | 0.32% | 50.61% |
iCE | 0.08% | 0.12% |
Bobcat | 0.14% | 0.32% |
So it looks like noise is indeed working, and with the PUCT change to 3.1, there will be less need for additional changes to improve tactics training.
I analyzed all the CCLS id359 games to find low-prior moves that were played, to see if the same network would have found them with noise. Here's a first set of moves the other engines played that the network didn't really consider at all. Not all have major swings in win rate or even change the outcome, but at least in this first one against Houdini, lczero thought it had a 63% win rate, yet after the 0.15%-prior move, it's actually the opponent with a 75% win rate.
For each unexpectedly played move: a screenshot, the UCI position, what id359 thought of it, and the top alternative moves when forced to explore it with at least 10 of 800 visits:
And the same analysis as before, with 50 noised games from the above board states to calculate the average training:
config | Houdini | Naum | Scorpio | Protector | Vajolet | Cheng |
---|---|---|---|---|---|---|
root PUCT 1.2 | 6.8% | 2.2% | 4.5% | 1.3% | 0.3% | 19.9% |
default | 8.2% | 21.3% | 20.1% | 0.6% | 4.7% | 15.8% |
ε 0.5 | 19.6% | 36.5% | 23.2% | 4.8% | 9.5% | 33.5% |
α 3.0 | 1.3% | 17.2% | 26.5% | 0.3% | 0.3% | 38.0% |
ε 0.5, α 3.0 | 10.0% | 53.4% | 57.6% | 1.0% | 5.2% | 55.5% |
twice | 7.6% | 10.5% | 57.0% | 2.8% | 1.9% | 32.2% |
I rebased @ASilver's params from #46 onto https://github.com/LeelaChessZero/lc0/commit/2321011913d8a7914d8177e82ceaf34fbe2d6ee8 where I did the earlier tests. Analyzing the same 12 games from earlier with the same networks, each with 50 runs of noise, the average training goes up quite a bit. This is most likely from softmax, as it boosts the policy substantially in each of these cases, where the usual priors are much lower. I've included the training numbers and move prior for default and with the adjusted params:
game | default training | adjusted training | default prior | adjusted prior |
---|---|---|---|---|
SCTR | 24.9% | 66.9% | 0.33% | 1.81% (5.5x) |
Wasp | 18.8% | 77.2% | 0.23% | 1.68% (7.3x) |
EXchess | 28.6% | 81.0% | 0.03% | 0.60% (20x) |
Hakkapeliitta | 78.9% | 78.7% | 0.32% | 2.14% (6.7x) |
iCE | 42.7% | 77.7% | 0.08% | 0.97% (12x) |
Bobcat | 44.3% | 75.5% | 0.14% | 3.03% (22x) |
Houdini | 8.2% | 19.5% | 0.15% | 1.02% (6.8x) |
Naum | 21.3% | 39.0% | 0.05% | 0.87% (17x) |
Scorpio | 20.1% | 50.3% | 0.07% | 1.09% (16x) |
Protector | 0.6% | 5.0% | 0.10% | 0.86% (8.6x) |
Vajolet | 4.7% | 14.8% | 0.17% | 1.32% (7.8x) |
Cheng | 15.8% | 53.0% | 0.01% | 0.38% (38x) |
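The softmax effect on low priors can be sketched as follows (policy-softmax-temperature flattening; the priors here are made up, with one tiny prior standing in for the tactic moves):

```python
def softmax_temp(priors, t):
    """Flatten (t > 1) or sharpen (t < 1) a policy: p_i' ∝ p_i ** (1/t).
    Flattening compresses the range, so tiny priors gain much larger
    multipliers than large ones -- as in the table, where the sub-0.1%
    priors gained the biggest multiples (20x-38x)."""
    powered = [p ** (1.0 / t) for p in priors]
    total = sum(powered)
    return [p / total for p in powered]

priors = [0.60, 0.25, 0.10, 0.0497, 0.0003]
flattened = softmax_temp(priors, 2.0)
print([round(p, 4) for p in flattened])
# the 0.03% move gains a far larger multiple than the 60% move loses
```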
As @killerducky pointed out earlier, we probably shouldn't touch the training until other things are fixed, so these numbers are at least reassuring: if the network had never become so biased against these moves to begin with, it would find the correct moves fairly naturally with the default noise settings.
In terms of network progression, even with a clean start, priors could end up very low like these 0.0x% numbers because the value head hasn't yet learned to favor a position, so training search would give it fewer visits. But I suppose that could be revisited later if training seems stuck again and failing to generate useful data.
There was a request to check the latest lc0 test id14 4ce96dba to see what it thought of each of these games. It looks like the network already avoids most of these moves except in a couple of games, and has trouble generating training data to increase the prior for the move. Below are the average training using default noise as well as the priors for the move:
game | training | prior |
---|---|---|
SCTR | 10.8% | 0.06% |
Wasp | 8.2% | 0.14% |
EXchess | 2.0% | 0.04% |
Hakkapeliitta | 98.8% | 94.69% |
iCE | 23.9% | 0.27% |
Bobcat | 92.6% | 2.89% |
Houdini | 5.5% | 0.20% |
Naum | 16.4% | 0.07% |
Scorpio | 1.2% | 0.07% |
Protector | 0.2% | 0.07% |
Vajolet | 1.6% | 0.16% |
Cheng | 0.8% | 0.45% |
Here's a check to see if the network would have found the move if forced to explore at least 20 visits out of 2000:
sctr Rxh4 -> 1529 ( 76.45%) (V: 74.10%) (N: 0.06%) PV: Rxh4 Nf4 Rg4 Ng6 Rc4 Rxc4 Bxc4 Kd7 Bf7 Ne5 Bd5 Kd6 Bb3 Nc6 Bf7 Ne5 Bh5 Ke6 Bg7
wasp Rxe3 -> 531 ( 26.54%) (V: 57.76%) (N: 0.14%) PV: Rxe3 Qxc3 Rxc3 hxg7 Nxc4 bxc4 Kxg7 Be3 Rxc4 Bd5 Ra4 Rc1 c5 Bb3 Rab4 Bd5 Ra4 Bb3
exchess Nf6 -> 1169 ( 58.45%) (V: 68.48%) (N: 0.04%) PV: Nf6 Rxd6 Qxf3 Bxf6 Kxf6 axb4 Nf5 Rd7 Bxb4 h5 Rc2 Qg6+ Ke5 Rf1 Bc5 g4 Nxe3 Qxg7+ Ke4
hakkapeliitta Bxd1 -> 1456 ( 72.76%) (V: 61.50%) (N: 94.69%) PV: Bxd1 Rxd1 Kf3 Rd3 Kg2 Ke2 Kxh2 Kf2 Kh1 Rd1+ Kh2 Kf3 Rbb8 g4 fxg4+ Kxg4 Kg2 Rd2+ Kf1 Kxh3 Ke1 Kg4 Rb6 Kg5 Rg8 Kf6 Rf8+ Ke7 Rc8
ice Bc6+ -> 1392 ( 69.60%) (V: 68.57%) (N: 0.27%) PV: Bc6+ Kxc6 Rxf4 Rb8 Nc2 e5 Rxh4 Rb3 Rh3 Kd5 g4 Rb1+ Re1 Rxe1+ Nxe1 c4 dxc4+ Kxc4 g5
bobcat Bxh6 -> 1594 ( 79.70%) (V: 65.11%) (N: 2.89%) PV: Bxh6 c4 bxc4 Bb4 Be3 Rd7+ Ke4 Rxd3 Kxd3 Bxa5 h6 Kf7 c5 Kg8 Kc4 Kh7 Kd5 Bd8 Kd6
houdini Qxe7+ -> 1254 ( 62.67%) (V: 76.36%) (N: 0.20%) PV: Qxe7+ Qxe7 Rxe7 Kxe7 Nxd5+ Kd6 Rxc4 Kxd5 Rf4 Ke5 g3 g5 Rf7 Ke6 Bxg4+ Kxf7 Bxd7 Ke7 Bxh3 Kf6 Bc8 Bd4 Bxb7
naum Qxd5 -> 1176 ( 58.80%) (V: 54.99%) (N: 0.07%) PV: Qxd5 dxe8=Q Nxf3+ Kg2 Nxd2+ cxd5 Rxe8 Rd1 Nc4 Kf3 Nd6 Kxf4 Re2 a4 c4 Ne3 Rxf2+ Ke5 Rf6 Kd4
scorpio Rexe7 -> 1226 ( 61.21%) (V: 64.01%) (N: 0.07%) PV: Rexe7 Rxe7 Rxd8+ Kg7 d6 Re1+ Kh2 Qc1 Rd7 Rh1+ Kg3 Qg5+ Qg4 Qe5+ Qf4 Qxc3+ Kh4 Qf6+ Qxf6+ Kxf6 Ra7 Rd1 d7 Rd4+ Kg3 a4 Kf3 Ke7 Ke3 Rxd7
protector Nf5+ -> 782 ( 39.08%) (V: 54.66%) (N: 0.07%) PV: Nf5+ Kf8 Nd6 Nc6 f4 Bc5+ Kh1 Bxd6 exd6 Bd7 Rad1 Rd8 Bb5 a6 Bxc6 Bxc6 f5
vajolet Nxg4 -> 40 ( 1.58%) (V: 40.04%) (N: 0.16%) PV: Nxg4 Bxg4 Qxb5 f6 Bxf6 Bxc8 Rxc8 Qg2 Nd7 Qh3 Rc2 Qe6+ Kh8 Qe8+ Nf8
cheng Bf4+ -> 708 ( 35.40%) (V: 48.03%) (N: 0.45%) PV: Bf4+ Kb1 Qg4 Rhg1 Qf5 Ne2 Nd5 Nd4 Qe5 Nf3 Qf5 Nh4 Qe6 Nc5 Qd6 Ne4 Qc6
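The forced-exploration probe used above can be approximated with a toy one-ply PUCT search (not lc0 code; the fixed `values` below stand in for each move's average backed-up evaluation):

```python
import math

def search_with_floor(priors, values, budget=800, floor=10, cpuct=3.1):
    """Toy one-ply PUCT search: every move is first forced up to `floor`
    visits, then selection maximizes Q + U as usual. `values` are fixed
    stand-ins for the average backed-up evaluation of each subtree."""
    visits = [0] * len(priors)
    for _ in range(budget):
        pending = [i for i, n in enumerate(visits) if n < floor]
        if pending:
            i = pending[0]                      # pay the visit floor first
        else:
            parent = sum(visits)
            i = max(range(len(priors)),
                    key=lambda j: values[j] + cpuct * priors[j]
                    * math.sqrt(parent) / (1 + visits[j]))
        visits[i] += 1
    return visits

# A 0.15%-prior tactic (last move) whose subtree actually evaluates well:
# once the floor pays for the first visits, its Q keeps attracting search.
print(search_with_floor([0.60, 0.25, 0.1485, 0.0015], [0.05, 0.0, -0.1, 0.6]))
```

Once the floor has paid for the first few visits, a move with a clearly better evaluation keeps winning the Q+U comparison and soaks up most of the remaining budget, which is what the forced 10-of-800 and 20-of-2000 probes show.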
So the network, with its value and policy, could generate valuable training data in almost all the cases if search were biased the right way. It's unclear if this is a limitation of the size of the network.
I'm not sure yet whether it's concerning that the network generates quite a bit less average training compared to the default training in the previous comment.
Here's a look at the prior progression for moves from Scorpio vs id359 Round 2 from https://github.com/LeelaChessZero/lc0/issues/8#issuecomment-394565160 using the lc0 test networks:
For reference, here's the board and moves id359 would have found when forced:
position startpos moves g1f3 d7d5 e2e3 c7c5 d2d4 g8f6 c2c4 c5d4 e3d4 b8c6 c4d5 f6d5 b1c3 g7g6 f1c4 d5b6 c4b3 f8g7 e1g1 e8g8 d4d5 c6a5 f1e1 a5b3 a2b3 c8g4 h2h3 g4f3 d1f3 f8e8 c1e3 g7c3 e3b6 d8b6 b2c3 b6b3 a1b1 b3a3 b1b7 a7a5 b7d7 a8d8
info string Rb7 -> 11 ( 1.37%) (V: 46.50%) (N: 10.25%) PV: Rb7 Qc5 Rd1 Rb8 Rdb1 Rxb7
info string Rxd8 -> 13 ( 1.62%) (V: 41.82%) (N: 15.61%) PV: Rxd8 Rxd8 Qe3 Rxd5 c4 Qxe3 Rxe3
info string Rc7 -> 26 ( 3.25%) (V: 47.32%) (N: 21.27%) PV: Rc7 a4 Qe4 Ra8 c4 Qc3
info string Ra7 -> 44 ( 5.49%) (V: 41.85%) (N: 48.93%) PV: Ra7 Qc5 Qe3 Rxd5 Qxc5 Rxc5 Rexe7 Rxe7 Rxe7 Rxc3 Ra7 Ra3 g3 a4 Kg2 Kg7
info string Rexe7 -> 360 ( 44.94%) (V: 61.59%) (N: 0.07%) PV: Rexe7 Qxe7 Rxe7 Rxe7 c4 Rc7 Qf6 Rcc8 d6 a4 Qd4 a3 c5
I analyzed id396 from @scs-ben compared to id395 for these positions, and some have significantly better average tactics training even though the same training data resulted in similar priors for the correct move for both networks. The main difference is that the value evaluation for the expected move is generally favorable, which drives more visits during search.
The Scorpio game in the previous comment shows the new network believes the move is winning (V: 3.05%) whereas the previous network thinks it's losing (V: -38.69%), so even though the priors are very low at around 0.2% for both, id396's value results in 60.3% training, up from 18.0%!
@killerducky is this change in value expected from 50-normalization? It should definitely help generate better training data in some cases! 👍
game | 395 train | 395 V | 395 P | 396 train | 396 V | 396 P |
---|---|---|---|---|---|---|
sctr | 19.2% | -46.53% | 0.33% | 43.2% | -4.32% | 0.91% |
wasp | 37.9% | 38.19% | 0.16% | 69.5% | 13.13% | 2.53% |
exchess | 33.8% | -10.42% | 0.58% | 39.4% | 22.87% | 0.22% |
hakka | 56.0% | -25.54% | 6.09% | 100.0% | -3.53% | 69.51% |
ice | 34.9% | -34.94% | 0.20% | 50.0% | -17.78% | 0.45% |
bobcat | 52.3% | 43.05% | 0.75% | 56.2% | 39.69% | 0.26% |
houdini | 6.9% | -53.02% | 1.02% | 3.2% | -51.08% | 0.29% |
naum | 17.1% | -84.85% | 0.11% | 12.1% | -79.72% | 0.63% |
scorpio | 18.0% | -38.69% | 0.15% | 60.3% | 3.05% | 0.26% |
protector | 4.3% | -44.56% | 1.89% | 3.0% | -43.12% | 0.55% |
vajolet2 | 1.3% | -67.63% | 2.56% | 9.7% | -55.49% | 1.69% |
cheng | 20.8% | -44.04% | 1.78% | 4.2% | -59.30% | 8.54% |
For reference in the Scorpio game, here's id396 V for each move showing two moves have positive V:
info string f3f5 (589 ) N: 2 (+ 0) (V: -89.32%) (P: 1.01%) (Q: -0.91407) (U: 0.30353) (Q+U: -0.61053)
info string e1e3 (111 ) N: 2 (+ 0) (V: -64.44%) (P: 0.90%) (Q: -0.72748) (U: 0.27000) (Q+U: -0.45748)
info string e1e4 (115 ) N: 2 (+ 0) (V: -66.91%) (P: 0.37%) (Q: -0.73500) (U: 0.11179) (Q+U: -0.62322)
info string d7e7 (1480) N: 2 (+ 0) (V: -14.48%) (P: 1.97%) (Q: -0.42458) (U: 0.59003) (Q+U: 0.16545)
info string d7d6 (1474) N: 2 (+ 0) (V: -83.71%) (P: 0.11%) (Q: -0.83561) (U: 0.03143) (Q+U: -0.80418)
info string e1e5 (118 ) N: 2 (+ 0) (V: -67.00%) (P: 0.48%) (Q: -0.73867) (U: 0.14478) (Q+U: -0.59389)
info string d5d6 (1007) N: 2 (+ 0) (V: -46.27%) (P: 0.80%) (Q: -0.59301) (U: 0.23789) (Q+U: -0.35512)
info string e1e6 (119 ) N: 2 (+ 0) (V: -71.98%) (P: 1.23%) (Q: -0.77920) (U: 0.36941) (Q+U: -0.40979)
info string f3d1 (565 ) N: 2 (+ 0) (V: -71.51%) (P: 0.25%) (Q: -0.74729) (U: 0.07373) (Q+U: -0.67356)
info string f3e2 (571 ) N: 2 (+ 0) (V: -62.67%) (P: 2.14%) (Q: -0.69184) (U: 0.63923) (Q+U: -0.05261)
info string f3e4 (583 ) N: 2 (+ 0) (V: -70.92%) (P: 0.21%) (Q: -0.75161) (U: 0.06385) (Q+U: -0.68776)
info string f3h5 (591 ) N: 2 (+ 0) (V: -90.01%) (P: 0.20%) (Q: -0.91669) (U: 0.06100) (Q+U: -0.85569)
info string e1f1 (101 ) N: 2 (+ 0) (V: -68.73%) (P: 0.14%) (Q: -0.74611) (U: 0.04294) (Q+U: -0.70317)
info string f3d3 (578 ) N: 2 (+ 0) (V: -71.21%) (P: 0.08%) (Q: -0.75146) (U: 0.02537) (Q+U: -0.72608)
info string f3e3 (579 ) N: 2 (+ 0) (V: -56.49%) (P: 0.21%) (Q: -0.66993) (U: 0.06261) (Q+U: -0.60732)
info string f3g3 (580 ) N: 2 (+ 0) (V: -67.80%) (P: 0.20%) (Q: -0.73818) (U: 0.05993) (Q+U: -0.67825)
info string f3f7 (595 ) N: 2 (+ 0) (V: -59.11%) (P: 0.13%) (Q: -0.75464) (U: 0.03975) (Q+U: -0.71488)
info string f3f6 (593 ) N: 2 (+ 0) (V: -79.17%) (P: 0.60%) (Q: -0.86619) (U: 0.17863) (Q+U: -0.68756)
info string e1d1 (100 ) N: 2 (+ 0) (V: -66.71%) (P: 0.14%) (Q: -0.73520) (U: 0.04126) (Q+U: -0.69394)
info string f3f4 (584 ) N: 2 (+ 0) (V: -68.33%) (P: 0.42%) (Q: -0.74719) (U: 0.12697) (Q+U: -0.62022)
info string c3c4 (485 ) N: 2 (+ 0) (V: -38.13%) (P: 2.11%) (Q: -0.54246) (U: 0.63084) (Q+U: 0.08838)
info string g2g4 (378 ) N: 2 (+ 0) (V: -70.85%) (P: 0.25%) (Q: -0.76396) (U: 0.07510) (Q+U: -0.68886)
info string g2g3 (374 ) N: 2 (+ 0) (V: -73.14%) (P: 0.45%) (Q: -0.77798) (U: 0.13459) (Q+U: -0.64339)
info string g1h2 (157 ) N: 2 (+ 0) (V: -70.19%) (P: 0.08%) (Q: -0.76420) (U: 0.02368) (Q+U: -0.74052)
info string g1h1 (153 ) N: 2 (+ 0) (V: -70.44%) (P: 1.54%) (Q: -0.76120) (U: 0.45974) (Q+U: -0.30146)
info string g1f1 (152 ) N: 2 (+ 0) (V: -72.05%) (P: 1.11%) (Q: -0.77256) (U: 0.33322) (Q+U: -0.43934)
info string e1a1 (97 ) N: 2 (+ 0) (V: -74.85%) (P: 0.07%) (Q: -0.82456) (U: 0.02079) (Q+U: -0.80377)
info string e1b1 (98 ) N: 2 (+ 0) (V: -65.31%) (P: 0.09%) (Q: -0.72224) (U: 0.02801) (Q+U: -0.69423)
info string e1c1 (99 ) N: 2 (+ 0) (V: -80.15%) (P: 0.12%) (Q: -0.85486) (U: 0.03508) (Q+U: -0.81978)
info string e1e2 (106 ) N: 3 (+ 0) (V: -71.59%) (P: 4.33%) (Q: -0.76985) (U: 0.97152) (Q+U: 0.20167)
info string h3h4 (642 ) N: 4 (+ 0) (V: -68.01%) (P: 5.55%) (Q: -0.75801) (U: 0.99557) (Q+U: 0.23756)
info string f3g4 (585 ) N: 9 (+ 1) (V: -14.25%) (P: 3.15%) (Q: 0.00519) (U: 0.25699) (Q+U: 0.26218)
info string d7d8 (1486) N: 22 (+ 1) (V: 13.70%) (P: 9.31%) (Q: -0.09348) (U: 0.34820) (Q+U: 0.25472)
info string d7c7 (1479) N: 36 (+ 2) (V: -8.47%) (P: 16.93%) (Q: -0.14014) (U: 0.38951) (Q+U: 0.24937)
info string d7b7 (1478) N: 51 (+ 0) (V: -4.70%) (P: 13.46%) (Q: -0.04031) (U: 0.23233) (Q+U: 0.19203)
info string d7a7 (1477) N: 63 (+ 4) (V: -10.19%) (P: 29.58%) (Q: -0.12718) (U: 0.39032) (Q+U: 0.26314)
info string e1e7 (120 ) N: 592 (+118) (V: 3.05%) (P: 0.26%) (Q: 0.23203) (U: 0.00033) (Q+U: 0.23236)
Whereas with id395, only one move has positive V:
info string f3f5 (589 ) N: 2 (+ 0) (V: -94.07%) (P: 0.09%) (Q: -0.94612) (U: 0.02797) (Q+U: -0.91815)
info string e1e3 (111 ) N: 2 (+ 0) (V: -68.33%) (P: 0.12%) (Q: -0.74048) (U: 0.03665) (Q+U: -0.70383)
info string e1e4 (115 ) N: 2 (+ 0) (V: -72.54%) (P: 0.03%) (Q: -0.76590) (U: 0.00885) (Q+U: -0.75705)
info string d7e7 (1480) N: 2 (+ 0) (V: -46.42%) (P: 0.31%) (Q: -0.61692) (U: 0.09210) (Q+U: -0.52482)
info string d7d6 (1474) N: 2 (+ 0) (V: -87.56%) (P: 0.06%) (Q: -0.86254) (U: 0.01934) (Q+U: -0.84320)
info string e1e5 (118 ) N: 2 (+ 0) (V: -72.77%) (P: 0.21%) (Q: -0.77803) (U: 0.06221) (Q+U: -0.71582)
info string d5d6 (1007) N: 2 (+ 0) (V: -41.78%) (P: 0.30%) (Q: -0.59649) (U: 0.09057) (Q+U: -0.50592)
info string h3h4 (642 ) N: 2 (+ 0) (V: -67.54%) (P: 1.28%) (Q: -0.72365) (U: 0.38191) (Q+U: -0.34174)
info string f3d1 (565 ) N: 2 (+ 0) (V: -76.02%) (P: 0.53%) (Q: -0.79143) (U: 0.15995) (Q+U: -0.63148)
info string f3e2 (571 ) N: 2 (+ 0) (V: -70.15%) (P: 0.11%) (Q: -0.75128) (U: 0.03408) (Q+U: -0.71720)
info string f3e4 (583 ) N: 2 (+ 0) (V: -67.89%) (P: 0.27%) (Q: -0.74062) (U: 0.07949) (Q+U: -0.66113)
info string f3h5 (591 ) N: 2 (+ 0) (V: -88.92%) (P: 0.19%) (Q: -0.91880) (U: 0.05676) (Q+U: -0.86204)
info string e1e6 (119 ) N: 2 (+ 0) (V: -72.22%) (P: 0.23%) (Q: -0.77462) (U: 0.06997) (Q+U: -0.70466)
info string f3d3 (578 ) N: 2 (+ 0) (V: -74.92%) (P: 1.75%) (Q: -0.78349) (U: 0.52235) (Q+U: -0.26115)
info string e1d1 (100 ) N: 2 (+ 0) (V: -69.40%) (P: 0.10%) (Q: -0.75805) (U: 0.02927) (Q+U: -0.72878)
info string f3g3 (580 ) N: 2 (+ 0) (V: -76.35%) (P: 0.16%) (Q: -0.78368) (U: 0.04775) (Q+U: -0.73594)
info string f3f7 (595 ) N: 2 (+ 0) (V: -73.75%) (P: 0.60%) (Q: -0.83331) (U: 0.17828) (Q+U: -0.65503)
info string f3f6 (593 ) N: 2 (+ 0) (V: -79.65%) (P: 0.07%) (Q: -0.87256) (U: 0.02183) (Q+U: -0.85074)
info string e1e2 (106 ) N: 2 (+ 0) (V: -72.63%) (P: 0.45%) (Q: -0.76863) (U: 0.13469) (Q+U: -0.63394)
info string f3f4 (584 ) N: 2 (+ 0) (V: -67.04%) (P: 0.16%) (Q: -0.72483) (U: 0.04779) (Q+U: -0.67704)
info string e1c1 (99 ) N: 2 (+ 0) (V: -89.23%) (P: 0.78%) (Q: -0.91270) (U: 0.23458) (Q+U: -0.67812)
info string g2g4 (378 ) N: 2 (+ 0) (V: -69.55%) (P: 0.09%) (Q: -0.76088) (U: 0.02782) (Q+U: -0.73306)
info string g2g3 (374 ) N: 2 (+ 0) (V: -74.27%) (P: 0.38%) (Q: -0.78336) (U: 0.11372) (Q+U: -0.66964)
info string g1h2 (157 ) N: 2 (+ 0) (V: -75.21%) (P: 0.06%) (Q: -0.78616) (U: 0.01766) (Q+U: -0.76850)
info string g1h1 (153 ) N: 2 (+ 0) (V: -75.01%) (P: 0.60%) (Q: -0.78633) (U: 0.17805) (Q+U: -0.60828)
info string g1f1 (152 ) N: 2 (+ 0) (V: -75.11%) (P: 1.99%) (Q: -0.79003) (U: 0.59606) (Q+U: -0.19397)
info string e1a1 (97 ) N: 2 (+ 0) (V: -81.70%) (P: 3.31%) (Q: -0.87187) (U: 0.98900) (Q+U: 0.11714)
info string e1b1 (98 ) N: 2 (+ 0) (V: -69.44%) (P: 0.52%) (Q: -0.75769) (U: 0.15536) (Q+U: -0.60232)
info string c3c4 (485 ) N: 3 (+ 0) (V: -27.98%) (P: 0.11%) (Q: -0.40043) (U: 0.02501) (Q+U: -0.37541)
info string f3e3 (579 ) N: 3 (+ 0) (V: -62.50%) (P: 4.24%) (Q: -0.74483) (U: 0.95081) (Q+U: 0.20598)
info string f3g4 (585 ) N: 3 (+ 0) (V: -24.01%) (P: 0.92%) (Q: -0.10577) (U: 0.20585) (Q+U: 0.10008)
info string e1f1 (101 ) N: 4 (+ 0) (V: -74.66%) (P: 6.08%) (Q: -0.81843) (U: 1.09180) (Q+U: 0.27337)
info string d7d8 (1486) N: 17 (+ 0) (V: -5.86%) (P: 11.04%) (Q: -0.26225) (U: 0.55060) (Q+U: 0.28835)
info string d7b7 (1478) N: 43 (+ 0) (V: -3.24%) (P: 8.22%) (Q: -0.05280) (U: 0.16761) (Q+U: 0.11482)
info string d7c7 (1479) N: 61 (+ 0) (V: 5.63%) (P: 19.26%) (Q: -0.09869) (U: 0.27884) (Q+U: 0.18016)
info string d7a7 (1477) N: 87 (+ 0) (V: -4.53%) (P: 35.05%) (Q: -0.08299) (U: 0.35741) (Q+U: 0.27442)
info string e1e7 (120 ) N: 561 (+55) (V: -38.69%) (P: 0.32%) (Q: 0.28310) (U: 0.00046) (Q+U: 0.28356)
I wouldn't say it's directly expected. But r50 was breaking the net, so with it fixed hopefully the net will fix other things too, or we will find the next problem.
There looks to be quite a bit of difference between id401 and id402. I'm surprised at how much the value can change in just one network.
game | 401 train | 401 V | 401 P | 402 train | 402 V | 402 P |
---|---|---|---|---|---|---|
sctr | 23.3% | -39.33% | 0.81% | 50.2% | 9.49% | 1.50% |
wasp | 55.5% | 23.36% | 0.55% | 70.7% | 46.39% | 1.31% |
exchess | 62.4% | -7.87% | 1.09% | 40.3% | 52.42% | 0.21% |
hakka | 100.0% | -17.70% | 71.05% | 100.0% | -11.81% | 68.55% |
ice | 30.2% | -19.84% | 0.29% | 47.2% | 19.62% | 0.81% |
bobcat | 74.5% | 21.60% | 0.36% | 69.8% | 54.93% | 6.07% |
houdini | 0.4% | -42.82% | 0.68% | 1.6% | -41.09% | 0.22% |
naum | 19.8% | -72.85% | 2.93% | 37.9% | -37.60% | 0.89% |
scorpio | 29.1% | -13.97% | 1.21% | 41.7% | -19.40% | 0.30% |
protector | 1.2% | -57.93% | 1.65% | 3.1% | -37.20% | 1.44% |
vajolet2 | 1.7% | -74.91% | 1.87% | 7.6% | -45.64% | 3.04% |
cheng | 2.6% | -72.67% | 0.97% | 5.3% | -50.46% | 0.26% |
Here's the progression of priors for the 3 lc0 tests so far. It looks like after the learning rate change to 0.01 for test 3, the max change in prior was reduced, with id250 averaging 7%, but then after id304 the changes jumped up to an average of 23%.
Here's the behavior of test 3 searching for the best move in each of the board positions from this issue (i.e., load the position, then `go nodes 800` and see how many visits out of 800 it gets without smart pruning):
It looks like it pretty solidly learned two (hakkapeliitta and bobcat) and has conflicted learning on two others (houdini and ice). There were brief blips of learning and then forgetting sctr, exchess, and scorpio. It never really considered wasp, naum, protector, vajolet, and cheng.
I would guess the conflicted learning happens because the NN sees the position as similar to other positions, so it's constantly training towards two or more different "correct" moves. I'm not sure whether a larger network that could differentiate the positions better would address this.
Here's the same analysis for Test 1 also learning hakkapeliitta and bobcat; briefly exploring houdini and ice; and none of the others after the initial noise:
And Test 2 also learned hakkapeliitta and bobcat; conflicted for sctr (and maybe scorpio at the end?); and none of the others after the initial noise:
For reference, id395 and later main networks after 50-normalization have learned only hakkapeliitta and none of the others.
Edit: Test 4 (?? normally numbered 1-57, but I added 500):
Edit: Test 8 including value-only/policy-less search as dotted lines:
Rerunning the original "SCTR" position with 11089 with varying visits (no noise, no softmax, no aversion):
800: info string a4h4 (666 ) N: 0 (+ 0) (P: 0.62%) (Q: -1.11990) (U: 0.59557) (Q+U: -0.52433) (V: -.----)
1600: info string a4h4 (666 ) N: 0 (+ 0) (P: 0.62%) (Q: -1.12639) (U: 0.84253) (Q+U: -0.28386) (V: -.----)
3200: info string a4h4 (666 ) N: 903 (+ 1) (P: 0.62%) (Q: 0.68225) (U: 0.00132) (Q+U: 0.68357) (V: 0.1151)
6400: info string a4h4 (666 ) N: 4102 (+ 1) (P: 0.62%) (Q: 0.65462) (U: 0.00041) (Q+U: 0.65503) (V: 0.1151)
Those would estimate average policy training from the existing 0.62% to: 0%, 0%, 28%, 64%
And with "Wasp" position:
800: info string e6e3 (560 ) N: 146 (+ 1) (P: 1.15%) (Q: 0.61700) (U: 0.00748) (Q+U: 0.62448) (V: 0.3566)
1600: info string e6e3 (560 ) N: 941 (+ 1) (P: 1.15%) (Q: 0.57949) (U: 0.00166) (Q+U: 0.58115) (V: 0.3566)
3200: info string e6e3 (560 ) N: 2528 (+ 1) (P: 1.15%) (Q: 0.58814) (U: 0.00088) (Q+U: 0.58901) (V: 0.3566)
6400: info string e6e3 (560 ) N: 5695 (+ 1) (P: 1.15%) (Q: 0.61024) (U: 0.00055) (Q+U: 0.61079) (V: 0.3566)
Similarly increasing 1.15% prior towards: 18%, 59%, 79%, 89%.
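These estimates are just the move's share of root visits, since a move's policy training target is the fraction of visits it receives:

```python
def visit_share(move_visits, total_visits):
    """Policy training target contributed by a search: the move's visit share."""
    return move_visits / total_visits

# SCTR a4h4 at 800 / 1600 / 3200 / 6400 total visits (N = 0, 0, 903, 4102):
print([round(visit_share(n, t), 2)
       for n, t in [(0, 800), (0, 1600), (903, 3200), (4102, 6400)]])
# → [0.0, 0.0, 0.28, 0.64], matching the 0%, 0%, 28%, 64% estimates
```

The Wasp numbers (N = 146, 941, 2528, 5695) reproduce the 18%, 59%, 79%, 89% progression the same way.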
For reference, here's the other top visited moves at 6400:
SCTR
info string a4c4 (661 ) N: 333 (+ 0) (P: 32.39%) (Q: -0.32429) (U: 0.26372) (Q+U: -0.06057) (V: 0.0365)
info string a6c4 (1148) N: 341 (+ 0) (P: 13.26%) (Q: -0.17237) (U: 0.10547) (Q+U: -0.06690) (V: 0.0656)
info string a6d3 (1145) N: 1161 (+ 0) (P: 17.40%) (Q: -0.19576) (U: 0.04073) (Q+U: -0.15503) (V: 0.0415)
info string a4h4 (666 ) N: 4102 (+ 1) (P: 0.62%) (Q: 0.65462) (U: 0.00041) (Q+U: 0.65503) (V: 0.1151)
Wasp
info string g7f6 (373 ) N: 5 (+ 0) (P: 1.01%) (Q: 0.12200) (U: 0.45651) (Q+U: 0.57851) (V: 0.1824)
info string g7h8 (364 ) N: 125 (+ 0) (P: 21.82%) (Q: 0.13783) (U: 0.47100) (Q+U: 0.60883) (V: 0.2042)
info string c3c2 (1219) N: 569 (+ 0) (P: 72.29%) (Q: 0.26552) (U: 0.34494) (Q+U: 0.61046) (V: 0.3406)
info string e6e3 (560 ) N: 5695 (+ 1) (P: 1.15%) (Q: 0.61024) (U: 0.00055) (Q+U: 0.61079) (V: 0.3566)
At least for these tactical positions, where the other moves are significantly worse than the one correct play, increasing visits allows MCTS to eventually spend enough visits on the higher-prior moves to rule them out, and then find the hidden tactics.
So instead of adjusting noise in various ways, simply doubling visits should lead to significantly more visits to the correct move, and consequently rapidly increase the prior training above the noise threshold.
(Increasing visits improves the policy head while keeping the existing noise settings, and it also improves the value head while keeping the existing temperature, without needing #237.)
I reran the positions with 11089, and things definitely seem better than before, finding 6 of the 12 correct tactical moves with self-play settings and 800 visits.
```
./lc0 -w idlc0-11089 --verbose-move-stats --policy-softmax-temp=1 --cpuct=1.2 --minibatch-size=1 --futile-search-aversion=0
```
SCTR
position startpos moves d2d4 d7d5 c1f4 g7g6 e2e3 g8f6 c2c4 c7c5 d4c5 f8g7 b1c3 d8a5 c4d5 f6d5 d1d5 g7c3 b2c3 a5c3 e1e2 c3a1 f4e5 a1b1 e5h8 c8e6 d5d3 b1a2 e2f3 f7f6 h8g7 b8d7 f3g3 a8c8 c5c6 c8c6 d3d4 c6d6 d4b4 d6b6 b4h4 d7c5 h2h3 b6b2 g1e2 a2d5 g3h2 d5e5 e2g3 h7h5 h4d4 e5d4 e3d4 c5b3 g7h6 h5h4 g3e4 g6g5 f1d3 b3d4 h1a1 a7a6 e4c5 b2f2 d3e4 e6f5 e4b7 f2c2 a1a4 d4e2 c5e4 f5e4 b7e4 c2c1 e4d3 e2f4 d3a6 f4h5
info string a4h4 (666 ) N: 0 (+ 0) (P: 0.62%) (Q: -0.82955) (U: 0.21020) (Q+U: -0.61935) (V: -.----)
Wasp
position startpos moves e2e4 g8f6 e4e5 f6d5 c2c4 d5b6 d2d4 d7d6 e5d6 e7d6 g1f3 c8g4 f1e2 f8e7 h2h3 g4f3 e2f3 b8c6 b1a3 e8g8 e1g1 f8e8 b2b3 c6d4 f3b7 a8b8 d1d4 e7f6 d4d1 f6a1 b7c6 e8e6 c6f3 a7a6 a3c2 a1c3 d1d3 d8f6 c2e3 f6e5 f1d1 g7g6 h3h4 e5a5 d3c2 c3g7 h4h5 a5c3 h5h6
info string e6e3 (560 ) N: 0 (+ 0) (P: 1.15%) (Q: -0.65881) (U: 0.39077) (Q+U: -0.26804) (V: -.----)
EXchess
position startpos moves g1f3 g8f6 g2g3 e7e6 f1g2 f8e7 c2c4 d7d5 e1g1 e8g8 d2d4 d5c4 f3e5 c7c5 d4c5 d8c7 e5c4 c7c5 b2b3 f8d8 b1d2 c5c7 c1b2 b8c6 a1c1 a8b8 a2a3 f6d5 b3b4 b7b5 c4a5 c8b7 c1c2 e7f8 d1b1 b8c8 f1c1 c7d7 a5b7 d7b7 d2b3 a7a6 e2e3 b7d7 c2d2 d7e7 h2h4 e7b7 b3c5 b7a8 c5e4 h7h6 e4c5 a6a5 c5e6 f7e6 b1g6 d8d6 g2e4 c6e7 g6h7 g8f7 c1d1 a5b4 e4f3
info string d5f6 (751 ) N: 154 (+ 1) (P: 2.55%) (Q: 0.16138) (U: 0.00555) (Q+U: 0.16693) (V: 0.3750)
Hakkapeliitta
position startpos moves e2e4 c7c5 g1f3 e7e6 d2d4 c5d4 f3d4 b8c6 b1c3 g8f6 d4c6 b7c6 e4e5 f6d5 c3e4 d8c7 f2f4 c7b6 a2a3 f8e7 c2c4 d5e3 d1d3 e3f1 h1f1 c6c5 f1f2 f7f5 e4d6 e7d6 d3d6 b6d6 e5d6 e8f7 b2b4 c8a6 b4b5 a6b7 a3a4 h7h5 a4a5 h5h4 a5a6 b7e4 c1e3 h4h3 g2g3 h8c8 f2a2 a8b8 a1c1 f7g6 e1f1 g6h5 f1f2 h5g4 a2e2 b8b6 e2d2 e4f3 c1c3 g7g6 c3c1 f3g2 c1c3 g2f3 c3a3 f3e4 a3a1 e4f3 a1c1 f3g2 c1a1 g2e4 a1a3 e4f3 a3c3 f3e4 f2g1 e4f3 c3d3 b6b8 d3c3 f3e4 g1f2 b8b6 f2e2 e4g2 e3g1 g2e4 e2e1 e4f3 d2d3 f3g2 e1e2 g2e4 d3d2 b6b8 d2a2 b8b6 a2d2 b6b8 g1e3 b8b6 c3c1 e4f3 e2f1 b6b8 f1f2 b8b6 c1e1 f3e4 e1d1 e4f3 d1a1 b6b8 f2g1 f3e4 g1f2 b8b6 a1a2 e4f3 a2b2 b6b8 b2b3 b8b6 b3d3 f3e4 d3b3 e4f3 b3b1 f3e4 b1e1 e4f3 f2g1 f3e4 e1f1 e4g2 f1e1 g2f3 g1f2 f3e4 e1f1 e4f3 f2g1 f3e4 g1f2 e4f3 f1g1 f3e4 g1d1 e4f3 f2e1
info string f3d1 (1321) N: 799 (+ 1) (P: 92.58%) (Q: -0.02764) (U: 0.03920) (Q+U: 0.01156) (V: 0.1661)
iCE
position startpos moves e2e4 c7c6 g1f3 d7d5 e4e5 c6c5 f1e2 b8c6 e1g1 c8g4 c2c4 d5c4 b1a3 e7e6 a3c4 f8e7 d2d3 g8h6 c1h6 g7h6 d1d2 h6h5 d2f4 h8g8 f1e1 d8d7 a1d1 e8c8 f4f7 h7h6 f7h7 h5h4 h7h6 d8f8 h6h7 c8b8 c4e3 g4f3 e2f3 c6e5 f3e4 e5f7 f2f4 d7c7 e1f1 e7f6 d1e1 f7d6 h7c7 b8c7 b2b3 f6d4 g1h1 b7b5 e1e2 a7a5 e3c2 d4b2 e4f3 c7d7 c2e3 b2d4 a2a4 b5a4 b3a4 f8f4
info string f3c6 (592 ) N: 746 (+ 1) (P: 10.44%) (Q: 0.13812) (U: 0.00474) (Q+U: 0.14285) (V: 0.1866)
Bobcat
position startpos moves d2d4 d7d5 g1f3 c7c6 c2c4 g8f6 b1c3 d5c4 a2a4 c8f5 e2e3 e7e6 f1c4 b8d7 d1b3 d8b6 a4a5 b6b3 c4b3 f5d3 b3d1 f8d6 d1e2 d3g6 e1g1 e8g8 c1d2 h7h6 f1c1 a7a6 c3a4 f6e4 d2e1 f8e8 g1f1 a8d8 f3d2 e4d2 e1d2 e6e5 d4e5 d7e5 d2c3 e5d7 c1d1 d6e7 a1c1 d7f6 c3d4 f6d7 d4c3 d7f6 c3d4 f6d7 h2h3 g6f5 e2d3 f5e6 d3c4 e6f5 f2f3 c6c5 d4c3 e7g5 g2g4 f5e6 c4e6 e8e6 f3f4 g5e7 f1e2 e6c6 d1d5 c6d6 d5d6 e7d6 c1d1 d6e7 b2b3 f7f6 h3h4 g8f7 h4h5 f7e8 e3e4 d8c8 e2d3 c8c6 d3c4 c6e6 d1e1 e6c6 e4e5 f6e5 f4e5 d7f8 a4b6 f8e6 c4d5 e6c7 d5e4 e8f7 e4f5 g7g6 f5e4 c7b5 c3d2 b5d4 e1b1 d4e2 b1f1 f7e8 f1f3 g6h5 g4h5 e2d4 f3g3 e7f8 b6c4 e8f7 e4d5 d4b5 g3d3 f7e8 d2e3 c6c7 c4d6 b5d6 e5d6 f8d6
info string e3h6 (561 ) N: 563 (+ 0) (P: 11.17%) (Q: 0.33399) (U: 0.00672) (Q+U: 0.34071) (V: 0.6047)
Houdini
position startpos moves d2d4 e7e6 c2c4 f8b4 c1d2 b4e7 e2e4 d7d5 e4e5 c7c5 d1g4 e7f8 d4c5 h7h5 g4g3 h5h4 g3a3 b8d7 g1f3 f8c5 b2b4 c5b6 d2g5 g8e7 a3b2 h8h5 c4d5 e6d5 f1b5 e8f8 e1g1 d7e5 b2e5 f7f6 e5f4 b6c7 f4e3 f6g5 b1c3 d8d6 b5d3 c7b6 e3e2 h4h3 f1e1 g5g4 f3e5 h5g5 e5g6 g5g6 d3g6 c8d7 g6h5 a8c8 a1c1 c8c4
info string e2e7 (330 ) N: 492 (+ 1) (P: 4.66%) (Q: 0.76301) (U: 0.00320) (Q+U: 0.76621) (V: -0.6224)
Naum
position startpos moves d2d4 f7f5 g1f3 e7e6 g2g3 b8c6 f1g2 g8f6 e1g1 d7d5 c2c4 d5c4 d1a4 c8d7 a4c4 f8d6 b1c3 e8g8 c1g5 h7h6 g5f6 d8f6 e2e4 c6a5 c4e2 f6g6 a1d1 g6h5 e4e5 d6b4 d4d5 b4c3 b2c3 a8d8 f1e1 c7c5 c3c4 h5g4 d1c1 f5f4 h2h3 g4g6 g3g4 h6h5 f3h2 h5g4 h2g4 g6g5 g1h2 d8e8 g2f3 g8h8 e1g1 g5h4 e2d2 b7b6 c1c3 e6d5 f3d5 h4h5 c3f3 a5c6 e5e6 c6d4 e6d7
info string h5d5 (882 ) N: 435 (+ 1) (P: 4.34%) (Q: -0.28057) (U: 0.00337) (Q+U: -0.27720) (V: -0.6316)
Scorpio
position startpos moves g1f3 d7d5 e2e3 c7c5 d2d4 g8f6 c2c4 c5d4 e3d4 b8c6 c4d5 f6d5 b1c3 g7g6 f1c4 d5b6 c4b3 f8g7 e1g1 e8g8 d4d5 c6a5 f1e1 a5b3 a2b3 c8g4 h2h3 g4f3 d1f3 f8e8 c1e3 g7c3 e3b6 d8b6 b2c3 b6b3 a1b1 b3a3 b1b7 a7a5 b7d7 a8d8
info string e1e7 (120 ) N: 726 (+ 1) (P: 17.35%) (Q: 0.19556) (U: 0.00808) (Q+U: 0.20365) (V: 0.1929)
Protector
position startpos moves d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 b1c3 d5c4 e2e4 f8b4 c1g5 h7h6 g5f6 d8f6 f1c4 c7c5 e1g1 c5d4 e4e5 f6d8 d1d4 d8d4 f3d4 e8e7
info string d4f5 (763 ) N: 0 (+ 0) (P: 0.65%) (Q: -0.87489) (U: 0.22101) (Q+U: -0.65388) (V: -.----)
Vajolet
position startpos moves e2e4 c7c5 g1e2 b8c6 d2d4 c5d4 e2d4 d7d6 c2c4 e7e5 d4c2 f8e7 b1c3 g8f6 f1e2 c8e6 e1g1 e8g8 b2b3 a8c8 c1e3 f6d7 d1d2 f7f5 e4f5 e6f5 a1d1 d8e8 e2d3 f5e6 f2f4 d7f6 f4f5 e6f7 d3e2 c8d8 c3d5 b7b6 g2g4 f7d5 c4d5 c6b8 c2a3 d8c8 a3b5
info string f6g4 (590 ) N: 109 (+ 0) (P: 11.63%) (Q: -0.40228) (U: 0.03585) (Q+U: -0.36642) (V: -0.2968)
Cheng
position startpos moves e2e4 e7e6 c2c4 d7d5 c4d5 e6d5 e4d5 g8f6 f1b5 c8d7 b5c4 d8e7 g1e2 e7e4 d2d3 e4g2 h1g1 g2h2 c1f4 h2h5 d1b3 b7b5 c4b5 f6d5 g1g5 h5h1 g5g1 h1h5 f4g5 d5b6 b1c3 f8d6 e1c1 a7a6 b5d7 b8d7 c3e4 e8g8 e2c3 d6h2 g1h1 d7e5 f2f4 e5f3 d1f1 f3g5 f4g5
info string h2f4 (1582) N: 0 (+ 0) (P: 0.07%) (Q: -1.50398) (U: 0.02522) (Q+U: -1.47876) (V: -.----)
If using the default match settings for cpuct and softmax, 11089 finds all except one:
11089 sctr info string a4h4 (666 ) N: 624 (+ 1) (P: 2.41%) (Q: 0.61425) (U: 0.00369) (Q+U: 0.61794) (V: 0.1151)
11089 wasp info string e6e3 (560 ) N: 660 (+ 1) (P: 4.07%) (Q: 0.61644) (U: 0.00591) (Q+U: 0.62235) (V: 0.3566)
11089 exchess info string d5f6 (751 ) N: 669 (+ 1) (P: 3.70%) (Q: 0.15048) (U: 0.00530) (Q+U: 0.15578) (V: 0.3750)
11089 hakkapeliitta info string f3d1 (1321) N: 699 (+ 1) (P: 36.16%) (Q: 0.00541) (U: 0.04957) (Q+U: 0.05498) (V: 0.1661)
11089 ice info string f3c6 (592 ) N: 693 (+ 1) (P: 8.62%) (Q: 0.25334) (U: 0.01193) (Q+U: 0.26526) (V: 0.1866)
11089 bobcat info string e3h6 (561 ) N: 550 (+ 0) (P: 11.58%) (Q: 0.45141) (U: 0.02020) (Q+U: 0.47161) (V: 0.6047)
11089 houdini info string e2e7 (330 ) N: 715 (+ 1) (P: 5.11%) (Q: 0.66406) (U: 0.00686) (Q+U: 0.67092) (V: -0.6224)
11089 naum info string h5d5 (882 ) N: 717 (+ 1) (P: 5.22%) (Q: 0.00369) (U: 0.00697) (Q+U: 0.01067) (V: -0.6316)
11089 scorpio info string e1e7 (120 ) N: 655 (+ 1) (P: 12.00%) (Q: 0.30091) (U: 0.01755) (Q+U: 0.31846) (V: 0.1929)
11089 protector info string d4f5 (763 ) N: 484 (+ 0) (P: 2.01%) (Q: 0.11955) (U: 0.00397) (Q+U: 0.12353) (V: -0.0408)
11089 vajolet info string f6g4 (590 ) N: 262 (+ 1) (P: 9.53%) (Q: -0.30616) (U: 0.03468) (Q+U: -0.27148) (V: -0.2968)
11089 cheng info string h2f4 (1582) N: 0 (+ 0) (P: 0.91%) (Q: -1.46775) (U: 0.87181) (Q+U: -0.59594) (V: -.----)
And here's the result with latest test20:
self-play settings
20633 sctr info string a4h4 (666 ) N: 0 (+ 0) (P: 0.46%) (Q: -1.26248) (U: 0.15508) (Q+U: -1.10740) (V: -.----)
20633 wasp info string e6e3 (560 ) N: 0 (+ 0) (P: 1.31%) (Q: -0.71427) (U: 0.44550) (Q+U: -0.26876) (V: -.----)
20633 exchess info string d5f6 (751 ) N: 0 (+ 0) (P: 1.63%) (Q: -1.29837) (U: 0.55381) (Q+U: -0.74456) (V: -.----)
20633 hakkapeliitta info string f3d1 (1321) N: 796 (+ 2) (P: 86.87%) (Q: 0.09656) (U: 0.03688) (Q+U: 0.13343) (V: 0.0620)
20633 ice info string f3c6 (592 ) N: 751 (+ 1) (P: 22.72%) (Q: 0.21161) (U: 0.01024) (Q+U: 0.22185) (V: 0.2895)
20633 bobcat info string e3h6 (561 ) N: 0 (+ 0) (P: 0.57%) (Q: -1.55809) (U: 0.19403) (Q+U: -1.36406) (V: -.----)
20633 houdini info string e2e7 (330 ) N: 0 (+ 0) (P: 1.46%) (Q: -0.88607) (U: 0.49545) (Q+U: -0.39062) (V: -.----)
20633 naum info string h5d5 (882 ) N: 0 (+ 0) (P: 0.39%) (Q: -1.54053) (U: 0.13308) (Q+U: -1.40745) (V: -.----)
20633 scorpio info string e1e7 (120 ) N: 0 (+ 0) (P: 1.11%) (Q: -0.96234) (U: 0.37770) (Q+U: -0.58464) (V: -.----)
20633 protector info string d4f5 (763 ) N: 0 (+ 0) (P: 0.51%) (Q: -0.84197) (U: 0.17235) (Q+U: -0.66962) (V: -.----)
20633 vajolet info string f6g4 (590 ) N: 0 (+ 0) (P: 1.11%) (Q: -0.97088) (U: 0.37576) (Q+U: -0.59512) (V: -.----)
20633 cheng info string h2f4 (1582) N: 0 (+ 0) (P: 0.39%) (Q: -1.20504) (U: 0.13392) (Q+U: -1.07112) (V: -.----)
match settings
20633 sctr info string a4h4 (666 ) N: 540 (+ 1) (P: 1.94%) (Q: 0.39817) (U: 0.00345) (Q+U: 0.40162) (V: -0.2688)
20633 wasp info string e6e3 (560 ) N: 394 (+ 0) (P: 3.25%) (Q: 0.29576) (U: 0.00791) (Q+U: 0.30368) (V: 0.4163)
20633 exchess info string d5f6 (751 ) N: 567 (+ 1) (P: 3.48%) (Q: 0.02776) (U: 0.00588) (Q+U: 0.03363) (V: 0.3233)
20633 hakkapeliitta info string f3d1 (1321) N: 687 (+ 1) (P: 30.92%) (Q: 0.08999) (U: 0.04313) (Q+U: 0.13312) (V: 0.0620)
20633 ice info string f3c6 (592 ) N: 618 (+ 1) (P: 10.49%) (Q: 0.13358) (U: 0.01626) (Q+U: 0.14984) (V: 0.2895)
20633 bobcat info string e3h6 (561 ) N: 476 (+ 1) (P: 3.95%) (Q: 0.36131) (U: 0.00795) (Q+U: 0.36925) (V: 0.4301)
20633 houdini info string e2e7 (330 ) N: 621 (+ 1) (P: 3.09%) (Q: 0.38665) (U: 0.00476) (Q+U: 0.39141) (V: -0.2205)
20633 naum info string h5d5 (882 ) N: 509 (+ 1) (P: 1.66%) (Q: -0.08083) (U: 0.00313) (Q+U: -0.07770) (V: -0.4912)
20633 scorpio info string e1e7 (120 ) N: 555 (+ 1) (P: 3.31%) (Q: 0.28537) (U: 0.00571) (Q+U: 0.29108) (V: -0.2923)
20633 protector info string d4f5 (763 ) N: 317 (+ 1) (P: 1.43%) (Q: 0.02249) (U: 0.00431) (Q+U: 0.02679) (V: -0.2807)
20633 vajolet info string f6g4 (590 ) N: 52 (+ 0) (P: 2.70%) (Q: -0.21902) (U: 0.04899) (Q+U: -0.17003) (V: -0.1830)
20633 cheng info string h2f4 (1582) N: 68 (+ 0) (P: 1.60%) (Q: -0.29140) (U: 0.02222) (Q+U: -0.26918) (V: -0.3185)
Interesting to see how different the initial network V can be from the searched Q in these positions.
Porting to lc0 of lczero issues https://github.com/glinscott/leela-chess/issues/698 and https://github.com/glinscott/leela-chess/issues/699 using the same game for analysis:
CCLS SCTR vs id359 game 1
Trying to find Rxh4 https://clips.twitch.tv/NimbleLazyNewtPRChase:

```
position startpos moves d2d4 d7d5 c1f4 g7g6 e2e3 g8f6 c2c4 c7c5 d4c5 f8g7 b1c3 d8a5 c4d5 f6d5 d1d5 g7c3 b2c3 a5c3 e1e2 c3a1 f4e5 a1b1 e5h8 c8e6 d5d3 b1a2 e2f3 f7f6 h8g7 b8d7 f3g3 a8c8 c5c6 c8c6 d3d4 c6d6 d4b4 d6b6 b4h4 d7c5 h2h3 b6b2 g1e2 a2d5 g3h2 d5e5 e2g3 h7h5 h4d4 e5d4 e3d4 c5b3 g7h6 h5h4 g3e4 g6g5 f1d3 b3d4 h1a1 a7a6 e4c5 b2f2 d3e4 e6f5 e4b7 f2c2 a1a4 d4e2 c5e4 f5e4 b7e4 c2c1 e4d3 e2f4 d3a6 f4h5
```

![screen shot 2018-05-31 at 10 38 50 am](https://user-images.githubusercontent.com/438537/40798095-1038c912-64bf-11e8-9e35-b9bc479b0b22.png)

Here's the history of networks from 364 going back 10 at a time and what they thought of the winning move Rxh4 / a4h4 (focus on V and P for now):
Generally, the prior for this winning move is very low at under 1%, and the value is also unfavorable for white, so search will normally avoid it. This makes it hard to learn tactics where playing an initially bad-looking move opens up a better outcome.
That's where noise comes in to trick search into visiting it more, and here's 50 runs of

```
./lc0 --weights=id359 --verbose-move-stats --noise --no-smart-pruning
```

with `go nodes 800` from the above `position startpos …`:

Here, 13 of 50 games would have produced valuable training data, so noise is indeed working, but the majority is training to avoid the correct move. Averaging this training data for the move across 50 games should cause P to move towards 16.3% (= 6523 / ~800 / 50). But then, combined with training data from other games, the networks have learned to keep avoiding this move.
As from the other issue: The premise is that when a self-play game does end up in a learnable board state, it seems unfortunate that it misses the opportunity to generate valuable training data for the correct move more often than not. Clearly, AZ's numbers are good enough to eventually generate strong networks, but perhaps training search could be better optimized?
I've rerun the analysis with lc0, 50 games per configuration, from the above board state to measure the average training data for the expected tactic:
Testing patches for visit twice and negative fpu
```diff
diff --git a/src/mcts/search.cc b/src/mcts/search.cc
--- a/src/mcts/search.cc
+++ b/src/mcts/search.cc
@@ -650,4 +650,9 @@ Node* Search::PickNodeToExtend(Node* node, PositionHistory* history) {
   for (Node* iter : node->Children()) {
     if (is_root_node) {
+      if (kNoise && iter->GetN() < 2) {
+        node = iter;
+        possible_moves = 2;  // avoid "only one possible move" short circuit
+        break;
+      }
       // If there's no chance to catch up the currently best node with
       // remaining playouts, not consider it.
```

```diff
diff --git a/src/mcts/search.cc b/src/mcts/search.cc
--- a/src/mcts/search.cc
+++ b/src/mcts/search.cc
@@ -645,5 +645,5 @@ Node* Search::PickNodeToExtend(Node* node, PositionHistory* history) {
   float parent_q =
       (is_root_node && kNoise)
-          ? -node->GetQ(0, kExtraVirtualLoss)
+          ? -node->GetQ(0, kExtraVirtualLoss) + kFpuReduction
           : -node->GetQ(0, kExtraVirtualLoss) -
                 kFpuReduction * std::sqrt(node->GetVisitedPolicy());
```

I only ran one "visit each root move twice" configuration, as even with the default search parameters it generally searches much deeper after being nudged over by the forced breadth exploration. This is true across all the previously listed networks from id364 to id124 above, and the outputs with high Ns are with "visit twice."
Is there an appropriate level of average tactic training? It looks like the current 16.3% is too low to outweigh the other training data. A related question is how often self-play games get into learnable states, but I don't have a good way to answer that.