LeelaChessZero / lc0

The rewritten engine, originally for TensorFlow. Now all other backends have been ported here.
GNU General Public License v3.0

Lc0-Test10/30/40/AF 6-man perpetual MCTS-related misevaluation Cause&Fix #742

Closed · TesseractA closed this 4 years ago

TesseractA commented 5 years ago

EDIT: I have clarified this post so that others can see what I'm seeing, particularly in the section with the visuals.
EDIT 2: As of April 13th, I have made another discovery with the assistance of Faroe22 on the Leela Discord. The problem is not purely an MCTS problem; it is also tied to the neural network's Q values. See my new comment below.

Hello, the following FEN is a 6-man position that Lc0 evaluates incorrectly no matter which neural network you use (really, all of them fail). I had careless25 on the lczero Discord test this 6-man position and another very similar 7-man position; the Lc0 engine could not evaluate either as a draw. I finally have a concrete hypothesis as to what the problem is, and a guide to two possible fixes. (Certainty propagation does not fix this.)

7k/1RR4P/8/8/3K4/3r4/8/8 w - - 0 1

Leela evaluates this position wrong

The perpetual stalemate threat should prevent white from ever delivering this mate in two (or in one, if the black rook moves to the a- or b-file without the white king there to block it).

You can compare the PVs in the searches below with what Leela comes up with in this Lichess study, which also includes a 7-man position that she badly misevaluates: https://lichess.org/study/pr2yCu1e. The 5-man position she evaluates correctly, but only because she is given 5-man tablebases.

250k nodes:
info depth 8 seldepth 17 time 13814 nodes 254017 score cp 1411 hashfull 265 nps 18388 tbhits 0 pv d4e5 d3d8 e5f6 d8d6 f6e7 d6e6 e7d7 e6d6 d7c8 d6b6 b7a7 b6b7 c8b7

3M nodes:
info depth 10 seldepth 22 time 132587 nodes 3339570 score cp 1345 hashfull 1000 nps 25187 tbhits 0 pv d4e5 d3e3 e5d5 e3d3 d5e6 d3d6 e6e5 d6e6 e5d4 e6d6 d4e3 d6d4 e3f3 d4d3 f3f4 d3d8 b7a7 d8f8

You might want to note that Leela only evaluates this 6-man position properly at very low node counts. Here is a demonstration of the issue: Leela's MCTS initially gives the correct evaluation, but fails after somewhere between 10 and 100 nodes into the position.

NOTE: These PVs are only a representation of the search (though still telling), not the entire search tree itself.

kxd3 evaluation
Low nodes. The NN's initial thinking at 1 ply is that all positions lead to a draw. Note that this particular PV isn't an 'incorrect' continuation for white, it is just a simple one. This is not an issue.
info depth 1 seldepth 2 time 264 nodes 7 score cp 0 hashfull 0 nps 26 tbhits 0 pv d4d3
info depth 1 seldepth 2 time 1004 nodes 11 score cp 0 hashfull 0 nps 10 tbhits 0 pv d4d3

kxc3 evaluation
Here MCTS begins to misevaluate the position (7700 centipawns in the first evaluation below) and finds not-yet-disproven wins for white in her search. After white's move there are 14 replies for black: 12 lose for black, 2 draw. Even though she finds one correct continuation in the PV, there are so few of those in the MCTS search that MCTS returns a nonsense evaluation.
info depth 3 seldepth 3 time 3338 nodes 31 score cp 7700 hashfull 0 nps 5 tbhits 0 pv d4c4 d3c3 c4c3
info depth 3 seldepth 4 time 3377 nodes 40 score cp 731 hashfull 0 nps 8 tbhits 0 pv d4c4 d3c3 c4c3

kxd5 evaluation
Another drawn continuation, yet with a wildly incorrect 807-1412 centipawn evaluation.
info depth 3 seldepth 4 time 5880 nodes 76 score cp 1412 hashfull 0 nps 11 tbhits 0 pv d4c5 d3d5 c5d5
info depth 3 seldepth 4 time 8931 nodes 102 score cp 807 hashfull 0 nps 10 tbhits 0 pv d4c5 d3d5 c5d5

incorrect rd8 evaluation continuation
The search incorrectly continues with kb5 instead of going for the win, allowing black to keep up the stalemate threat.

incorrect kb5 evaluation continuation
Overall, she finds d3d5 as black, another drawing move. But this time her PV displays d5d8, which is a white win. This is a perfect example of what goes wrong in her search: because there are so many wrong black moves in her search that lead to a white win, she assumes it is a white win and averages these drastically incorrect evaluations before finding the drawing move. With reasoning and evidence like this, it appears that these incorrect continuations are indeed what plagues the evaluation.
info depth 4 seldepth 5 time 5428 nodes 154 score cp 1029 hashfull 0 nps 9 tbhits 0 pv d4c5 d3d5 c5b4 d5d8 b4b5

2 incorrect moves rd7 kf5 evaluation continuation
Now the search plays a silly reply for black, d3d7, but white doesn't punish the mistake in this PV, since the two winning moves that punish it, Rb8 and Rc8, weren't searched; they are only two of 24 possible moves. The yellow circles represent white's options to move the piece and win the position.
info depth 4 seldepth 6 time 9232 nodes 268 score cp 1071 hashfull 0 nps 17 tbhits 0 pv d4e4 d3d7 e4f5 d7f7 f5g6

kxb5 continuation
Soon enough she finds a 'correct' continuation, but the MCTS average over other positions still won't recognize it as a draw.
info depth 4 seldepth 6 time 19849 nodes 338 score cp 1207 hashfull 1 nps 11 tbhits 0 pv d4c5 d3d5 c5b6 d5b5 b6b5

2 incorrect moves rd8 ke6 evaluation continuation
But she goes deeper and finds another ridiculous continuation 6 ply in, d6d8.
info depth 5 seldepth 8 time 120 nodes 1527 score cp 2441 hashfull 3 nps 3200 tbhits 0 pv d4e5 d3d5 e5e6 d5d6 e6f5 d6d8 f5e6

You'll notice that it is near the front of her search that something goes wrong and she evaluates the position incorrectly. She finds so many of these incorrect continuations at the front of her search that the MCTS averages the wins and the draws; correct drawn continuations get flooded out by incorrect evaluations at the forefront of her search. If you look carefully, you'll also notice that her PV is an odd number of plies each time (except at very high node counts, such as the 3-million-node example, where she goes deeper into some lines without yet noticing the mistakes made earlier), because her search is playing the wrong moves on black's behalf, while white can only take what black gives.

Based on these observations, my hypothesis is that Leela's neural network evaluates most of the individual positions she searches correctly, because tablebase rescoring has taught the NN to understand them, but that something is systematically wrong with the search itself. The outputs are wrong because she always finds a continuation, a deviation at the end of her search that indicates the position is not a draw; and because there are so many deviations, the final output of the Lc0 engine is higher than it should be.
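To make the averaging effect concrete, here is a back-of-the-envelope illustration with made-up numbers (they are not taken from the searches above): suppose a subtree has received 100 evaluations, of which 90 back up a near-win value of about 0.95 and only 10 back up the correct draw value of 0. The averaged Q is then

```latex
% hypothetical numbers, purely to illustrate the averaging backup
Q_{\text{subtree}} = \frac{90 \times 0.95 + 10 \times 0.0}{100} \approx 0.86
```

which still converts to a very large centipawn score, even though the one line black needs in order to hold scores exactly 0. With an averaging backup, the subtree's value can only drift toward 0 as fast as the drawing replies accumulate visits, which is exactly the slow convergence shown in the outputs above.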

mcts-related evaluation failure without intervention

How do you fix this? We want to narrow the problem down so that we aren't considering an infinite number of fixes. There are 'zero' and 'non-zero' ways to fix this. Because it's an issue in MCTS, there should be something about MCTS that we can add to or change. We can therefore look for a computationally inexpensive change to the MCTS that recognizes when the search perpetually finds incorrect continuations, by looking for a whole line of drawn-evaluated positions suddenly turning into won positions at the end of the search.

You could go down the path of writing some complex evaluation function that evaluates the tree as a whole (even a NN). This is impractical and computationally expensive. You could even 'simplify' the whole matter and look for repeating patterns in the search, like those demonstrated in the diagram. These are still impractical, would have far too many bugs, and besides, they are non-zero.

Instead, a far simpler course of action: assign a new value in the MCTS that represents stalemates, and track a 'proven-stalemate frequency' which can be used to guide the MCTS itself. If a repeating pattern of stalemates is found early in the search, the 'stalemate' value tells the MCTS to keep searching for stalemate lines to make sure it isn't missing any. Otherwise, MCTS won't first try to prove that every proper continuation by black fails to lead to something drawish. A rough code sketch of the idea follows.
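Here is a minimal, self-contained sketch of that idea, not lc0 code: a stalemate counter is backed up alongside Q, and the selection formula gets a small exploration bonus toward subtrees where proven stalemates keep appearing. All of the names and the node layout (`stalemates`, `stalemate_bonus`, this `Node` struct) are invented for illustration, and whether such a bonus actually helps is untested here.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical node for an averaging MCTS; not lc0's Node class.
struct Node {
  std::vector<Node*> children;
  double q = 0.0;            // running average of backed-up values
  uint32_t n = 0;            // visit count
  uint32_t stalemates = 0;   // proven stalemate terminals seen in this subtree
  float policy = 0.0f;       // prior from the policy head
  bool terminal_stalemate = false;
};

// Backup: besides the usual value average, propagate a stalemate counter up
// the selected path (root ... leaf).
void Backup(std::vector<Node*>& path, double v) {
  const bool saw_stalemate = path.back()->terminal_stalemate;
  for (auto it = path.rbegin(); it != path.rend(); ++it) {
    Node* node = *it;
    node->n += 1;
    node->q += (v - node->q) / node->n;
    if (saw_stalemate) node->stalemates += 1;
    v = -v;  // flip side-to-move perspective each ply
  }
}

// Selection: ordinary PUCT plus a small bonus toward subtrees where proven
// stalemates keep turning up, so the drawing resource gets re-examined.
Node* Select(const Node* parent, double cpuct, double stalemate_bonus) {
  Node* best = nullptr;
  double best_score = -1e9;
  for (Node* child : parent->children) {
    const double u =
        cpuct * child->policy * std::sqrt(double(parent->n)) / (1 + child->n);
    const double s = stalemate_bonus * child->stalemates / (1.0 + child->n);
    const double score = -child->q + u + s;  // child q is from opponent's view
    if (score > best_score) { best_score = score; best = child; }
  }
  return best;
}
```

The bonus term decays with the child's visit count just like the U term, so it only nudges the search toward re-verifying stalemate-rich lines rather than dominating selection.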

Perhaps a more solidly 'zero' and more general solution, for any kind of perpetual, would be to train the NN to recognize, predict, and output what kind of outcome will happen: a '50-move-rule %', a '3-fold-repetition %', a 'stalemate %', and a 'non-drawn %'. In a way this is a new, more general kind of 'policy' that can guide how the MCTS itself searches, based on how many occurrences of each kind of outcome the search sees. These new output values should further hint at which lines are 'incorrect' and negligible, and the MCTS can use that accordingly. A sketch of one way the search could consume such outputs is given below.
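As a very rough illustration of how outcome-type outputs could be consumed by the search (this is not a training recipe, the head and field names are invented, and damping is just one of several ways such predictions could be used), the backed-up value could be pulled toward 0 in proportion to the predicted probability that the position ends in one of the forced-draw outcome types:

```cpp
#include <algorithm>

// Hypothetical extra outputs alongside the value head; names are invented.
struct OutcomeHead {
  float p_fifty_move = 0.0f;   // predicted chance of a 50-move-rule draw
  float p_repetition = 0.0f;   // predicted chance of a 3-fold repetition
  float p_stalemate = 0.0f;    // predicted chance of a stalemate
  float p_decisive = 1.0f;     // predicted chance the game is not drawn
};

// Damp the scalar value toward 0 by the predicted "forced draw" mass, so a
// position the network believes is likely a perpetual or stalemate fortress
// contributes less of its optimistic value to the averaging backup.
double AdjustedValue(double v, const OutcomeHead& o) {
  const float draw_mass =
      std::clamp(o.p_fifty_move + o.p_repetition + o.p_stalemate, 0.0f, 1.0f);
  return v * (1.0 - draw_mass);
}
```

Training such heads would of course need the corresponding labels in the training data, which is the harder part of the proposal.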

Some day I may start working on such code and work out the technicalities, but for now I'm putting this problem, my thinking, and two candidate solutions out there in the hope that someone can write the code to solve this long-standing problem, which has existed since the beginning of the project, and improve Leela even more.

shinbet commented 5 years ago

I don't think your analysis is correct... you say 'She finds so many of these incorrect continuations', but those are correct continuations! There is a possibility she can escape the draw... at least that is her hope. The problem is that there are so many permutations, and there is always another path that opens up another huge branch.

The reason SF finds it is probably that it prunes transpositions better and is so much faster. On my laptop it took a few seconds to get to 0.00, meaning probably a few million nodes. This is a good test for my transposition pruning experiment... but it probably still has way too many permutations for it to help Leela.

TesseractA commented 5 years ago

EDIT: I've clarified the relevant parts of my post to make it easier for people to see what I see.

Perhaps we should get the terminology clear. By 'continuation' I mean a move made by one side that follows from a position. By 'incorrect continuation' I mean a move that gives the other side a chance to win the game. If you take a look at the positions with red and yellow arrows, they show sequences of incorrect continuations, and it's no coincidence that mistakes were made by both sides in each of the lines the PV shows.

Also, SF doesn't even need the entire search tree of 50 moves to figure out that it's a draw. I'm fairly sure SF determines whether or not it's a draw by deliberately looking for opportunities toward certain squares in certain reachable positions.

TesseractA commented 5 years ago

I've recently had a talk with one of the contributors to this project, and he made it clear that the above post does not contain a clear-cut implementation of the specialized draw% outcomes. I will modify and edit my post when I can figure out how to adapt equation 5 in one of the AlphaGo papers (http://airesearch.com/wp-content/uploads/2016/01/deepmind-mastering-go.pdf) so that something more concrete is given as a fix. I did, however, make progress on what exactly should be implemented/changed.

Videodr0me commented 5 years ago

There are a couple of things to note here:
(1) As @shinbet pointed out, it's the sheer number of losses in countless transpositions that inflates the score.
(2) The only real fix for this is in-tree transposition handling. That's what's killing UCT efficiency here.
(3) CP helps, as the score decreases toward zero faster, but it still takes an unreasonably long time. Current versions of PR487 and PR700 do not use pruning schemes (these would operate like a minimax backup operator). I had those in place at one time, but removed them for code simplification (they penalized visiting certain losing moves with increasing N). Maybe I will bring them back and use this as a test position in the future.
(4) Even though Leela plays the correct move, I can see this leading to problems if Leela decides to go for this position from afar, thinking that it is winning.

mooskagh commented 5 years ago

I think it's a valid description of what's happening, but I don't think it's stalemate-specific. The same thing happens, e.g., with perpetuals. As the PV goes deeper and deeper, some non-optimal moves of the losing side get some visits, and they participate in the subtree average. As more visits go to the "correct" move, the same thing happens one level deeper, and so on:

[diagram]

Note that this issue exists even if the value head is extremely good at detecting perpetuals, stalemates, fortresses, etc. (e.g. returns V=0 for them), as long as there is a long chain of moves and not just a terminal node.

I cannot easily think of a good solution for this now.

Maybe something like "backpropagate V only when it's visited for the second time (so when it's not a leaf)" would help with that? Just a random idea from 5 seconds ago.
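A tiny sketch of what that idea might look like in an averaging backup (purely illustrative; the `Node` layout and `BackupDeferred` are invented, and this is not a claim about how lc0 should do it): on the first visit the leaf's value is only stored on the leaf itself, and the ancestors' averages are updated only from the second visit onward.

```cpp
#include <vector>

// Hypothetical deferred backup for an averaging MCTS.
struct Node {
  double q = 0.0;               // running average of backed-up values
  int n = 0;                    // visit count
  bool evaluated_once = false;  // has this node been evaluated at least once?
};

void BackupDeferred(std::vector<Node*>& path, double v) {
  Node* leaf = path.back();
  if (!leaf->evaluated_once) {
    // First visit: keep the value on the leaf, but do not let it enter any
    // ancestor's average yet.
    leaf->evaluated_once = true;
    leaf->n = 1;
    leaf->q = v;
    return;
  }
  // From the second visit on, back up normally through the whole path.
  for (auto it = path.rbegin(); it != path.rend(); ++it) {
    Node* node = *it;
    node->n += 1;
    node->q += (v - node->q) / node->n;
    v = -v;  // flip side-to-move perspective each ply
  }
}
```

One open detail is whether the cached first-visit value should also be folded into the ancestors once the second visit arrives; the sketch above simply drops it, which is part of the information loss discussed further down in the thread.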

RedDenver commented 5 years ago

I cannot easily think of a good solution for this now.

  • CP won't help, as a certain position is never reached.
  • Handling transpositions, while it helps to reduce the problem (by exhausting possible positions more quickly), doesn't seem to address the root cause of the problem.

Maybe something like "backpropagate V only when it's visited for the second time (so when it's not a leaf)" would help with that? Just a random idea from 5 seconds ago.

We could give nodes from greater depth more weight than shallower nodes. For example, change the current q calculation from q += (v_n - q)/n to

d += d_n
q += (d_n*v_n - q)/(n+d)

where d_n is the depth, relative to the current node, of the node that produced v_n, and d is the running sum of all the returned depth values; this would require additional memory to store the d value.

oscardssmith commented 5 years ago

That would be really bad. The higher the depth, the less likely a move is to be relevant.

RedDenver commented 5 years ago

That would be really bad. The higher the depth, the less likely a move is to be relevant.

Why do you say that? I'm trying to construct an equation that makes higher depth count for more than lower depth. So a value from a child node at depth 1 would count the same as it always has, but a value from a node at depth 15 would be equivalent to 15 child nodes at depth 1.

I double-counted n in my last post, so the equation should be (a small code sketch of this follows below):

d += d_n
q += (d_n*v_n - q)/(d)
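For concreteness, here is a small sketch of how that depth-weighted backup could look, reading the proposal as a depth-weighted running mean in which each backed-up value v_n enters the average with weight d_n; note this rearranges the update to q += d_n*(v_n - q)/d so that it is an exact weighted average. The `Node` layout and function name are made up for illustration, not lc0 code.

```cpp
#include <vector>

// Hypothetical node for a depth-weighted averaging backup; not lc0's Node.
struct Node {
  double q = 0.0;  // depth-weighted running mean of backed-up values
  double d = 0.0;  // sum of depth weights seen so far (extra per-node memory)
  int n = 0;       // plain visit count, kept for reference
};

// Back up the leaf's value along the selected path (root ... leaf). The value
// enters each ancestor's average with weight equal to the depth of the leaf
// below that ancestor, so deeper (presumably more accurate) evals count more.
void BackupDepthWeighted(std::vector<Node*>& path, double v_leaf) {
  const int leaf_index = static_cast<int>(path.size()) - 1;
  double v = v_leaf;
  for (int i = leaf_index; i >= 0; --i) {
    Node* node = path[i];
    const double d_n = (leaf_index - i) + 1.0;  // depth of this eval below node
    node->n += 1;
    node->d += d_n;
    node->q += d_n * (v - node->q) / node->d;   // weighted running mean
    v = -v;  // flip side-to-move perspective each ply
  }
}
```

The extra per-node `d` field is the additional memory mentioned above.
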
oscardssmith commented 5 years ago

The chance that a position is relevant to the eval is the product, over the depths, of the probability that each move needed to get to that position is optimal. This means that the higher the depth, the less certain you are that the move you are evaluating should be counted in the eval.
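Written out (my formalization of the sentence above, not the commenter's own notation): if p_i is the probability that the i-th move along the path from the root is actually the optimal one, then

```latex
\Pr\big[\text{position at depth } d \text{ is relevant to the root eval}\big] = \prod_{i=1}^{d} p_i
```

which can only shrink (or at best stay constant) with every additional ply, since each p_i is at most 1.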

RedDenver commented 5 years ago

True, but the greater the depth the more accurate the eval should be, which is what I'm trying to capture with that equation. There's a trade-off between likelihood of node being played in the game (shallow nodes) and accuracy of the eval (deep nodes).

oscardssmith commented 5 years ago

I think that's only true in positions where the depth has provided simplification. For fortress like positions, it isn't true.

Videodr0me commented 5 years ago

I think it's a valid description of what's happening, but I don't think it's stalemate-specific. The same thing happens, e.g., with perpetuals. As the PV goes deeper and deeper, some non-optimal moves of the losing side get some visits, and they participate in the subtree average. As more visits go to the "correct" move, the same thing happens one level deeper, and so on.

That's why it's a good thing to do two-fold draw scoring; it cuts down these possibilities by a sizable factor, especially for perpetuals.

Note that this issue exists even if the value head is extremely good at detecting perpetuals, stalemates, fortresses, etc. (e.g. returns V=0 for them), as long as there is a long chain of moves and not just a terminal node.

It's about both policy and value. If policy does not put all of the probability mass on the narrow path, you will get these biases in conjunction with an averaging operator. One could gradually phase in a minimax operator depending on some condition. I think fersberry did some work on this; it might help here.

I cannot easily think of a good solution for this now.

* CP won't help, as a certain position is never reached.

A certain position is reached many times in the tree (2-folds, stalemates), or would be reached eventually (50-move rule). The above case is pathological because white can avoid two-folds in countless ways, making the tree too deep to reach 50-move-rule draws; but in many other cases terminals are reached (even if only at the tips) and then cascade back through CP, speeding up finding the correct move, while normal Leela is still lost and revisits all the alternatives. Only one certain node at the end is enough to start that cascade. For fun, try this position (lots of stalemates and perpetuals) with PR700, SF, and normal Leela:

8/8/8/1B6/6p1/8/4KPpp/3N2kr w - - 0 1

It takes PR700 (with CP + two-fold, NN as used in TCEC Sufi) about 45 seconds to find the win (1200kN) and 90 seconds to display mate in 11. Normal Leela still wants to play a drawing move after 200MN, when I aborted.

* Handling transpositions, while it helps to reduce the problem (by exhausting possible positions more quickly), doesn't seem to address the root cause of the problem.

It depends on what you define as the root cause. This seems like a position where you need search to solve it, so additional efficiency through transpositions allows you to reach relevant terminals in time to start the CP cascade.

Maybe something like "backpropagate V only when it's visited for the second time (so when it's not a leaf)" would help with that? Just a random idea from 5 seconds ago.

I don't know; it seems that would throw away an awful lot of information. Also, when first expanded, that node would not contribute to informing the search about the path. One could artificially shorten the 50-move rule, because Leela rarely reaches it, maybe only in near-endgame positions where the root eval shows huge discrepancies to the best move's Q.

Videodr0me commented 5 years ago

I'm trying to construct an equation that makes higher depth count more than lower depth. So a value from a child node at depth 1 would count the same as it always has, but a value from a node at depth 15 would be equivalent to the 15 child nodes of depth 1.

1) I tried this early on by valuing later visits higher through a power function (and also various other functional forms). It worked at low-visit searches, but not very well at high visit counts. 2) Also note that Qs from deeper nodes are already weighted higher (collectively). With an effective branching factor of 2 (which should roughly hold for chess) you get roughly half of all evals from the leaf nodes, so the (approximately) deepest level is already collectively responsible for about 50% of your best move's Q (see the arithmetic sketch below).
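To spell out that last point (my arithmetic, under the stated assumption of an effective branching factor of 2, i.e. a complete binary tree of depth D): there are 2^D leaves out of 2^(D+1) - 1 nodes in total, so

```latex
\frac{2^{D}}{2^{D+1}-1} \approx \frac{1}{2} \quad \text{for large } D
```

meaning that if every node has been evaluated once, roughly half of all the evaluations averaged into the root's Q come from the deepest level of the tree.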

TesseractA commented 5 years ago

Faroe22 pushed me past a local minimum today. There is a key component to the posted 6-man position that I was missing. Here is the verbose-move-stats output for the initial position at three different node counts, with stats to illustrate the issue (and even, upon closer inspection, a possible second issue? But first things first...):

go nodes 1
info depth 1 seldepth 1 time 283 nodes 1 score cp 0 hashfull 0 nps 3 tbhits 0 pv d4c4
d4c5 (760 ) N: 0 (+ 0) (P: 19.06%) (Q: 0.99214) (D: 0.000) (U: 0.57168) (Q+U: 1.56382) (V: -.----)
d4e5 (762 ) N: 0 (+ 0) (P: 19.81%) (Q: 0.99214) (D: 0.000) (U: 0.59438) (Q+U: 1.58652) (V: -.----)
d4e4 (755 ) N: 0 (+ 0) (P: 20.22%) (Q: 0.99214) (D: 0.000) (U: 0.60665) (Q+U: 1.59879) (V: -.----)
d4d3 (749 ) N: 0 (+ 0) (P: 20.44%) (Q: 0.99214) (D: 0.000) (U: 0.61324) (Q+U: 1.60539) (V: -.----)
d4c4 (754 ) N: 0 (+ 0) (P: 20.47%) (Q: 0.99214) (D: 0.000) (U: 0.61416) (Q+U: 1.60630) (V: -.----)

When each move has been searched:
info depth 1 seldepth 2 time 345 nodes 9 score cp 0 hashfull 0 nps 26 tbhits 0 pv d4d3
d4c5 (760 ) N: 1 (+ 0) (P: 19.06%) (Q: 0.96872) (D: 0.000) (U: 0.80867) (Q+U: 1.77739) (V: 0.9687)
d4e5 (762 ) N: 1 (+ 0) (P: 19.81%) (Q: 0.97207) (D: 0.000) (U: 0.84079) (Q+U: 1.81286) (V: 0.9721)
d4e4 (755 ) N: 1 (+ 0) (P: 20.22%) (Q: 0.97560) (D: 0.000) (U: 0.85815) (Q+U: 1.83375) (V: 0.9756)
d4c4 (754 ) N: 1 (+ 0) (P: 20.47%) (Q: 0.97352) (D: 0.000) (U: 0.86877) (Q+U: 1.84228) (V: 0.9735)
d4d3 (749 ) N: 4 (+ 0) (P: 20.44%) (Q: 0.00000) (D: 1.000) (U: 0.34699) (Q+U: 0.34699) (V: 0.0000) (T)

At 1021 nodes:
info depth 4 seldepth 8 time 21050 nodes 1021 score cp 2645 hashfull 2 nps 48 tbhits 0 pv d4c5 d3d5 c5b6 d5b5 b6b5
d4d3 (749 ) N: 19 (+ 0) (P: 20.44%) (Q: 0.00000) (D: 1.000) (U: 1.01038) (Q+U: 1.01038) (V: 0.0000) (T)
d4c4 (754 ) N: 189 (+72) (P: 20.47%) (Q: 0.90691) (D: 0.085) (U: 0.07724) (Q+U: 0.98416) (V: 0.9735)
d4e4 (755 ) N: 206 (+56) (P: 20.22%) (Q: 0.91266) (D: 0.068) (U: 0.07601) (Q+U: 0.98867) (V: 0.9756)
d4e5 (762 ) N: 275 (+158) (P: 19.81%) (Q: 0.94371) (D: 0.047) (U: 0.04513) (Q+U: 0.98884) (V: 0.9721)
d4c5 (760 ) N: 331 (+89) (P: 19.06%) (Q: 0.94398) (D: 0.048) (U: 0.04475) (Q+U: 0.98872) (V: 0.9687)

Notice the V values. These V values indicate that it is not just Q that is the issue; the neural network itself is evaluating these positions incorrectly, even though it's a 6-man position and you'd expect the NN to evaluate it correctly.

Here's the code at line 254 in src/mcts/search.cc describing where the V values come from:

if (edge.IsTerminal()) {
  v = edge.node()->GetQ();
} else {
  NNCacheLock nneval = GetCachedNNEval(edge.node());
  if (nneval) v = -nneval->q;
}

Before, I had assumed:
- The Neural Network was absolutely perfect, and had actually evaluated each tablebase position correctly at each node.
- The '0 centipawns' evaluation was produced by a Neural Network evaluation.

Now we know:
- The NN still incorrectly evaluates individual drawn 6-man positions at each node.
- The '0 centipawns' evaluation was actually produced by a line of code marking stalemates as draws.
- There is more going on than just the simple MCTS search issue.

The new perspective on this issue:
- The problem lies partly in training, where the limited perspective of the NN and MCTS will not find the correct continuation (which Mardak can testify to, as he tested Leela at 800 nodes).
- MCTS perpetual recognition is still poor. This could be fixed with cross-communication from the MCTS search to the NN about how the search is perpetually failing to find the predicted forced checkmates.
- To increase efficiency, we need to reduce the number of searched positions for increased depth. Certainty propagation is our best bet for that, going by Videodr0me's tests.

Before, I didn't know how to use verbose-move-stats and botched this analysis. It seems that Leela just needs to generalize these positions better, which is not as easy as fixing an MCTS search issue.

Another issue: wasted nodes on terminal positions. If you look closely at the third set of evaluations given above, you'll see this:

d4d3 (749 ) N: 19 (+ 0) (P: 20.44%) (Q: 0.00000) (D: 1.000) (U: 1.01038) (Q+U: 1.01038) (V: 0.0000) (T)

More specifically, N: 19 is what concerns me: 19 visits have been spent on this one terminal position. It might be slowing Lc0 down a bit. It appears that lc0 has a way to recognize that nodes are terminal, as in src/mcts/search.cc line 267:

if (edge.IsTerminal()) oss << "(T) ";
infos.emplace_back(oss.str());

Yet things like edge.IsTerminal() still don't seem to be used to prevent Lc0 from searching the same terminal nodes again.
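For illustration only (this is not how lc0's selection code is actually structured, and it is debatable how much these revisits really cost, since revisiting a terminal requires no network evaluation): one way a search could treat a re-selected terminal node is to back up its stored exact value immediately, without descending or expanding anything.

```cpp
#include <vector>

// Hypothetical minimal node; not lc0's Node class.
struct Node {
  std::vector<Node*> children;
  double q = 0.0;       // stored value (for a terminal: the exact game result)
  int n = 0;            // visit count
  bool is_terminal = false;
};

// One simulation step of a plain averaging MCTS (selection policy and network
// evaluation are elided). If the walk lands on a terminal node, its exact
// value is returned immediately; no expansion or evaluation happens for it.
double Simulate(Node* node) {
  node->n += 1;
  if (node->is_terminal || node->children.empty()) {
    // For a terminal, q already holds the exact result (e.g. 0 for stalemate).
    return node->q;
  }
  Node* child = node->children.front();  // placeholder for a real selection rule
  const double v = -Simulate(child);     // value from this node's perspective
  node->q += (v - node->q) / node->n;    // averaging backup
  return v;
}
```

Treat this as a conversation aid rather than a proposed fix; the (T) markers above show that terminal detection itself already exists in the engine.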

I will be closing this issue in 24 hours and creating a new one next week, with a proper, comprehensive analysis of the problems we are encountering. Thanks, everyone, for your patience; I'm glad this inspired ideas for improving Lc0's search method. Going by this issue, communication between the MCTS and the NN seems important.

EDIT as of May 26th: I've made this comment a bit shorter and easier to read. I'm coming back to this issue to analyze it properly around July.

TesseractA commented 4 years ago

Closing without replacement because the nature of the current issues in larger nets is different, even though it is linked with this one. MLH also may or may not fix this.