LeelaChessZero / lczero-training

For code etc relating to the network training process.

Analyze blunders #5

Open dubslow opened 6 years ago

dubslow commented 6 years ago

From @mooskagh on May 8, 2018 7:23

Important!

When reporting positions to analyze, please use the following form. It makes it easier to see what's problematic with the position:

(old text below)

There are many reports on forums asking about blunders, and the answers so far have been along the lines of "it's fine, it will learn eventually, we don't know exactly why it happens".

I think at this point it makes sense to actually look into them to confirm that there are no blind spots in training. For that we need to:

Eventually all of this would be nice to have as a single command, but we can start manually.

For lc0, that can be done this way: --verbose-move-stats -t 1 --minibatch-size=1 --no-smart-pruning (unless you want to debug specifically with other settings).

Then run the UCI interface and issue the command:

position startpos moves e2e4 ....

(PGN move to UCI notation can be converted using pgn-extract -Wuci)

Then do:

go nodes 10

See the results, then add some more nodes by running:

go nodes 20
go nodes 100
go nodes 800
go nodes 5000
go nodes 10000
and so on

And look at how the counters change.
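The manual steps above could be scripted roughly as follows. This is only a sketch: the helper name `build_uci_script` is made up for illustration, and it only produces the text to type into the engine; it does not launch lc0 or wait for `bestmove` between searches.

```python
def build_uci_script(moves, node_counts=(10, 20, 100, 800, 5000, 10000)):
    """Return the UCI commands for a growing series of searches.

    `moves` is a list of moves in UCI notation, e.g. ["e2e4", "e7e5"].
    """
    position = "position startpos"
    if moves:
        position += " moves " + " ".join(moves)
    lines = [position]
    # One "go nodes N" per budget, smallest first, as in the steps above.
    lines += [f"go nodes {n}" for n in sorted(node_counts)]
    return "\n".join(lines)


print(build_uci_script(["e2e4", "e7e5"], node_counts=(10, 20, 100)))
```

Each `go nodes` line should be sent only after the previous search has finished, so that the counters can be compared between runs.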

Counters:

e2e4 N: 329 (+ 4) (V: -12.34%) (P:38.12%) (Q: -0.2325) (U: 0.2394) (Q+U: 0.0069)
 ^      ^    ^      ^           ^          ^            ^           ^
 |      |    |      |           |          |            |           Q+U, see below
 |      |    |      |           |          |           U from PUCT formula,
 |      |    |      |           |          |           see below.
 |      |    |      |           |         Average value of V in a subtree
 |      |    |      |          Probability of this move, from NN, but if Dirichlet
 |      |    |      |          noise is on, it's also added here, 0%..100%
 |      |    |     Expected outcome for this position, directly from NN, -100%..100%
 |      |   How many visits are processed by other threads when this is printed.
 |     Number of visits. The move with maximum visits is chosen for play.
Move

* U = P * Cpuct * sqrt(sum of N of all moves) / (N + 1)
  CPuct is a search parameter, can be changed with a command line flag.
* The move with largest Q+U will be visited next
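To compare counters between runs it may help to parse the stats line and recompute U. The sketch below assumes the line layout is exactly the one shown above (it may differ between lc0 versions); `parse_stats` and `puct_u` are illustrative names, and any Cpuct value passed in is up to the caller.

```python
import math
import re

# Pattern for one stats line in the layout shown above (assumed stable).
STATS_RE = re.compile(
    r"^(?P<move>\S+)\s+N:\s*(?P<n>\d+)\s+\(\+\s*(?P<in_flight>\d+)\)\s+"
    r"\(V:\s*(?P<v>-?[\d.]+)%\)\s+\(P:\s*(?P<p>-?[\d.]+)%\)\s+"
    r"\(Q:\s*(?P<q>-?[\d.]+)\)\s+\(U:\s*(?P<u>-?[\d.]+)\)\s+"
    r"\(Q\+U:\s*(?P<q_plus_u>-?[\d.]+)\)"
)


def parse_stats(line):
    """Parse one verbose move stats line into a dict of numbers."""
    match = STATS_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized stats line: {line!r}")
    fields = match.groupdict()
    return {
        "move": fields["move"],
        "n": int(fields["n"]),
        "in_flight": int(fields["in_flight"]),
        "v": float(fields["v"]) / 100.0,  # percent -> fraction
        "p": float(fields["p"]) / 100.0,  # percent -> fraction
        "q": float(fields["q"]),
        "u": float(fields["u"]),
        "q_plus_u": float(fields["q_plus_u"]),
    }


def puct_u(p, n, total_n, cpuct):
    """U = P * Cpuct * sqrt(sum of N of all moves) / (N + 1)."""
    return p * cpuct * math.sqrt(total_n) / (n + 1)


stats = parse_stats(
    "e2e4 N: 329 (+ 4) (V: -12.34%) (P:38.12%) (Q: -0.2325) (U: 0.2394) (Q+U: 0.0069)"
)
# The last column should simply be the sum of the Q and U columns.
assert abs(stats["q"] + stats["u"] - stats["q_plus_u"]) < 1e-4
```

Recomputing `puct_u` for every move in a dump and adding Q shows which move the search would pick next.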

Help wanted:

Copied from original issue: glinscott/leela-chess#558

dubslow commented 6 years ago

From @Tilps on May 8, 2018 9:51

https://lichess.org/e07JvP6g - I started analyzing this - the position after the blunder has a 0.11% policy for a move which is checkmate. Takes 20k visits to get its first look and then it obviously gets every visit. I haven't tested how that varies with noise applied.
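A rough way to see why such a low prior delays the first visit so much, using the U formula from the first message: for an unvisited move N = 0, so U = P * Cpuct * sqrt(total visits), and the move is only tried once U closes the gap to the currently preferred move. The sketch below solves for that visit count. The Cpuct value and the gap are made-up illustrative numbers, not measurements from this position, and the model ignores that the other moves' Q and U also change during search.

```python
import math


def visits_before_first_look(prior, cpuct, gap):
    """Smallest total visit count at which an unvisited move's U exceeds `gap`.

    `gap` is the assumed difference between the best move's Q+U and the
    value the search assigns to the unvisited move.
    """
    # Solve prior * cpuct * sqrt(total) > gap for total.
    return math.floor((gap / (prior * cpuct)) ** 2) + 1


# Prior 0.11% from the report; cpuct=1.2 and gap=0.2 are hypothetical.
print(visits_before_first_look(prior=0.0011, cpuct=1.2, gap=0.2))
```

Because the required visit count grows with the inverse square of the prior, halving the policy for a move roughly quadruples the wait under this simplified model.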

dubslow commented 6 years ago

From @Ghotrix on May 8, 2018 10:17

isn't position startpos fen ... more convenient for this case?

dubslow commented 6 years ago

Here's an easy one ply discovered attack tactic missed by Leela after 2K5 nodes. Position: https://lichess.org/mbWjiT93#105 Twitch recording of the thinking time/engine output: https://clips.twitch.tv/GenerousSmellyEggnogPunchTrees And as the lichess analysis says, this was "merely" the cherry on top of the multiple-mistakes cake. How to swing 15 points' eval in just 3 moves!

Further analysis requested please. How many playouts until Leela even once searches the tactic?

Edit: Tilps' position is also a discovery bug, I think Leela's policy assumes that the rook can just capture the queen, which is of course prevented by the pin = discovered attack

dubslow commented 6 years ago

From @hsntgm on May 8, 2018 11:46

@mooskagh thanks for diagram.

If I am wrong, please correct me.

Leela's strength comes from memorized games and positional samples she collected in self-play, which show up as what we call visits. As I understand it, her visits come from the weights instead of alpha-beta pruning. If there is a tactical opportunity in the position but Leela visits another move more, she chooses that move.

Basic tactical positions occur suddenly in a game, and the hardest part is to teach her this. Is it necessary to play billions of games in order to learn the tactical motifs that occur during the game?

Or could you add a simple tactical search algorithm that is triggered on every move and works independently of visits for a while? After she finds a tactical move with this tactical search (looking for sudden jumps to +1, +2, etc.) and enters that move's tree, she can add this sample to her memory too. This way she would learn to play tactically in a short time and tune herself automatically.

dubslow commented 6 years ago

From @chara1ampos on May 9, 2018 5:49

I am stating the obvious, but I think that brute force engines like Stockfish and Houdini have the advantage that their evaluation is cheap, and they can search very deep, thus having great tactics.

Leela's evaluation is very expensive, and thus she cannot search deep enough to avoid blunders. I sense that if one could speed up her evaluation, so she could search deeper, her blunders would be greatly reduced.

On an Nvidia Titan V, where Leela cudnn can evaluate 8000 nodes per second, she did not seem to blunder, and even won several games against Stockfish, Komodo and Houdini: https://groups.google.com/forum/#!topic/lczero/YFMOPQ-J-q4

I recall that AlphaZero evaluated around 100000 nodes per second on the DeepMind supercomputer, which greatly improves its tactics. This raises the question: what nps did AlphaZero use during its training process? I suspect the nps can greatly affect the quality of the games during Leela's training. If the cudnn version of Leela can be used for training, the quality and speed of training will likely be increased drastically.

dubslow commented 6 years ago

From @mooskagh on May 9, 2018 9:3

I've added a form for submitting problematic positions to the original message. Sorry for the bureaucracy, but that makes it much easier to see the problem.

dubslow commented 6 years ago

From @Ishinoshita on May 9, 2018 9:55

@chara1ampos : The DM paper says "During training, each MCTS used 800 simulations.", which is a bit ambiguous and may read as new playouts added to the tree or as visits for selected node. Thus nps is irrelevant (but for the total training time). 800 'simulations' is anyway far below 10K's of simulations you mention for match games. So, yes, AZC training may have included blunders as well, at least in early stages (like where we stand now).

dubslow commented 6 years ago

From @Why-Sensei on May 9, 2018 10:17

dubslow commented 6 years ago

From @hsntgm on May 9, 2018 14:44

@chara1ampos why doesn't anybody ask this question: maybe AlphaZero is just an auto-tuned Stockfish derivative with a neural network. A traditional chess engine's Elo depends on tuning the parameters in its code. Maybe they just do that in a neural network.

Stockfish 1.01: Elo 2754 in 2008
Stockfish 9: Elo 3444 in 2018

Look at Stockfish's development history: it gained only 700 Elo in ten years, with vast amounts of CPU time and genius C programmers who tuned parameters step by step. Now we expect Leela to gain 500 Elo with self-play. Who knows, maybe the road map is totally wrong.

I think that because someone says Leela draws with Stockfish. OK, very good news, but how can you explain these blunders and tactical weaknesses in a 3000 Elo program? Leela's skeleton was formed after 10 million games; there is no going back, and this is a big paradox for the project.

dubslow commented 6 years ago

From @Ishinoshita on May 9, 2018 16:53

"maybe Alpha zero just a auto tuned stockfish derivative with neural network" I'm afraid this is fully wrong, in at least:

dubslow commented 6 years ago

From @Why-Sensei on May 10, 2018 10:32

dubslow commented 6 years ago

From @mooskagh on May 11, 2018 8:18

Thanks for submitting the bug reports, they were very useful.

All the blunders so far can be explained by #576. The fix is there in client v0.10, but it will take multiple network generations to recover the network.

So for a few days (until ~300000-500000 games are generated by the v0.10 client and the network is trained on them), don't submit any other positions, as they are likely caused by the same bug.

After that new blunder reports are very welcome!

dubslow commented 6 years ago

From @mooskagh on May 13, 2018 16:14

For now it would be most interesting to see examples of blunders that appeared recently, e.g. if LCzero played the correct move with network id270 and now blunders. That way we'd have some examples of what exactly it unlearns and could look into the training data.

dubslow commented 6 years ago

From @TCECfan on May 14, 2018 17:29

ID: ID288CCLSGame65
Game: https://lichess.org/8mCbbkwl#240
Bad move: 121. Rc7.
Correct Move: Many other moves
Screenshot 1: image
Screenshot 2 shows Analysis by ID288 in Arena on my machine: image
Screenshot 3 shows Analysis by ID94 in Arena on my machine (Rc7 not listed): image
Configuration: CCLS Gauntlet
Network ID: 288
Time control: 1 min + 1 sec (increment)
Comment: Game was streamed on May 14th 2018.

dubslow commented 6 years ago

From @TCECfan on May 14, 2018 17:48

ID: ID288CCLSGame53
Game: https://lichess.org/0YJMfRI6#260
Bad move: 131. Ra6.
Correct Move: 131. Ba1 (By Stockfish 9 on LiChess)
Screenshot 1: image
Screenshot 2 shows Analysis by ID288 in Arena on my machine: image
Screenshot 3 shows Analysis by ID94 in Arena on my machine (Ra6 not listed): image
Configuration: CCLS Gauntlet
Network ID: 288
Time control: 1 min + 1 sec (increment)
Comment: Game was streamed on May 14th 2018.

dubslow commented 6 years ago

From @mooskagh on May 14, 2018 18:9

Thanks for posting, we are looking into those positions. Evaluation of this position is improved a lot in id291, which confirms the main explanation that we have now (value head overfitting).

dubslow commented 6 years ago

From @TCECfan on May 14, 2018 18:10

ID: ID288CCLSGame72
Game: https://lichess.org/CVYOwXSK
Bad Evaluation: Drew by 3-fold repetition with an evaluation of +15.98
Screenshot: image
Configuration: CCLS Gauntlet
Network ID: 288
Time control: 1 min + 1 sec (increment)
Comment: Game was streamed on May 14th 2018.

dubslow commented 6 years ago

From @TCECfan on May 14, 2018 18:13

There are lots more examples, but I will stop here then :)

dubslow commented 6 years ago

From @TCECfan on May 14, 2018 18:52

I couldn't resist one more...
ID: ID280CCLSGame7
Game: https://lichess.org/rWWqu4tx#98
Rh7 would end the game immediately by 3-fold repetition, but Leela played the losing move Kh3 instead: image
Stockfish 9 gives Rh7 as the only move: image
Configuration: CCLS Gauntlet
Network ID: 280
Time control: 1 min + 1 sec (increment)
Comment: Does Leela handle 3-fold repetition correctly?

dubslow commented 6 years ago

From @apleasantillusion on May 14, 2018 22:11

Interestingly, for the Rc7?? Kxc7 and Ra6 Kxa6?? blunders above, I can reproduce both of them with ID288 on CPU when using game history.

With just FEN, while it doesn't play both blunders, the killing responses to both blunders are given very low probability from policy, so it's just dumb luck that the engine doesn't play the blunder.

The really interesting part is that with the FEN modified so the 50-move rule halfmove counter is set to 0, it immediately sees both killing moves with very high policy outputs.

This is also true of this recent match game: http://lczero.org/match_game/268131

With game history or FEN, 292 plays 132. Rc7??, giving the obvious capture response very, very low policy output.

With FEN altered so 50-move rule halfmove counter is set to 0, it immediately sees the capture with 99% probability from policy.

Maybe these examples are just lucky, but it seems high values for the 50-move rule halfmove counter correlate with very strange blunders.
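The FEN modification described above only touches the fifth FEN field (the halfmove clock). A minimal sketch of that edit follows; the helper name `set_halfmove_clock` is illustrative, and the example FEN is the rook-endgame position quoted elsewhere in this thread.

```python
def set_halfmove_clock(fen, halfmove):
    """Return `fen` with its halfmove clock (50-move rule counter) replaced."""
    fields = fen.split()
    if len(fields) != 6:
        raise ValueError("expected a FEN with 6 space-separated fields")
    fields[4] = str(halfmove)  # field 5 of 6 is the halfmove clock
    return " ".join(fields)


fen = "2r5/R7/8/8/5k2/8/2K5/8 w - - 85 121"
# Same position with the counter reset to 0, plus a coarse sweep.
print(set_halfmove_clock(fen, 0))
sweep = [set_halfmove_clock(fen, h) for h in range(0, 100, 10)]
```

Feeding each FEN in `sweep` to the engine with `position fen ...` gives the policy as a function of the counter alone, since no move history is sent.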

dubslow commented 6 years ago

From @nelsongribeiro on May 14, 2018 23:26

http://lczero.org/match_game/268155

ID 292 blunders again against ID 233 near the 50-move rule coming up...

dubslow commented 6 years ago

From @trophymursky on May 14, 2018 23:29

interesting bit based off of apleasantillusion's comment (tho I'm using 292).

the fen for the interesting position is "2r5/R7/8/8/5k2/8/2K5/8 w - - 85 121" where the policy net ID292 has Rc7 (wrongfully) at 99.91%.

Specifically, if you set it to 60 half moves (instead of 85), the policy net for Rc7 is at 0.07%. At 65 half moves it's at 0.2%, at 66 it's at 0.71%, at 67 it's at 1.23%, at 68 it's at 6.53% (no longer considered the worst move), and at 69 it's at 89.47%.

I have no idea why the inflection point would be anywhere near where it is, but it's definitely interesting and points towards a training bug corrupting the policy net.

dubslow commented 6 years ago

From @so-much-meta on May 15, 2018 6:40

FYI... Regarding the a7c7 rook blunder above, I think this might be explained (partially) by https://github.com/glinscott/leela-chess/issues/607 EDIT: I guess this can be disregarded since someone confirmed that Arena and most GUIs do always send moves... Regardless, leaving this here because it is interesting to see the difference in policies with and without history.

Network 288..

With history:
position fen 7r/8/R7/3k4/8/8/2K5/8 w - - 77 117 moves a6a5 d5e6 a5a6 e6f5 a6a5 f5f4 a5a7 h8c8
go nodes 1000
(==> This chooses Kb3)
info string Kb2 -> 0 (V: 59.51%) (N: 0.29%) PV: Kb2
info string Kd1 -> 0 (V: 59.51%) (N: 0.90%) PV: Kd1
info string Kb1 -> 2 (V: 52.08%) (N: 1.86%) PV: Kb1 Ke3 Ra3+
info string Kd2 -> 5 (V: 59.96%) (N: 2.27%) PV: Kd2 Kf5 Rb7 Kf4
info string Kd3 -> 11 (V: 60.53%) (N: 9.70%) PV: Kd3 Ke5 Re7+ Kd6 Re8 Kd5
info string Rc7 -> 381 (V: 67.65%) (N: 80.40%) PV: Rc7 Rb8 Rb7 Kf5 Rxb8 Ke6 Kd3 Kd5 Rb5+ Kc6 Kc4
info string Kb3 -> 491 (V: 83.56%) (N: 4.58%) PV: Kb3 Ke5 Rc7 Kd6 Rxc8 Kd7 Rc5 Kd6 Kc4 Ke6
info string stm White winrate 76.24%

Without history:
position fen 2r5/R7/8/8/5k2/8/2K5/8 w - - 85 121
go nodes 1000
(==> This chooses Rc7)
info string Kd1 -> 0 (V: 61.53%) (N: 0.00%) PV: Kd1
info string Kd2 -> 0 (V: 61.53%) (N: 0.00%) PV: Kd2
info string Kb2 -> 0 (V: 61.53%) (N: 0.00%) PV: Kb2
info string Kb3 -> 0 (V: 61.53%) (N: 0.00%) PV: Kb3
info string Kd3 -> 0 (V: 61.53%) (N: 0.00%) PV: Kd3
info string Kb1 -> 0 (V: 61.53%) (N: 0.01%) PV: Kb1
info string Rc7 -> 500 (V: 70.63%) (N: 99.98%) PV: Rc7 Rb8 Rb7 Kf5 Rxb8 Ke6 Kd3 Kd5 Rb5+ Kc6 Kc4

dubslow commented 6 years ago

From @so-much-meta on May 15, 2018 7:35

As to the a7c7 blunder above, I think the history's only part of the problem... The other part of the issue is that the All Ones plane (last input plane) bug really messed up policies.

Good input data was being trained on a bad policy. Consider the effect of the negative log loss/cross entropy in these examples (non-buggy network with low outputs getting trained on a buggy high output).

Here's output from network ID 280. Notice that the a7c7 move only has high probability when the all ones input plane was buggy. Essentially, I think it was bad data like this that kept messing things up.

History + AllOnesBug Policy ('a7c7', 0.8687417), ('c2d3', 0.046122313), ('c2b3', 0.034792475), ('c2d2', 0.03021726), ('c2d1', 0.0111367665), ('c2b1', 0.006555821), ('c2b2', 0.0024336604), Value: 0.5331184417009354

History + NoBug ('c2d3', 0.47858498), ('c2b3', 0.13757008), ('c2d2', 0.13545689), ('c2d1', 0.08749167), ('c2b1', 0.08396132), ('c2b2', 0.07649834), ('a7c7', 0.000436759), Value: 0.5014338248874992

NoHistory + AllOnesBug Policy: ('a7c7', 0.99920577), ('c2d2', 0.00019510729), ('c2b3', 0.00015975242), ('c2d3', 0.00015850786), ('c2b1', 0.0001421545), ('c2d1', 7.9948644e-05), ('c2b2', 5.882576e-05)]), Value: 0.5555554553866386

NoHistory+NoBug ('c2d3', 0.34282845), ('c2b3', 0.22524531), ('c2d2', 0.14119184), ('c2b2', 0.09196934), ('c2d1', 0.09108826), ('c2b1', 0.08420463), ('a7c7', 0.023472117), Value: 0.49658756237477064
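To put a number on the cross-entropy effect mentioned above, the sketch below takes the two "History" outputs of network ID 280 from this comment and treats the buggy policy as the training target for the non-buggy one. That pairing is an assumption made for illustration (actual training targets come from search, not directly from a policy head), and `cross_entropy` is just the textbook formula.

```python
import math

# Network ID 280 outputs copied from the comment above.
buggy = {"a7c7": 0.8687417, "c2d3": 0.046122313, "c2b3": 0.034792475,
         "c2d2": 0.03021726, "c2d1": 0.0111367665, "c2b1": 0.006555821,
         "c2b2": 0.0024336604}
clean = {"c2d3": 0.47858498, "c2b3": 0.13757008, "c2d2": 0.13545689,
         "c2d1": 0.08749167, "c2b1": 0.08396132, "c2b2": 0.07649834,
         "a7c7": 0.000436759}


def cross_entropy(target, predicted):
    """-sum(target * log(predicted)) over the moves in `target`."""
    return -sum(t * math.log(predicted[move]) for move, t in target.items())


loss_vs_buggy = cross_entropy(buggy, clean)  # clean net, buggy target
loss_vs_self = cross_entropy(clean, clean)   # clean net, its own output
# Per-move contribution to the loss against the buggy target.
per_move = {m: -t * math.log(clean[m]) for m, t in buggy.items()}
print(loss_vs_buggy, loss_vs_self, max(per_move, key=per_move.get))
```

Nearly all of the loss against the buggy target comes from a7c7, so under this assumption the gradient would push hard toward raising exactly that move's probability.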

Now look how all of that changed by network 286, below - now the input with missing history is starting to show the bad policy:

History+AllOnesBug ('a7c7', 0.88481957), ('c2d3', 0.043222357), ('c2d2', 0.030274319), ('c2b3', 0.017787572), ('c2b1', 0.011131173), ('c2b2', 0.011077223), ('c2d1', 0.0016878309), 0.8049132525920868)

History+NoBug (OrderedDict([('c2d3', 0.35683072), ('c2b3', 0.17884524), ('c2d2', 0.15325584), ('c2b2', 0.1069537), ('c2d1', 0.10222348), ('c2b1', 0.10148263), ('a7c7', 0.00040832962)]), 0.5084156421944499)

NoHistory+AllOnesBug ('a7c7', 0.9984926), ('c2d3', 0.00064814655), ('c2b1', 0.00030561475), ('c2d2', 0.00022950297), ('c2b3', 0.00016663132), ('c2d1', 8.821991e-05), ('c2b2', 6.930062e-05)]), 0.8271850347518921)

NoHistory+NoBug ('c2b3', 0.35689142), ('a7c7', 0.227083), ('c2d2', 0.1410887), ('c2d3', 0.10505199), ('c2b1', 0.078001626), ('c2d1', 0.0670605), ('c2b2', 0.024822742)]), 0.49565275525674224)

By the time it got to network 288, the policy was really bad in this particular spot:

History+AllOnesBug ('a7c7', 0.81777406), ('c2b1', 0.0735284), ('c2d3', 0.045673266), ('c2d2', 0.044812158), ('c2d1', 0.011020878), ('c2b3', 0.0059179077), ('c2b2', 0.0012732706), 0.9999993741512299)

History+NoBug ('a7c7', 0.8040016), ('c2d3', 0.0970014), ('c2b3', 0.04580218), ('c2d2', 0.022658937), ('c2b1', 0.018647738), ('c2d1', 0.008990083), ('c2b2', 0.0028980032), 0.5951071679592133

NoHistory+AllOnesBug ('c2b1', 0.30733383), ('a7c7', 0.25477663), ('c2d2', 0.19509505), ('c2d3', 0.17735933), ('c2d1', 0.037348717), ('c2b3', 0.02388807), ('c2b2', 0.004198352), 0.9999998211860657

NoHistory+NoBug ('a7c7', 0.99980253), ('c2b1', 6.103614e-05), ('c2d3', 4.706335e-05), ('c2b3', 3.6989695e-05), ('c2b2', 2.2621784e-05), ('c2d2', 1.6375083e-05), ('c2d1', 1.3423687e-05), 0.6152948960661888

Now, at network 294, this is the current situation (ignoring buggy input plane, as it's no longer relevant):

History+NoBug ('c2d3', 0.32457772), ('c2b1', 0.19262017), ('c2d1', 0.15003791), ('c2b3', 0.12282815), ('c2d2', 0.10260171), ('c2b2', 0.08874603), ('a7c7', 0.018588383), 0.46542854234576225)

NoHistory+NoBug ('a7c7', 0.99916804), ('c2b1', 0.00017883514), ('c2d1', 0.00016860983), ('c2d3', 0.00016126267), ('c2b2', 0.00012590773), ('c2d2', 0.00010842814), ('c2b3', 8.8898094e-05)]), 0.43435238301754)

dubslow commented 6 years ago

From @gyathaar on May 15, 2018 12:26

Does it still blunder in those positions if you use --fpu_reduction=0.01 (instead of default 0.1) ?

dubslow commented 6 years ago

From @apleasantillusion on May 15, 2018 14:54

In the game nelsongribeiro posted, the same pattern holds true (tested with 292).

With history, she plays 124.Ke7 with a very high probability from policy (84.89%), and the response Qxd5 just taking the hanging queen is given only a 2.93% from policy.

Without history at the root, just FEN, she again plays Ke7 with high probability from policy (95.83%), and the Qxd5 response taking the hanging queen is given only 2.33% from policy.

With the FEN modified in only one way, setting 50-move rule counter to 0, Ke7's policy drops to 37.34%, and Qxd5 after Ke7 jumps to 95.07%

Now, from a purely objective standpoint in this particular position, none of this matters so much, since the position is losing to begin with, although forcing black to find the winning idea in the king and pawn ending is a much stronger way of playing than just hanging the queen.

Also, independently of that, the fact that taking a hanging queen is only ~2% from policy when the 50-move rule counter is high is a bit disturbing and is in line with the other examples I cited above.

In general, the variation in probability for Qxd5 based on the 50-move rule counter is quite odd.

In that exact position with black to move (6q1/4K2k/6p1/3Q1p1p/7P/6P1/8/8 b - - 0 0), here are probabilities for Qxd5 with different values of 50-move rule counter:

0: 68.26% 1: 76.71% 5: 89.40% 10: 91.63% 20: 92.48% 30: 94.28% 40: 89.57% 50: 77.83% 60: 83.95% 70: 52.39% 80: 66.84% 90: 11.43% 99: 1.06%

dubslow commented 6 years ago

From @nelsongribeiro on May 15, 2018 15:46

The really bad move is the move made just before that position: The FEN position is (4K3/6qk/3Q2p1/5p1p/7P/6P1/8/8 w - - 94 123).

It's a draw at this point; on move 123, ID 292 played 123. Qd5 instead of 123. Qe6.

The last time a pawn was moved was 75...f5, which makes this the 48th move after that.

EDIT: best move was wrong before..
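The move-count arithmetic in this comment can be checked against the halfmove clock in the FEN. The sketch below assumes the clock was reset by 75...f5 and that nothing else reset it afterwards; `halfmove_clock` is an illustrative helper, not engine code.

```python
def halfmove_clock(reset_move, reset_side, move, side_to_move):
    """Plies played since the resetting move, for the position in which
    `side_to_move` ("w" or "b") is about to play move number `move`."""
    def ply_index(move_number, side):
        # White's move 1 is ply 0, Black's move 1 is ply 1, and so on.
        return 2 * (move_number - 1) + (0 if side == "w" else 1)

    return ply_index(move, side_to_move) - ply_index(reset_move, reset_side) - 1


# Reset by Black's 75th move (75...f5); White to play move 123.
clock = halfmove_clock(75, "b", 123, "w")
plies_until_draw = 100 - clock  # the 50-move rule applies at 100 plies
print(clock, plies_until_draw)
```

The result matches the `94` in the FEN above, which leaves only a few plies before the 50-move rule applies.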

dubslow commented 6 years ago

From @apleasantillusion on May 15, 2018 20:49

Just to add to this, the Rxf3+ that's being tracked in the sheet at https://docs.google.com/spreadsheets/d/1884-iHTzR73AgFm19YYymg2yKnwhtHdaGyHLVnffLag/edit#gid=0 shows the same behavior.

With net 297, probability with various 50-move rule counter values:

0: 0.07% 10: 1.88% 25: 10.23% 50: 1.11% 90: 1.72% 99: 6.97%

That's some heavy variation just from changing the 50-move rule counter.

Also, the pattern is different with this one. In all the others, probabilities were worst at the very high counts, a bit better at very low counts, and best at counts around 30. Here that last trend maintains, but the other is more muddled.

dubslow commented 6 years ago

From @ASilver on May 17, 2018 5:7

I don't know if it is a consequence of the bug, or the new PUCT values which inhibit tactics, but the latest versions (I am watching 303 right now) have some appallingly weak ideas of king safety and closed positions. I am playing a match against id223 at 1m+1s and it is more than a tactical weakness issue, it is one of completely wrong evaluations, which 223 did not have, that is leading it to happily let its king be attacked until it is too late to save itself. I also saw more than one case where it thought a dead drawn blocked game, with no entry or pieces, was +2, while 223 thought it about equal. The result was that 303 preferred to sacrifice a pawn or two to not allow a draw, and then lost quickly thanks to the material it gave away.

eval303vs223-01

id303 is white, and id223 is black. Both are playing with v10 (different folders) with default settings.

dubslow commented 6 years ago

From @TCECfan on May 17, 2018 6:29

@ASilver I have watched many games from many of the CCLS gauntlets, and my overall view (from the perspective of spectator) is that her style changed markedly following the release of v0.8. In particular, she started showing:
Type 1: unstable or poor evaluations, and inflexible play.
Type 2: "buggy-looking" play in closed positions in which both engines are shuffling, in repeated positions (especially 3-fold), and in positions where the 50-move rule is key.
In my view, Type 1 has "followed the trajectory of the value head", whereas Type 2 has seemed to persist even as "the value head has partially recovered".

dubslow commented 6 years ago

From @so-much-meta on May 17, 2018 7:38

Regarding the rook blunder above: position fen 7r/8/R7/3k4/8/8/2K5/8 w - - 77 117 moves a6a5 d5e6 a5a6 e6f5 a6a5 f5f4 a5a7 h8c8 (or just "2r5/R7/8/8/5k2/8/2K5/8 w - - 85 121")

I did some more analysis. I think some here might find it useful. For one, I didn't look closely at rule50 when I looked at this before. Also, I found evidence that filling in history planes with fake data (just copy the current position, or the oldest one if some history is available) is probably best in case no history is available. Check out the attached PDFs with graphs.

What I did was I iterated the halfmove clock (rule50) from 0 through 91, created the FEN position at each halfmove clock, played each of the 8 moves, and determined the network value and policy under the following conditions: (1) Normal History (2) Fake History [All history planes are just copied from the first] (3) No History [All history planes are set to 0]

I did this with both net 288 and net 303. So that's 12 graphs. (halfmove clock is horizontal axis).
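The "fake history" and "no history" conditions differ only in how the missing history slots are padded. A minimal sketch of that padding, independent of the actual plane encoding, is shown below. The function name, the use of `None` to stand for all-zero planes, and the default of 8 history steps are assumptions for illustration.

```python
def fill_history(available, mode, steps=8, empty=None):
    """Pad a newest-first list of encoded positions to `steps` entries.

    mode "fake": repeat the oldest available position.
    mode "none": use `empty` (standing in for all-zero planes).
    """
    if not available:
        raise ValueError("need at least the current position")
    if mode == "fake":
        pad = available[-1]
    elif mode == "none":
        pad = empty
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    history = list(available[:steps])
    history += [pad] * (steps - len(history))
    return history


print(fill_history(["current"], "fake", steps=4))
print(fill_history(["current"], "none", steps=4))
```

With full history available both modes return the same list, so the difference only shows up for positions given as a bare FEN or early in a game.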

I also added the move a7c7 (the blunder move - where Leela does not want to take the free rook, so blunders again in the PV), and did the exact same process.

Some findings:

rook_blunder_Id288_Id303.pdf rook_blunder_Id288_Id303_a7c7.pdf

dubslow commented 6 years ago

From @mooskagh on May 17, 2018 8:20

The value graphs for network 303 are a bit misleading. It surprised me to have such a drop near ply 80, but then I realized that scale of Y axis is 0.45..0.49.

Would be more demonstrative if all graphs were with 0.0..1.0 scale.

Also, what is the correct move there? (there were several rook blunders in this thread, so it's not immediately clear which one you refer to).

dubslow commented 6 years ago

From @peterjacobi on May 17, 2018 9:1

Good examples for the longstanding problem with discovered attacks. Furunkel just posted in #game-analysis, but should be preserved here:

Re-looking at the recent ID300 gauntlet, discoveries are still a huge problem for Leela. In the first 50 games, Leela fell 6 times for a discovered attack. This was not always a game loser but still. I collected those 6 games if people want to use these for testing: https://pastebin.com/mffqHESV The critical moves in those were: game 1: 28.Rb7, game 2:23.Nd5, game 3: 35.Qe4 (ouch), game 4: 29.Rh6 (ouch), game 5: 30.Nc5, game 6: 15..Qd4 (ouch). The 3 I labeled "ouch" are particularly bad.

Also attached: leela300-discoveries.txt

dubslow commented 6 years ago

From @rwbc on May 17, 2018 9:2

Interesting blind spot (with mate) again in current ID 304 (from matchplay):

http://lczero.org/match_game/280156

dubslow commented 6 years ago

From @hsntgm on May 17, 2018 12:14

The latest IDs have a higher Elo, so yesterday I tested ID 303 in a 60-minute game against Stockfish and watched the whole game.

I was hoping for a good result, but it seems she still suffers deeply from this unknown bug. I am not a strong player, just 1700, and even I can see that her position evaluation and understanding of simple tactics are terribly broken compared to older networks. So these Elo gains are not real.

It seems it will be very difficult to get rid of the effect of the past bug. She has a broken heart. If you believe that you found the bug in the algorithm and corrected it, maybe you just have to start the whole training process again. If the problem was the training data and you corrected it, I think you have to restart as well, because she will need twice as much training as necessary to get rid of this effect.

At least simulate it from the beginning with all the training data (starting from the best clean network we know, ID253) and check the situation. Something is still going wrong, so please listen to people.

ASilver commented 6 years ago

OK, there is still a bug in Leela and I think it is clearly linked to the 50-move rule somehow. I was checking the PGN of my CLOP run using NN357 and a TC of 48s+0.1s. I found a loss in 180 moves, which is strange since I set it for adjudication by tablebase (5-piece). I then looked at it and saw this. First, move 133, the final pawn capture, 133. Kxf3:

50bug-01

The 50-move counter reset here means that it will be in effect on move 182 (move 50). Clearly a draw here, as Black has a pawn and can exchange the bishop any day for White's piece. But then on move 179.... It gives up its bishop for NO reason at all, and instantly loses:

50bug-02

I see no explanation other than the 50-move rule being a factor here. I am attaching the PGN. leela-50move-bug.zip