LeelaChessZero / lc0

The rewritten engine, originally for tensorflow. Now all other backends have been ported here.
GNU General Public License v3.0
2.44k stars · 528 forks

Policy favors shuffling instead of progress until necessary #1229

Closed Mardak closed 2 years ago

Mardak commented 4 years ago

Here are some test positions and the raw policy for the expected progressing move, i.e., the one played at the TCEC17 SuFi.

From game 84, lc0 waited 80 ply before deciding to move the pawn https://www.tcec-chess.com/archive.html?season=17&div=sf&game=84: [images: c3 move; c3 policy vs rule50ply]

From game 26, this 8-piece KRNPPvKRB endgame waited 91 ply before lc0 made some progress https://www.tcec-chess.com/archive.html?season=17&div=sf&game=26: [images: f4 move; f4 policy vs rule50ply]

And a test 3-piece KRvK endgame shows policies from some runs peaking at 87 ply, just before 88 ply would result in a cursed win for white: [images: Ka2 move; Ka2 policy vs rule50ply]
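To reproduce these measurements, one could sweep the rule50 halfmove clock of each test position and feed the resulting FENs to the engine over UCI. A minimal sketch (the helper `with_halfmove_clock` is hypothetical, not part of lc0):

```python
# Hypothetical helper for sweeping a position's rule50 halfmove clock:
# field 5 of a standard 6-field FEN is the halfmove clock.

def with_halfmove_clock(fen: str, clock: int) -> str:
    """Return a copy of `fen` with its halfmove-clock field set to `clock`."""
    fields = fen.split()
    if len(fields) < 6:
        raise ValueError("expected a full 6-field FEN")
    fields[4] = str(clock)
    return " ".join(fields)

base = "8/3k3p/1rrp1p1P/3RpNp1/p3P1P1/PnP2P2/KP6/3R4 w - - 0 1"
for clock in (0, 50, 95):
    print(f"position fen {with_halfmove_clock(base, clock)}")
```

Each printed `position fen ...` line can then be followed by a `go nodes ...` command to sample the policy at that clock value.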

Naphthalin commented 4 years ago

This is sadly to be expected, as policies reflect the value estimates from training games. Whenever the path to conversion has too many possibilities of going wrong, the eval at low node counts will be worse than for staying in the same situation longer, and only when shuffling is no longer an option does the slight eval loss of the correct conversion seem acceptable to the net. This will only be solved by giving some incentive for progress, as MLH does for example.
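The incentive Naphthalin mentions can be illustrated with a toy moves-left adjustment: when the eval says the side to move is winning, moves whose predicted remaining-moves count M is lower get a small bounded bonus. This is only a sketch; lc0's actual moves-left utility is more involved, and the function name, slope, and cap below are made-up illustration values, not lc0 parameters:

```python
# Toy sketch of an MLH-style progress incentive (NOT lc0's actual formula).
# When winning (q > 0), faster conversions (smaller m_child) get a bonus;
# when losing (q < 0), dragging the game out is rewarded instead.

def mlh_adjusted_q(q: float, m_child: float, m_parent: float,
                   slope: float = 0.01, cap: float = 0.05) -> float:
    """Adjust a child's Q by a bounded bonus proportional to how many
    moves the child saves (or wastes) versus the parent's estimate."""
    m_diff = m_parent - m_child                     # positive = converts faster
    bonus = max(-cap, min(cap, slope * m_diff))     # clamp the adjustment
    sign = 1 if q > 0 else (-1 if q < 0 else 0)     # no nudge when dead equal
    return q + sign * bonus

# Winning position: the faster conversion ends up with the higher adjusted Q,
# so search no longer prefers shuffling even if raw Q is identical.
fast = mlh_adjusted_q(0.6, m_child=70.0, m_parent=85.0)
slow = mlh_adjusted_q(0.6, m_child=90.0, m_parent=85.0)
```

Without such a term, two moves with equal Q but very different M are indistinguishable to search, which is exactly the shuffling behavior reported above.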

mooskagh commented 4 years ago

I don't think it's expected; we should watch and think about a fix. MLH will likely help in endgames, but for midgame fortresses some debugging is needed. As you can see, for earlier nets it was less of a problem, and a0 seemingly didn't have this problem at all.

mooskagh commented 4 years ago

Actually, I think it makes sense to split this into two issues:

It seems they have the same root cause (poor policy), but many people consider them different.

Mardak commented 4 years ago

Looking more closely at the search behavior when changing the rule50 ply for various positions, the network policies do reflect the network's WDL value for the position. E.g., using ttl-mlh-added 384x30-2-swa-30000.pb and looking at the highest-prior move and the pawn move from the first position above: [image: c3 move]

go nodes 1000 searchmoves f5e3
go nodes 1000 searchmoves c3c4

position fen 8/3k3p/1rrp1p1P/3RpNp1/p3P1P1/PnP2P2/KP6/3R4 w - - 0 1
info c3c4 N:    1000 (P:  3.02%) (WL:  0.59994) (D: 0.238) (M: 85.3)
info f5e3 N:    1000 (P: 23.24%) (WL:  0.87349) (D: 0.074) (M: 79.3)

position fen 8/3k3p/1rrp1p1P/3RpNp1/p3P1P1/PnP2P2/KP6/3R4 w - - 50 1
info c3c4 N:    1000 (P:  4.80%) (WL:  0.60310) (D: 0.237) (M: 85.0)
info f5e3 N:    1000 (P: 20.00%) (WL:  0.79350) (D: 0.129) (M: 84.9)

position fen 8/3k3p/1rrp1p1P/3RpNp1/p3P1P1/PnP2P2/KP6/3R4 w - - 95 1
info c3c4 N:    1000 (P: 11.13%) (WL:  0.59668) (D: 0.242) (M: 85.9)
info f5e3 N:    1000 (P: 13.24%) (WL:  0.37139) (D: 0.540) (M: 66.6)

Notice how the c4 pawn-move policy increases from 3% to 11% while its WDL stays pretty consistent (as c4 is a zeroing move), whereas the highest-prior Ne3 move's policy keeps decreasing, reflecting the network's increasingly drawish WDL value for it.
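The trend Mardak describes can be extracted programmatically from lc0's verbose move stats. A small parser sketch for the `info` lines quoted above (the regex and function are illustrative, not lc0 code, and only cover the fields shown here):

```python
import re

# Parse lc0 verbose-stats lines like:
#   info c3c4 N:    1000 (P: 11.13%) (WL:  0.59668) (D: 0.242) (M: 85.9)
# pulling out move, visit count N, policy prior P, and WL for plotting
# against the rule50 ply.

INFO_RE = re.compile(
    r"info\s+(?P<move>\S+)\s+N:\s*(?P<n>\d+)\s+"
    r"\(P:\s*(?P<p>[\d.]+)%\)\s+\(WL:\s*(?P<wl>-?[\d.]+)\)"
)

def parse_info(line: str) -> dict:
    m = INFO_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognized stats line: {line!r}")
    return {"move": m["move"], "n": int(m["n"]),
            "p": float(m["p"]) / 100.0, "wl": float(m["wl"])}

stats = parse_info(
    "info c3c4 N:    1000 (P: 11.13%) (WL:  0.59668) (D: 0.242) (M: 85.9)")
```

Running this over the three rule50 settings above makes the policy rise of c3c4 (3% → 11%) and the WL collapse of f5e3 easy to chart.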

I suppose the problem is more that the WDL head is too optimistic and the net only decides to push the pawn when the zeroing move looks less drawish than shuffling into a rule50 draw? Also, it looks like MLH isn't all that helpful in this particular position / at these searched node counts, as the knight move predicts fewer moves remaining.

Naphthalin commented 2 years ago

As there haven't been reported similar issues with more recent networks, I'm closing this as stale.