Closed Mardak closed 2 years ago
This is sadly to be expected, as policies reflect the value estimates from training games. Whenever the path to conversion has too many possibilities of going wrong, the eval at low node counts will be worse than for staying in the same situation for longer. Only when shuffling isn't an option anymore does the slight eval loss of the correct conversion seem acceptable to the net. This will only be solved by giving some incentive for progress, as MLH does, for example.
I don't think it's expected; we should watch and think about a fix. MLH will likely help in endgames, but for midgame fortresses some debugging is needed. As you can see, for earlier nets it was less of a problem, and a0 seemingly didn't have this problem at all.
Actually, I think it makes sense to split this into two issues:
It seems they have the same root cause (poor policy), but many people consider them different.
Looking more closely at the search behavior when changing the rule50ply for various positions, the network policies do reflect the network's position WDL value. E.g., using ttl-mlh-added 384x30-2-swa-30000.pb and looking at the highest prior move and the pawn move from the first position above:
go nodes 1000 searchmoves f5e3
go nodes 2000 searchmoves c3c4
position fen 8/3k3p/1rrp1p1P/3RpNp1/p3P1P1/PnP2P2/KP6/3R4 w - - 0
info c3c4 N: 1000 (P: 3.02%) (WL: 0.59994) (D: 0.238) (M: 85.3)
info f5e3 N: 1000 (P: 23.24%) (WL: 0.87349) (D: 0.074) (M: 79.3)
position fen 8/3k3p/1rrp1p1P/3RpNp1/p3P1P1/PnP2P2/KP6/3R4 w - - 50
info c3c4 N: 1000 (P: 4.80%) (WL: 0.60310) (D: 0.237) (M: 85.0)
info f5e3 N: 1000 (P: 20.00%) (WL: 0.79350) (D: 0.129) (M: 84.9)
position fen 8/3k3p/1rrp1p1P/3RpNp1/p3P1P1/PnP2P2/KP6/3R4 w - - 95
info c3c4 N: 1000 (P: 11.13%) (WL: 0.59668) (D: 0.242) (M: 85.9)
info f5e3 N: 1000 (P: 13.24%) (WL: 0.37139) (D: 0.540) (M: 66.6)
Notice how the policy for the c4 pawn move increases from 3% to 11% while its WDL stays pretty consistent (as it's a zeroing move), whereas the policy for the highest-prior move Ne3 keeps decreasing, reflecting the network's more drawish WDL value.
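To make the trend easier to compare, here is a minimal Python sketch that parses the info lines quoted above and derives win/draw/loss probabilities from them. It assumes WL means W - L and that W + D + L = 1, so W = (1 + WL - D) / 2 and L = (1 - WL - D) / 2. The regex only targets the trimmed lines as pasted in this comment, not necessarily raw lc0 output, and the names `RAW`, `parse_info` and `summarize` are made up for illustration.

```python
import re

# Info lines copied from the output above, keyed by the rule50 counter in the FEN.
RAW = {
    0: [
        "info c3c4 N: 1000 (P: 3.02%) (WL: 0.59994) (D: 0.238) (M: 85.3)",
        "info f5e3 N: 1000 (P: 23.24%) (WL: 0.87349) (D: 0.074) (M: 79.3)",
    ],
    50: [
        "info c3c4 N: 1000 (P: 4.80%) (WL: 0.60310) (D: 0.237) (M: 85.0)",
        "info f5e3 N: 1000 (P: 20.00%) (WL: 0.79350) (D: 0.129) (M: 84.9)",
    ],
    95: [
        "info c3c4 N: 1000 (P: 11.13%) (WL: 0.59668) (D: 0.242) (M: 85.9)",
        "info f5e3 N: 1000 (P: 13.24%) (WL: 0.37139) (D: 0.540) (M: 66.6)",
    ],
}

# Matches only the trimmed format used in this comment (an assumption).
INFO_RE = re.compile(
    r"info (?P<move>\S+) N: (?P<n>\d+) \(P: (?P<p>[\d.]+)%\) "
    r"\(WL: (?P<wl>-?[\d.]+)\) \(D: (?P<d>[\d.]+)\) \(M: (?P<m>[\d.]+)\)"
)


def parse_info(line):
    """Parse one info line and derive W/D/L, assuming WL = W - L and W + D + L = 1."""
    match = INFO_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized info line: {line!r}")
    wl = float(match["wl"])
    draw = float(match["d"])
    win = (1.0 + wl - draw) / 2.0
    loss = (1.0 - wl - draw) / 2.0
    return {
        "move": match["move"],
        "policy": float(match["p"]),
        "win": win,
        "draw": draw,
        "loss": loss,
        "expected": win + draw / 2.0,  # expected score, equal to (1 + WL) / 2
        "moves_left": float(match["m"]),
    }


def summarize(raw):
    """Return {rule50: {move: parsed stats}} for every quoted line."""
    return {
        rule50: {row["move"]: row for row in map(parse_info, lines)}
        for rule50, lines in raw.items()
    }


if __name__ == "__main__":
    for rule50, moves in sorted(summarize(RAW).items()):
        for move, row in moves.items():
            print(
                f"rule50={rule50:>2} {move} P={row['policy']:5.2f}% "
                f"W={row['win']:.3f} D={row['draw']:.3f} L={row['loss']:.3f} "
                f"E={row['expected']:.3f}"
            )
```

Under those assumptions, at rule50 = 95 the searched expected score of c3c4 comes out higher than that of f5e3, even though the prior for f5e3 is still the larger of the two.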
I suppose the problem is more that the WDL head is too optimistic and only decides to push the pawn when the zeroing move looks less drawish than shuffling into a rule50 draw? Also, it looks like MLH isn't that helpful for this particular position and node count, as the knight move is predicted to need fewer moves.
As no similar issues have been reported with more recent networks, I'm closing this as stale.
Here are some test positions and the raw policy for the expected progressing move, e.g., the one played at TCEC17 SuFi.
From game 84, lc0 waited 80 ply before deciding to move the pawn https://www.tcec-chess.com/archive.html?season=17&div=sf&game=84:
From game 26, this 8-piece KRNPPvKRB endgame waited 91 ply before lc0 made some progress https://www.tcec-chess.com/archive.html?season=17&div=sf&game=26:
And a test 3-piece KRvK endgame shows the policies of some runs peaking at 87 ply, just before 88 ply would result in a cursed win for white: