Closed Chess321 closed 6 years ago
Can you give more details? Is there an error message? Can you copy any output text (specifically the initial engine output and then some more close to the end)?
@all12 could you maybe record a log file? (--logfile command line option, or LogFile UCI param)
How can I do these things?
I need a step by step explanation please.
I can use the Fritz 16 gui and CB14 gui and Aquarium 2018 gui if needed.
In that case, the easiest thing is probably to make a log file. The GUI should have an engine configuration dialog, where you will find an option named LogFile
. Depending on the GUI, it may have a file selector, so make a new file and then run as usual. Starting with v0.19 the log file records very detailed information on the internal workings of lc0, which will hopefully help us determine what is happening.
Thx for the link, maybe this would help me later. I used the Aquarium 2018 gui and here is the logfile: LogFile.txt
Thanks for the log. I can see that there is something that looks like a bug. After ponder, search thread doesn't exit.
As an unrelated but suspicions issue, it's not clear why log starts from the middle of something.
@all12 do you get crashes even with pondering disabled?
Seems that something bad happens when ponderhit
arrives before search starts to search.
I'm not able to reproduce it yet though.
@borg323,
yes.
Should I do a logfile with ponder disabled?
1108 21:05:29.558816 6648 ..\src/chess/uciloop.cc:218] << bestmove d2d6 ponder d7d6
1108 21:05:29.558843 6648 ..\src/mcts/search.cc:615] End a watchdog thread.
1108 21:05:29.559963 6372 ..\src/chess/uciloop.cc:131] >> setoption name UCI_Chess960 value false
1108 21:05:29.560012 6372 c:\projects\lc0\src\utils/exception.h:39] Exception: Unknown option: UCI_Chess960
1108 21:05:29.560054 6372 ..\src/chess/uciloop.cc:218] << error Unknown option: UCI_Chess960
1108 21:05:29.560093 6372 ..\src/chess/uciloop.cc:131] >> position startpos moves a2a3 e7e5 c2c4 g8f6 d2d3 b8c6 g1f3 d7d5 c4d5 f6d5 e2e4 d5b6 f1e2 f8e7 c1e3 f7f5 b2b4 f5e4 d3e4 d8d1 e2d1 c8e6 b1d2 e8c8 e1g1 a7a5 b4a5 c6a5 f3e5 e7f6 a1c1 f6e5 e3b6 a5c6 b6e3 e5b2 c1c2 b2a3 d1e2 c6b4 c2c5 b4c6 c5c2 c6b4 c2c5 b4c6 c5g5 a3b2 f1b1 b2d4 d2f3 h7h6 g5b5 b7b6 f3d4 c6d4 b5b2 d4e2 b2e2 c8b7 e2b2 h8e8 f2f3 d8d6 h2h4 e8d8 h4h5 d6d3 g1f2 d8d6 g2g4 d6c6 b1c1 d3d7 b2d2 c6d6 d2d6 d7d6
1108 21:05:29.560179 6372 ..\src/mcts/search.cc:634] Aborting search, if it is still active.
# Previous search destroyed as expected after position command.
1108 21:05:29.772701 6372 ..\src/mcts/search.cc:648] Search destroyed.
# Hm... weird there is wtime but not btime. I wonder if that goes well with ponder.
# Anyway, ponder starts.
1108 21:05:29.772783 6372 ..\src/chess/uciloop.cc:131] >> go ponder wtime 3359 winc 2000
1108 21:05:29.772881 6372 ..\src/engine.cc:378] Limits: visits:4000000000 playouts:-1 depth:-1 infinite:1
1108 21:05:29.773051 6076 ..\src/mcts/search.cc:586] Start a watchdog thread.
1108 21:05:29.773136 7392 C:\projects\lc0\src\mcts/search.h:200] Started search thread.
1108 21:05:29.774994 7884 C:\projects\lc0\src\mcts/search.h:200] Started search thread.
# Request to stop ponder and start normal search.
1108 21:05:29.775006 6372 ..\src/chess/uciloop.cc:131] >> ponderhit
# Entering destructor.
1108 21:05:29.775062 6372 ..\src/mcts/search.cc:634] Aborting search, if it is still active.
# Watchdog thread is finished (we don't have the same notification for search thread, maybe we should add it).
1108 21:05:29.775097 6076 ..\src/mcts/search.cc:615] End a watchdog thread.
# Destructor never finish, waiting for some threads to return.
Here is the logfile from the Fritz 16 gui: LogFile.txt
Looking at the first log I noticed multiple times where there is a 700ms wait, but search time thinks <100ms has passed by the time its calculating the search deadline. Which seems impossible...
Looking at the second log I see that 'stop' is synchronous and stops us from recording the arrival time of the position command properly since its blocked the main thread.
I managed to reproduce it (at least the ponder issue) with the blas backend using short time control and 4 threads per side on a 4 core processor to increase contention. No time to look at it now, but full log attached, the first 3 games are OK the 4th hanged .
I don't think this latest repro is the same the one mooskagh analyzed above. This one seems to have a fairly straight forward explanation, Search threads check for stop before doing any work, which can mean they exit without doing anything, and thus best move will never fire and watch dog never exits. Same explanation doesn't work for the ponderhit path - as abort sets the bestmove fired flag allowing the watch dog to exit.
Here is the logfile from the ChessBase 14 gui: LogFile.txt
Here is the logfile from the Aquarium 2018 gui but this time with ponder off: LogFile.txt
@all12 can you try https://ci.appveyor.com/api/buildjobs/o7xxho302u5y5i1n/artifacts/build%2Flc0.exe? This fixes the problem I noticed so there is some hope it may help in your case.
@borg323 Here is the logfile from the Aquarium 2018 gui with ponder on: LogFile.txt
I tried it and it has fixed one crash bug, when I start infinite analysis and go forth and back and forth... in the notation or click in the notation.
But it still sometimes crashes during engine matches and also sometimes doesn't start to move at the beginning of the game during engine matches.
Here is the logfile from the Aquarium 2018 gui with ponder off: LogFile.txt
I tried it and it has fixed one crash bug, when I start infinite analysis and go forth and back and forth... in the notation or click in the notation.
But it still sometimes loses on time during engine matches and also sometimes doesn't start to move at the beginning of the game during engine matches.
Fritz 16 gui: LogFile.txt
ChessBase 14 gui: LogFile.txt
@all12 can you try https://ci.appveyor.com/api/buildjobs/uek9j6up9csu8a28/artifacts/build%2Flc0.exe? This reverts a bug fix that touches code used mainly with AMD GPUs. As such, if it works it is not a fix for this issue, and may only be masking it, but will point us in the right direction.
@borg323 Here is the logfile from the Aquarium 2018 gui with ponder on: LogFile.txt I started infinite analysis and I saw in the task manager that LC0 was running but the first pv was displayed after 1 minute and 12 seconds.
The delay was because the OpenCL tuner had to run again to find a set of parameters with good performance. The next run should start searching immediately.
I don't have the time right now to look through the log thoroughly, but it looks OK. Is this correct?
@borg323, ok good to know. Yes the next run started immediately. LC0 still crashes or loses on time with ponder on.
@borg323 Here is the logfile from the Aquarium 2018 gui with ponder off: LogFile.txt It's great, it fixed the loses on time bug and it also fixed the Lc0 crashes bug.
Fritz 16 gui: LogFile.txt
ChessBase 14 gui: LogFile.txt
Same corruption as https://github.com/LeelaChessZero/lc0/issues/507.
1112 13:35:50.31411 3180 ..\src/chess/uciloop.cc:131] >> go ponder wtime 9000 btime 2000 winc 1000 binc 1000
1112 13:35:50.31473 3180 c:\projects\lc0\src\utils/exception.h:39] Exception: Bad fen string: ÿmbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
1112 13:35:50.31582 3180 ..\src/chess/uciloop.cc:218] << error Bad fen string: ÿmbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
1112 13:35:51.470267 3180 ..\src/chess/uciloop.cc:131] >> ponderhit
1112 13:35:51.470346 3180 c:\projects\lc0\src\utils/exception.h:39] Exception: Bad fen string: ÿmbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
1112 13:35:51.470387 3180 ..\src/chess/uciloop.cc:218] << error Bad fen string: ÿmbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
1112 13:36:00.899756 3180 ..\src/chess/uciloop.cc:131] >> stop
@all12 can you test with pure blas in order to check if it still happens without OpenCL ? Thanks.
From a quick look, both seem to have finished fine. Is this correct?
@borg323, both doesn't crashes but still loses on time.
@frpays,
Pure blas with ponder off: Fritz 16, ponder off, 1+1, LogFile.txt
@all12 My apologies, I wasn't very clear earlier. The time losses are somewhat expected, we have high overhead per move and a small increment is not always enough. Try increasing the MoveOverheadMs
parameter to 1000 (or 2000) from the default 200, this should help. We will improve this eventually, but right now we are focused on the actual crashes to release v0.19.
@borg323
Should I do this for the opencl or blas?
This is just a general recommendation to avoid time losses with small increment until we improve the situation. BTW, for better interaction, we can chat on discord where most of us can answer questions quickly.
1000, loses on time. Aquarium 2018, MoveOverheadMs = 1000, ponder on, 1+1, LogFile.txt
2000, loses on time and freezes gui. Aquarium 2018, MoveOverheadMs = 2000, ponder on, 1+1, LogFile.txt
@borg323, the bug is still there.
I already discussed it with many chess players which also has the bug and it turns out that we all have the AMD R9 ... GPUs (different R9 designs).