I am getting consistent deadlocks when using enough threads (>= 20) with the multiplexing backend. I confirmed that this only happens since https://github.com/LeelaChessZero/lc0/pull/147 was merged.
Edit: crem asked for a stack trace in Discord: https://gist.github.com/Cyanogenoid/6376bf49d00af1254aefbcd2c0aa59b9
I'm not able to reproduce that.
Also, it's very weird that it could only visit 5 nodes in 65 seconds. I can visit many more than that!
This only happens on machines in a compute cluster I have access to, which run Red Hat Enterprise Linux Server release 7.4 (Maipo). I am unable to reproduce it on other machines (Ubuntu 16.04, Arch Linux). nvidia-smi shows that lc0 has memory allocated but sits at 0% utilisation. htop looks like this when it gets stuck: [htop screenshot]
It gets stuck at different points: sometimes at 2 nodes, sometimes at 9, and occasionally it doesn't even print anything after the "Creating backend [cudnn]..." line. This is all using the latest master with any of the recent test nets (I tried a few from test 10 and one from test 9; all get stuck). The no-smart-pruning and minibatch-size parameters don't change anything. With 16 to 19 threads it sometimes works completely fine throughout and sometimes gets stuck very early on. In fact, I have never seen it get stuck after 15 nodes: either it stops before 15 and stays there, or it gets past 15 and is fine the whole way.
And you have double- and triple-confirmed that #147 is the culprit?
Yes. git checkout e4d0123b5223879f435713ae5ddd0fe971570dd3 (one commit before the #147 merge commit) followed by a build makes the command ./lc0 --backend=multiplexing --backend-opts=cudnn --threads=40 -w ../../weights.txt work fine over multiple runs. git checkout 407d841b339e8f8394ee80717fab2a833f2b0695 (the #147 merge commit) followed by a build makes the same command get stuck on multiple runs (though it's not guaranteed to get stuck, only most of the time).
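For reference, the whole check can be scripted roughly like this (a sketch; the build.sh step and running ./lc0 from the build output directory are assumptions based on the commands above, so adjust them to your setup):

```
# One commit before the #147 merge: works fine over multiple runs.
git checkout e4d0123b5223879f435713ae5ddd0fe971570dd3
./build.sh   # or your usual meson/ninja build
./lc0 --backend=multiplexing --backend-opts=cudnn --threads=40 -w ../../weights.txt

# The #147 merge commit itself: usually gets stuck.
git checkout 407d841b339e8f8394ee80717fab2a833f2b0695
./build.sh
./lc0 --backend=multiplexing --backend-opts=cudnn --threads=40 -w ../../weights.txt
```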
Wow, fascinating; sorry about the difficulties. It sounds like perhaps a race condition in some aspect of initialization. The modification replaced three lock acquisitions with one, so one could iterate at intermediate points to see where it starts to break, hoping that provides a clue (though it's not clear it would). Or maybe there was always a bug where workers can sometimes start before initialization is complete, and it was previously masked by the inefficiency (slowdown).
@Cyanogenoid could you try to compile in debug mode and get stack traces again? If it's not reproducible in debug, then debugoptimized mode should probably still work.
Also, would it be possible to get several stack traces from the same run? It seems to be not a deadlock but some kind of loop. After getting a stack trace, type cont in gdb to continue execution, then Ctrl+C again and thread apply all bt again, and repeat that 2-3 times.
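Concretely, the loop would look something like this (a sketch of the gdb session; the lc0 arguments are the ones from the command used earlier in the thread):

```
$ gdb --args ./lc0 --backend=multiplexing --backend-opts=cudnn --threads=40 -w ../../weights.txt
(gdb) run
# ... once it gets stuck, press Ctrl+C to interrupt ...
(gdb) thread apply all bt
(gdb) cont
# ... let it run for a few seconds, press Ctrl+C again ...
(gdb) thread apply all bt
(gdb) cont
# ... repeat the interrupt / backtrace cycle 2-3 times in total
```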
Thanks.
@Cyanogenoid Have you tried setting --max-prefetch=0? About half the threads get stuck in the related function, so maybe that's part of the cause.
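That is, presumably the reproduction command from earlier in the thread with the flag appended:

```
./lc0 --backend=multiplexing --backend-opts=cudnn --threads=40 --max-prefetch=0 -w ../../weights.txt
```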
Is it still reproducible with #165?
With --max-prefetch=0, I can't reproduce it anymore. With #165, I can't reproduce it anymore either.
Can you confirm that with 407d841b339e8f8394ee80717fab2a833f2b0695, which was the original problem commit, --max-prefetch=0 prevents the deadlocks?
Using https://github.com/LeelaChessZero/lc0/commit/79de494745022efbe2ffbf139f174946cad0e4fc (the last commit before the #165 merge), it usually gets stuck, and with --max-prefetch=0 it no longer gets stuck.
So both solutions work, independently? Fascinating.
Correct: using only #165 without --max-prefetch=0, or only --max-prefetch=0 without #165, both work.