LeelaChessZero / lc0

The rewritten engine, originally for tensorflow. Now all other backends have been ported here.
GNU General Public License v3.0

Deadlock when using many threads with multiplexing backend #157

Closed Cyanogenoid closed 6 years ago

Cyanogenoid commented 6 years ago

I am getting consistent deadlocks when using enough threads (>= 20) with the multiplexing backend. I confirmed that this only happens since https://github.com/LeelaChessZero/lc0/pull/147 was merged.

$ lc0 --backend=multiplexing --backend-opts=cudnn --threads=20
       _
|   _ | |
|_ |_ |_| built Jul  9 2018
go infinite
Found network file: ./weights.txt
Creating backend [multiplexing]...
Creating backend [cudnn]...
info depth 1 seldepth 2 time 26 nodes 3 score cp 12 hashfull 0 nps 115 pv c2c4 e7e6 c4c5
info depth 1 seldepth 3 time 30 nodes 4 score cp 4 hashfull 0 nps 133 pv c2c4 c7c5 b1c3
info depth 1 seldepth 3 time 5031 nodes 5 score cp 3 hashfull 0 nps 0 pv c2c4 c7c5 b1c3
info depth 1 seldepth 3 time 10032 nodes 5 score cp 3 hashfull 0 nps 0 pv c2c4 c7c5 b1c3
info depth 1 seldepth 3 time 15033 nodes 5 score cp 3 hashfull 0 nps 0 pv c2c4 c7c5 b1c3
info depth 1 seldepth 3 time 20034 nodes 5 score cp 3 hashfull 0 nps 0 pv c2c4 c7c5 b1c3
info depth 1 seldepth 3 time 25035 nodes 5 score cp 3 hashfull 0 nps 0 pv c2c4 c7c5 b1c3
info depth 1 seldepth 3 time 30036 nodes 5 score cp 3 hashfull 0 nps 0 pv c2c4 c7c5 b1c3
info depth 1 seldepth 3 time 35037 nodes 5 score cp 3 hashfull 0 nps 0 pv c2c4 c7c5 b1c3
info depth 1 seldepth 3 time 40038 nodes 5 score cp 3 hashfull 0 nps 0 pv c2c4 c7c5 b1c3
info depth 1 seldepth 3 time 45039 nodes 5 score cp 3 hashfull 0 nps 0 pv c2c4 c7c5 b1c3
info depth 1 seldepth 3 time 50040 nodes 5 score cp 3 hashfull 0 nps 0 pv c2c4 c7c5 b1c3
info depth 1 seldepth 3 time 55041 nodes 5 score cp 3 hashfull 0 nps 0 pv c2c4 c7c5 b1c3
info depth 1 seldepth 3 time 60042 nodes 5 score cp 3 hashfull 0 nps 0 pv c2c4 c7c5 b1c3
info depth 1 seldepth 3 time 65043 nodes 5 score cp 3 hashfull 0 nps 0 pv c2c4 c7c5 b1c3

Edit: crem asked for a stack trace in Discord: https://gist.github.com/Cyanogenoid/6376bf49d00af1254aefbcd2c0aa59b9

mooskagh commented 6 years ago

I'm not able to reproduce that.

mooskagh commented 6 years ago

Also it's very weird that it could only visit 5 nodes in 65 seconds. I could visit more!

Cyanogenoid commented 6 years ago

This is only happening on the machines in a compute cluster I have access to, which are running Red Hat Enterprise Linux Server release 7.4 (Maipo). I am unable to reproduce this on other machines (Ubuntu 16.04, Arch Linux). nvidia-smi shows memory allocated to lc0, but at 0% GPU utilisation. htop looks like this when it gets stuck:

[htop screenshot]

It gets stuck at different points: sometimes at 2 nodes, sometimes at 9 nodes, and occasionally it doesn't even print anything after Creating backend [cudnn].... This is all using the latest master with any of the recent test nets (I tried a few from test 10 and one from test 9; all get stuck). The no-smart-pruning and minibatch-size parameters don't change anything. Between 16 and 19 threads it sometimes works completely fine throughout and sometimes gets stuck very early on. In fact, I have never seen it get stuck after 15 nodes: either it stops before 15 and stays there, or it goes past and is fine the whole way.

dubslow commented 6 years ago

And you have double-triple extra confirmed that it was #147 which is the culprit?

Cyanogenoid commented 6 years ago

Yes. git checkout e4d0123b5223879f435713ae5ddd0fe971570dd3 (one commit before the #147 merge commit) followed by a build makes the command ./lc0 --backend=multiplexing --backend-opts=cudnn --threads=40 -w ../../weights.txt work fine over multiple runs. git checkout 407d841b339e8f8394ee80717fab2a833f2b0695 (the #147 merge commit) followed by a build makes the same command get stuck on multiple runs (though it's not guaranteed to get stuck, only most of the time).

dje-dev commented 6 years ago

Wow, fascinating. Sorry about the difficulties. It sounds like perhaps a race condition in some aspect of initialization. The modification replaced 3 acquisitions with one, so one could iterate at intermediate points to see where it starts to break, hoping that would provide a clue (though it's not clear it would). Or maybe there was always a bug where workers can sometimes start before initialization is complete, and it was previously masked by the inefficiency (slowdown).
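To make the "workers start before initialization is complete" hypothesis concrete, here is a minimal, self-contained C++ sketch. It is not lc0's actual code, and the Backend/Init/Worker names are invented for illustration; it only shows the general shape of the failure mode and a condition-variable guard that closes it.

```cpp
// Sketch (not lc0 code) of a startup race: worker threads are launched before
// Init() finishes. Without the cv.wait() guard, a worker may observe a
// half-initialized queue and exit or spin instead of doing its work.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Backend {
  std::mutex mu;
  std::condition_variable cv;
  bool initialized = false;
  std::queue<int> work;

  void Init() {
    std::lock_guard<std::mutex> lock(mu);
    for (int i = 0; i < 100; ++i) work.push(i);
    initialized = true;
    cv.notify_all();  // Wake workers that started before initialization ended.
  }

  void Worker() {
    std::unique_lock<std::mutex> lock(mu);
    // The guard: without this wait, a worker launched before Init() completes
    // could see an empty queue and return immediately.
    cv.wait(lock, [this] { return initialized; });
    while (!work.empty()) {
      int item = work.front();
      work.pop();
      lock.unlock();
      (void)item;  // ... evaluate the item outside the lock ...
      lock.lock();
    }
  }
};

int main() {
  Backend b;
  std::vector<std::thread> threads;
  for (int i = 0; i < 20; ++i) threads.emplace_back(&Backend::Worker, &b);
  b.Init();  // Deliberately after thread creation, to mimic the race window.
  for (auto& t : threads) t.join();
  std::cout << "done\n";
}
```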

mooskagh commented 6 years ago

@Cyanogenoid could you try to compile in debug mode and get stack traces again? If it's not reproducible in debug, then debugoptimized mode should probably still work.

Also, would it be possible to get several stack traces from the same run? It seems to be not a deadlock but some kind of loop. After getting a stack trace, type cont in gdb to continue execution, then Ctrl+C again, then thread apply all bt again, and repeat that 2-3 times.

Thanks.
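For reference, the requested workflow would look roughly like the transcript below. This is illustrative, not from the thread: the lc0 invocation is the one from the original report, and the rest is standard gdb usage.

```
$ gdb --args ./lc0 --backend=multiplexing --backend-opts=cudnn --threads=20
(gdb) run
... wait for the hang, then press Ctrl+C ...
(gdb) thread apply all bt
(gdb) cont
... Ctrl+C again once it hangs ...
(gdb) thread apply all bt
(gdb) cont
... repeat 2-3 times ...
```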

dubslow commented 6 years ago

@Cyanogenoid Have you tried setting --max-prefetch=0? About half the threads get stuck in the related function, maybe that's part of the cause.

mooskagh commented 6 years ago

#165 should fix that (or maybe it just masks it).

Is it still reproducible with #165?

Cyanogenoid commented 6 years ago

With --max-prefetch=0, I can't reproduce it anymore. With #165, I can't reproduce it anymore either.

dubslow commented 6 years ago

Can you confirm that with 407d841b339e8f8394ee80717fab2a833f2b0695, which was the original problem commit, --max-prefetch=0 prevents deadlocks?

Cyanogenoid commented 6 years ago

Using https://github.com/LeelaChessZero/lc0/commit/79de494745022efbe2ffbf139f174946cad0e4fc (the last commit before the #165 merge), it usually gets stuck, and it no longer gets stuck with --max-prefetch=0.

dubslow commented 6 years ago

So both solutions work, independently? Fascinating.

Cyanogenoid commented 6 years ago

Correct: using only #165 without --max-prefetch=0, or only --max-prefetch=0 without #165, both work.