Open lex312 opened 4 years ago
As mentioned in Discord, that's most likely either a PSU that is too weak for two GPUs or poor GPU cooling. Both are quite common in dual-GPU systems.
I have a Corsair AX1600i PSU and plenty of high-end air cooling. One GPU reaches only 74 degrees Celsius; the other one reaches 50.
The problem always happens, whether I use -gpu 0 -gpu 1 or only one GPU.
You didn't specify which GPU you have and which backend is being used. In any case, the client is unlikely to be causing this, but the way lc0 is called may be loading the GPU too much. Try lowering the parallelism to 4 (or less) to see if this helps.
Finally, note that the client only uses one GPU by default. Adding a second -gpu to the command line just overrides the first. If you want to run on both GPUs, the most efficient way is to run a second client instance with a different -gpu number.
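A minimal sketch of the one-client-per-GPU setup described above, written in Python. It only uses the client flags quoted in this thread (-run, -gpu, -report-gpu, -report-host, -user, -password); the helper name `build_client_commands` and the placeholder credentials are hypothetical.

```python
from typing import List


def build_client_commands(gpus: List[int], run: int, user: str, password: str,
                          client: str = "client") -> List[List[str]]:
    """Build one client command line per GPU index.

    Only flags quoted in this thread are used; the function name and the
    default executable name are illustrative assumptions.
    """
    commands = []
    for gpu in gpus:
        commands.append([
            client,
            "-run", str(run),
            "-gpu", str(gpu),      # exactly one -gpu flag per client instance
            "-report-gpu",
            "-report-host",
            "-user", user,
            "-password", password,
        ])
    return commands


# Print the commands so each one can be pasted into its own cmd window.
for cmd in build_client_commands([0, 1], run=3, user="(name)", password="(name)"):
    print(" ".join(cmd))
```

Each printed line is meant to be started in a separate cmd window from the client directory, so that no single client ever receives two -gpu flags.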
I have two identical RTX 2080 Ti cards. I'm running the client with this: client -run 3 -gpu 0 -gpu 1 -report-gpu -report-host -user (name) -password (name) As you can see, I set no backend and no parallelism.
Okay, then I will use two clients instead of one. But I still have the same problem. Also note that I think most people don't know that the second -gpu overrides the first when only one client is in use, so this should be fixed as well.
When using chess guis I use: (backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1) and roundrobin.
I asked about backend in case you were using the relatively new dx12, but I see you use cudnn-fp16 so this shouldn't be an issue. Is this only happening on run 3?
No, this also happens when running run 1 or 2.
I have the same problem when I use a power limit of 40%.
When you run the client, the output near the top contains the exact lc0 command line used. Can you try this on its own to confirm the client has nothing to do with this?
Example from an old log I had: /content/lc0/build/lc0 selfplay --backend-opts=backend=cudnn-fp16 --parallelism=32 --visits=10000 --cpuct=2.5 --cpuct-at-root=2.5 --root-has-own-cpuct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.90 --temp-endgame=0.75 --temp-cutoff-move=16 --temp-visit-offset=-0.8 --fpu-strategy=absolute --fpu-value=-1.0 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000012 --policy-softmax-temp=1.2 --resign-wdlstyle=true --training=true --weights=client-cache/3eb9d62ecc6aa2a84b7cdb789c50702a02477cf969949cf7ed788b71a3ea9cfa
@borg323 What exactly do you want me to do?
On my machine it looks like this: Z:\LC0>client -run 3 -gpu 0 -report-gpu -report-host -user (name) -password (name) Lc0 client version 26 2020/05/08 15:00:09 lc0main.go:956: serverParams: [--visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-params=true --resign-percentage=4.0 --r esign-playthrough=20 --temperature=0.8 --temp-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=reduction --fpu-value=0.23 --fpu-strate gy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle=true --noise-epsilon=0.1 --noise-alp ha=0.12 --sticky-endgames=true --openings-pgn=books/960fen.pgn --openings-mode=s huffled --moves-left-max-effect=0.2 --moves-left-threshold=0.0 --moves-left-slop e=0.009 --moves-left-quadratic-factor=1.0 --moves-left-constant-factor=0.0] Args: [Z:\LC0/lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --paralle lism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-par ams=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --tem p-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=red uction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle= true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pg n=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves- left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 -- moves-left-constant-factor=0.0 --training=true --weights=client-cache\fdf4c93b57 96723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9] | | | | | || v0.25.1+git.69105b4 built Apr 30 2020 id name Lc0 v0.25.1+git.69105b4 id author The LCZero Authors. Loading weights file from: client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a58 2ebf028ece402eb6fe50c3a9 Creating backend [multiplexing]... 
Creating backend [cudnn-fp16]... CUDA Runtime version: 10.0.0 Cudnn version: 7.4.2 Latest version of CUDA supported by the driver: 10.1.0 GPU: GeForce RTX 2080 Ti GPU memory: 11 Gb GPU clock frequency: 1545 MHz GPU compute capability: 7.5 PGN: [FEN "bnrnkbqr/pppppppp/8/8/8/8/PPPPPPPP/BNRNKBQR w KQkq - 0 1"]
Then the command to run would be:
Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --parallelism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --temp-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=reduction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle=true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pgn=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves-left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 --moves-left-constant-factor=0.0 --training=true --weights=client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9
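The command above was copied by hand from the `Args: [...]` line of the client output. As a rough sketch, the same extraction could be done in Python as below. The function name `extract_lc0_command` is hypothetical, and the code assumes the log contains an `Args: [` ... `]` block like the one pasted in this thread. It cannot repair tokens that the console split in the middle when wrapping lines (for example "--paralle lism"), so the result may still need manual cleanup.

```python
import re
from typing import Optional


def extract_lc0_command(client_output: str) -> Optional[str]:
    """Return the text inside the first 'Args: [...]' block, or None.

    Assumes the client prints the lc0 invocation as 'Args: [ ... ]',
    as in the output pasted in this thread.
    """
    match = re.search(r"Args:\s*\[(.*?)\]", client_output, flags=re.DOTALL)
    if match is None:
        return None
    # Collapse line breaks and runs of whitespace into single spaces.
    return " ".join(match.group(1).split())


sample = (
    "serverParams: [--visits=10000]\n"
    "Args: [lc0 selfplay --backend-opts=backend=cudnn-fp16,gpu=0\n"
    " --parallelism=32]\n"
)
print(extract_lc0_command(sample))
```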
@borg323 I tried to run the command and got this:
Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 - -parallelism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-c puct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0 .8 --temp-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-stra tegy=reduction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-r oot=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-w dlstyle=true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --ope nings-pgn=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves-left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-facto r=1.0 --moves-left-constant-factor=0.0 --training=true --weights=client-cache\fd f4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9 | | | | | |_| v0.25.1+git.69105b4 built Apr 30 2020 id name Lc0 v0.25.1+git.69105b4 id author The LCZero Authors. Loading weights file from: client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a58 2ebf028ece402eb6fe50c3a9 Unhandled exception: Cannot read weights from client-cache\fdf4c93b5796723fd1ec8 8b09dcc92474a727a582ebf028ece402eb6fe50c3a9
Also I got a little taskmanager window with the information: lc0.exe doesn't work anymore.
You are probably not running lc0 from the same directory the client (and lc0) are in. I assume this is Z:\LC0. There should be books and client-cache subdirectories, the first one containing 960fen.pgn and the second one containing fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9.
@borg323
Inside Z:\LC0 I have lc0.exe, client.exe and the other basic lc0 files. The books and client-cache subdirectories are there too. 960fen.pgn is inside books, and inside client-cache I have the right fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9
I want to donate GPUs to -run 3. That's why I opened an empty cmd window and pasted in what you wrote me before.
We appreciate it, but first we need to figure out what is causing the problem. Here is the procedure: Open a cmd window and then type:
Z:
CD \LC0
This will take you to the LC0 directory. Then run the command I gave earlier. I expect it will have the same problem. We can then try to modify the command to see if we can isolate the issue.
@borg323
I typed Z: and CD \LC0, and then the command to run that I got from you. I will tell you later when it crashes again. This is how it looks now:
Z:\LC0>Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --paralle lism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-par ams=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --tem p-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=red uction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle= true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pg n=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves- left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 -- moves-left-constant-factor=0.0 --training=true --weights=client-cache\fdf4c93b57 96723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9 | | | | | |_| v0.25.1+git.69105b4 built Apr 30 2020 id name Lc0 v0.25.1+git.69105b4 id author The LCZero Authors. Loading weights file from: client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a58 2ebf028ece402eb6fe50c3a9 Creating backend [multiplexing]... Creating backend [cudnn-fp16]... CUDA Runtime version: 10.0.0 Cudnn version: 7.4.2 Latest version of CUDA supported by the driver: 10.1.0 GPU: GeForce RTX 2080 Ti GPU memory: 11 Gb GPU clock frequency: 1545 MHz GPU compute capability: 7.5 gameready trainingfile Z:\LC0/data-hyepracghopp/game_000029.gz gameid 29 play_st art_ply 0 player1 white result blackwon moves b2b3 g7g5 b1c3 f7f5 g2g4 f5f4 e2e3 b8c6 d2d4 e8g6 f1e2 e7e6 d1b2 d7d5 e2d2 c6b4 c1a1 g6c2 a2a3 c2d1 d2d1 b4c6 g1g3 from_fen rnknbqrb/pppppppp/8/8/8/8/PPPPPPPP/RNKNBQRB w KQkq - 0 1 tournamentstatus P1: +0 -1 =0 LOS: 15.87% P1-W: +0 -1 =0 P1-B: +0 -0 =0 npm 600. 
875000 nodes 14421 moves 24 gameready trainingfile Z:\LC0/data-hyepracghopp/game_000001.gz gameid 1 play_sta rt_ply 0 player1 white result whitewon moves d2d3 c7c6 f1g3 f8e6 b2b4 d8c7 e2e4 a7a5 b4b5 g8f6 e4e5 f6d5 a2a4 e8g8 g1f3 d7d6 e5d6 c7d6 e1g1 b7b6 f1e1 a8b7 g3f5 b7c7 e1e6 d6h2 from_fen qrbbknnr/pppppppp/8/8/8/8/PPPPPPPP/QRBBKNNR w KQkq - 0 1
@borg323
No crash after 8 hours with gpu 0. I'm now running gpu 1 for 8 hours.
@borg323
No crash after 8 hours with gpu 1.
What does this mean, and what should I do next?
Can you leave the above command running, open a new command prompt, and run the same command again, but this time with gpu=0 changed to gpu=1? This way you can test running on both of your GPUs at the same time.
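To avoid editing the long command by hand, the GPU index can be swapped programmatically. This is only a sketch: the helper name `set_gpu` is hypothetical, and it assumes the GPU is selected through a `gpu=N` entry inside `--backend-opts`, as in the commands quoted in this thread.

```python
import re


def set_gpu(command: str, gpu: int) -> str:
    """Return the command with the gpu=N entry in --backend-opts replaced.

    Commands without a --backend-opts argument are returned unchanged.
    """
    def replace_in_opts(match: re.Match) -> str:
        # Only touch gpu=N inside the --backend-opts value itself.
        return re.sub(r"\bgpu=\d+", f"gpu={gpu}", match.group(0))

    return re.sub(r"--backend-opts=\S+", replace_in_opts, command)


base = r"Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --parallelism=32"
print(set_gpu(base, 1))
```

The first cmd window keeps the original command and the second one runs the rewritten command, so both GPUs are loaded at the same time without the client being involved.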
In any case, the client is unlikely to be causing this, but the way lc0 is called may be loading the GPU too much. Try lowering the parallelism to 4 (or less) to see if this helps.
I didn't see you trying these suggestions yet -- can you test whether the power spikes are still bad enough with lower parallelism to crash your PC?
The only way I could see the client being the cause is if your networking driver crashes from upload/downloads and it takes the GPU with it.
Or it may be some weird antivirus software reaction, having the same effect. The client doesn't do much more than downloading network files from the server, uploading results and running lc0.
@cn4750
When I open cmd and use this: client -run 3 -gpu 0 -gpu 1 -report-gpu -report-host -user (name) -password (name) Then the problem still happens.
When I open cmd and use this: Z: CD \LC0
and then that:
Z:\LC0\lc0.exe selfplay --backend-opts=backend=cudnn-fp16,gpu=0 --parallelism=32 --visits=10000 --cpuct=1.32 --cpuct-at-root=1.9 --root-has-own-cpuct-params=true --resign-percentage=4.0 --resign-playthrough=20 --temperature=0.8 --temp-endgame=0.30 --temp-cutoff-move=60 --temp-visit-offset=-0.8 --fpu-strategy=reduction --fpu-value=0.23 --fpu-strategy-at-root=absolute --fpu-value-at-root=1.0 --minimum-kldgain-per-node=0.000040 --policy-softmax-temp=1.4 --resign-wdlstyle=true --noise-epsilon=0.1 --noise-alpha=0.12 --sticky-endgames=true --openings-pgn=books/960fen.pgn --openings-mode=shuffled --moves-left-max-effect=0.2 --moves-left-threshold=0.0 --moves-left-slope=0.009 --moves-left-quadratic-factor=1.0 --moves-left-constant-factor=0.0 --training=true --weights=client-cache\fdf4c93b5796723fd1ec88b09dcc92474a727a582ebf028ece402eb6fe50c3a9
Then it looks like I have no problems.
But the first is client.exe and the second is lc0.exe. And I have no problems when using lc0.exe to play games or when using it from a GUI.
I also tested gpu 0 and gpu 1 at the same time in two cmd windows, and I have no problems when using lc0.exe. But I still have the problem when using client.exe.
@cn4750
Is there a way I can check whether the networking driver has crashed? Is it possible to upload less often? I think the occasional download should not be a problem, but it looks to me like the GPUs are producing material extremely fast, so the client uploads over and over. Maybe that takes the client or the GPUs down with it.
@Naphthalin
Do you have a command line for me that shows how it should look with parallelism 4 together with this: client -run 3 -gpu 0 -gpu 1 -report-gpu -report-host -user (name) -password (name)
@borg323
I'm using 360 Total Security as antivirus software. But the software would have asked me for a decision if it had found a virus or anything else.
And when I use two cmd windows:
client -run 3 -gpu 0 -report-gpu -report-host -user (name) -password (name)
client -run 3 -gpu 1 -report-gpu -report-host -user (name) -password (name)
then the problem still happens.
Z:
cd Z:\LC0
client -run 3 -gpu 0 -report-gpu -report-host -parallelism=4 -user (name) -password (name)

Z:
cd Z:\LC0
client -run 3 -gpu 1 -report-gpu -report-host -parallelism=4 -user (name) -password (name)
This works fine. I have no problems and no black screen after 9 hours of running both gpus. The only difference is that I use here -parallelism=4.
Does anyone have any idea what exactly caused the bug? Can it be solved somehow, or do I need to check parallelism values from 5 to 31 too?
Does anyone have any idea what exactly caused the bug?
by mooskagh, first post:
As mentioned in Discord, that's most likely either a PSU that is too weak for two GPUs or poor GPU cooling. Both are quite common in dual-GPU systems.
It's good to know that lower parallelism stabilizes the power demand enough. We basically use parallelism to load the GPU more, but apparently that puts too much variation on the PSU.
@Naphthalin
Lower parallelism didn't help. I used 4, 8, 16, 17, 18, 19, 20, 21, 22, 23, 24 and 32. I also repeated parallelism 4 and it crashed. Sometimes it crashes after 30 minutes and sometimes after up to 13 hours, and it doesn't matter which parallelism I'm using.
The PSU isn't too weak, because it's the best PSU money can buy and it can easily be used with 4 GPUs. There is also no poor cooling, because both GPUs reach only 50 degrees Celsius when I decrease the power limit. The GPUs can also run at 88 degrees Celsius without problems.
Any other ideas?
PSUs do deteriorate, so I wouldn't be so confident about it no matter what. Tensorflow and AI apps put a "spiky" load on it, and the minute it exceeds a threshold, your CPU will shut down. I recently had a very bad experience where I could do many things just fine, but when I tried to train a net it shut down within 30 minutes. The PSU had maybe an extra 200 watts of headroom, but that didn't help. Your case may be different, but monitoring power usage right before it goes blank may give clues.
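One possible way to do that monitoring is to poll `nvidia-smi` and append the readings to a file, so the last lines before a black screen survive the restart. This is a sketch under assumptions: `nvidia-smi` is on the PATH, the function names and the log path are hypothetical, and a one-second poll will not catch millisecond-scale spikes, so it can only show trends such as power draw or temperature climbing before the crash.

```python
import csv
import io
import os
import subprocess
import time
from typing import Dict, List

# Fields requested from nvidia-smi via --query-gpu.
QUERY_FIELDS = ["timestamp", "index", "power.draw", "temperature.gpu"]


def parse_samples(csv_text: str) -> List[Dict[str, str]]:
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) == len(QUERY_FIELDS):
            rows.append(dict(zip(QUERY_FIELDS, (cell.strip() for cell in row))))
    return rows


def sample_gpus() -> List[Dict[str, str]]:
    """Take one reading for every GPU by calling nvidia-smi once."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_samples(result.stdout)


def log_forever(path: str, interval_s: float = 1.0) -> None:
    """Append readings to `path` until interrupted (for example with Ctrl+C)."""
    with open(path, "a", encoding="utf-8") as log:
        while True:
            for sample in sample_gpus():
                log.write(",".join(sample[f] for f in QUERY_FIELDS) + "\n")
            # Flush to disk so the last readings survive a hard reset.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)
```

Calling `log_forever("gpu_power_log.csv")` in a separate cmd window before starting the client would leave a file whose last few lines show the state of both GPUs shortly before the screen went black.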
If you had crashes at 4, then higher values for parallelism are likely worse.
What I don't see from this thread: Did you try starting two separate clients for the two GPUs with --parallelism=4 and experience the same crashes? I don't know the technical details of the client, as it is always recommended to start one client per GPU, but it could theoretically be that the client isn't as sophisticated when distributing jobs between several GPUs.
Still, the cause of your crashes is 99% not software related, but comes from an apparently too unstable power demand of two GPUs, and the fact that your PSU is good doesn't necessarily mean that it is good enough for this extreme scenario.
When I start the client: client -run 3 -gpu 0 -gpu 1 -report-gpu -report-host -user (name) -password (name)
it works fine for a few minutes, but then the client causes a black screen. The machine is still running, but no games are played. Nothing else is possible and I need to restart the PC.
I have two identical GPUs, and no problems with the Fritz GUI or the ChessBase 15 GUI.
When running the client I see that it is using only one GPU and not both. How do I fix this? I can see it with MSI Afterburner and with GPU-Z.
Do I need parallelism? Are there other options I can also use from cmd?
Is there something like a logfile.txt when running the client?