Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Single machine multi-gpus ddp training stuck #19425

Closed (Blakey-Gavin closed this issue 6 months ago)

Blakey-Gavin commented 6 months ago

Bug description

I ran the code base from the repository below with almost no changes: https://github.com/bytedance/music_source_separation

When I first ran the code from that repository, my NVIDIA driver version was 470.223.02, and "libnccl-dev" and "libnccl2" were 2.11.4-1+cuda11.0. At that time, single-machine multi-GPU training ran normally.

However, after I upgraded the NVIDIA driver to 545.23.08, multi-GPU training on a single machine got stuck under otherwise identical conditions.

While stuck, GPU utilization stayed at 100%, as shown in the attached screenshot.

At the same time, CPU utilization was also at 100%, as shown in the attached screenshot.

What version are you seeing the problem on?

v1.8

How to reproduce the bug

No response

Error messages and logs

WORKSPACE=/media/fourT/gyh/mss/workspaces/bytesep
Global seed set to 42
root        : INFO     Namespace(config_yaml='./scripts/4_train/musdb18hq/configs/vocals-accompaniment,mobilenetse_fullband.yaml', filename='train', gpus=7, mode='train', workspace='/media/fourT/gyh/mss/workspaces/bytesep')
torch.distributed.nn.jit.instantiator: INFO     Created a temporary directory at /tmp/tmp1nbxmgz9
torch.distributed.nn.jit.instantiator: INFO     Writing /tmp/tmp1nbxmgz9/_remote_module_non_scriptable.py
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Global seed set to 42
root        : INFO     Namespace(config_yaml='./scripts/4_train/musdb18hq/configs/vocals-accompaniment,mobilenetse_fullband.yaml', filename='train', gpus=7, mode='train', workspace='/media/fourT/gyh/mss/workspaces/bytesep')
torch.distributed.nn.jit.instantiator: INFO     Created a temporary directory at /tmp/tmphl1oxcgf
torch.distributed.nn.jit.instantiator: INFO     Writing /tmp/tmphl1oxcgf/_remote_module_non_scriptable.py
Global seed set to 42
initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/7
Global seed set to 42
root        : INFO     Namespace(config_yaml='./scripts/4_train/musdb18hq/configs/vocals-accompaniment,mobilenetse_fullband.yaml', filename='train', gpus=7, mode='train', workspace='/media/fourT/gyh/mss/workspaces/bytesep')
torch.distributed.nn.jit.instantiator: INFO     Created a temporary directory at /tmp/tmpevsp0u4f
torch.distributed.nn.jit.instantiator: INFO     Writing /tmp/tmpevsp0u4f/_remote_module_non_scriptable.py
Global seed set to 42
initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/7
Global seed set to 42
root        : INFO     Namespace(config_yaml='./scripts/4_train/musdb18hq/configs/vocals-accompaniment,mobilenetse_fullband.yaml', filename='train', gpus=7, mode='train', workspace='/media/fourT/gyh/mss/workspaces/bytesep')
torch.distributed.nn.jit.instantiator: INFO     Created a temporary directory at /tmp/tmpkne414_4
torch.distributed.nn.jit.instantiator: INFO     Writing /tmp/tmpkne414_4/_remote_module_non_scriptable.py
Global seed set to 42
initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/7
Global seed set to 42
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/7
torch.distributed.distributed_c10d: INFO     Added key: store_based_barrier_key:1 to store for rank: 1
torch.distributed.distributed_c10d: INFO     Added key: store_based_barrier_key:1 to store for rank: 3
torch.distributed.distributed_c10d: INFO     Added key: store_based_barrier_key:1 to store for rank: 2
Global seed set to 42
root        : INFO     Namespace(config_yaml='./scripts/4_train/musdb18hq/configs/vocals-accompaniment,mobilenetse_fullband.yaml', filename='train', gpus=7, mode='train', workspace='/media/fourT/gyh/mss/workspaces/bytesep')
torch.distributed.nn.jit.instantiator: INFO     Created a temporary directory at /tmp/tmprvkky_7w
torch.distributed.nn.jit.instantiator: INFO     Writing /tmp/tmprvkky_7w/_remote_module_non_scriptable.py
Global seed set to 42
initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/7
torch.distributed.distributed_c10d: INFO     Added key: store_based_barrier_key:1 to store for rank: 4
Global seed set to 42
root        : INFO     Namespace(config_yaml='./scripts/4_train/musdb18hq/configs/vocals-accompaniment,mobilenetse_fullband.yaml', filename='train', gpus=7, mode='train', workspace='/media/fourT/gyh/mss/workspaces/bytesep')
torch.distributed.nn.jit.instantiator: INFO     Created a temporary directory at /tmp/tmpb9sx5on1
torch.distributed.nn.jit.instantiator: INFO     Writing /tmp/tmpb9sx5on1/_remote_module_non_scriptable.py
Global seed set to 42
initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/7
torch.distributed.distributed_c10d: INFO     Added key: store_based_barrier_key:1 to store for rank: 5
Global seed set to 42
root        : INFO     Namespace(config_yaml='./scripts/4_train/musdb18hq/configs/vocals-accompaniment,mobilenetse_fullband.yaml', filename='train', gpus=7, mode='train', workspace='/media/fourT/gyh/mss/workspaces/bytesep')
torch.distributed.nn.jit.instantiator: INFO     Created a temporary directory at /tmp/tmp1h0h_10h
torch.distributed.nn.jit.instantiator: INFO     Writing /tmp/tmp1h0h_10h/_remote_module_non_scriptable.py
Global seed set to 42
initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/7
torch.distributed.distributed_c10d: INFO     Added key: store_based_barrier_key:1 to store for rank: 6
torch.distributed.distributed_c10d: INFO     Added key: store_based_barrier_key:1 to store for rank: 0
torch.distributed.distributed_c10d: INFO     Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 7 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 7 processes
----------------------------------------------------------------------------------------------------

torch.distributed.distributed_c10d: INFO     Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 7 nodes.
torch.distributed.distributed_c10d: INFO     Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 7 nodes.
torch.distributed.distributed_c10d: INFO     Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 7 nodes.
torch.distributed.distributed_c10d: INFO     Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 7 nodes.
torch.distributed.distributed_c10d: INFO     Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 7 nodes.
torch.distributed.distributed_c10d: INFO     Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 7 nodes.
user-SYS-4029GP-TRT:34127:34127 [0] NCCL INFO Bootstrap : Using eno1:192.168.1.133<0>
user-SYS-4029GP-TRT:34127:34127 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

user-SYS-4029GP-TRT:34127:34127 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
user-SYS-4029GP-TRT:34127:34127 [0] NCCL INFO NET/Socket : Using [0]eno1:192.168.1.133<0>
user-SYS-4029GP-TRT:34127:34127 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
user-SYS-4029GP-TRT:34596:34596 [4] NCCL INFO Bootstrap : Using eno1:192.168.1.133<0>
user-SYS-4029GP-TRT:34430:34430 [3] NCCL INFO Bootstrap : Using eno1:192.168.1.133<0>
user-SYS-4029GP-TRT:34596:34596 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

user-SYS-4029GP-TRT:34596:34596 [4] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
user-SYS-4029GP-TRT:34596:34596 [4] NCCL INFO NET/Socket : Using [0]eno1:192.168.1.133<0>
user-SYS-4029GP-TRT:34596:34596 [4] NCCL INFO Using network Socket
user-SYS-4029GP-TRT:34770:34770 [6] NCCL INFO Bootstrap : Using eno1:192.168.1.133<0>
user-SYS-4029GP-TRT:34430:34430 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

user-SYS-4029GP-TRT:34430:34430 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
user-SYS-4029GP-TRT:34430:34430 [3] NCCL INFO NET/Socket : Using [0]eno1:192.168.1.133<0>
user-SYS-4029GP-TRT:34430:34430 [3] NCCL INFO Using network Socket
user-SYS-4029GP-TRT:34770:34770 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

user-SYS-4029GP-TRT:34770:34770 [6] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
user-SYS-4029GP-TRT:34770:34770 [6] NCCL INFO NET/Socket : Using [0]eno1:192.168.1.133<0>
user-SYS-4029GP-TRT:34770:34770 [6] NCCL INFO Using network Socket
user-SYS-4029GP-TRT:34683:34683 [5] NCCL INFO Bootstrap : Using eno1:192.168.1.133<0>
user-SYS-4029GP-TRT:34683:34683 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

user-SYS-4029GP-TRT:34683:34683 [5] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
user-SYS-4029GP-TRT:34683:34683 [5] NCCL INFO NET/Socket : Using [0]eno1:192.168.1.133<0>
user-SYS-4029GP-TRT:34683:34683 [5] NCCL INFO Using network Socket
user-SYS-4029GP-TRT:34364:34364 [2] NCCL INFO Bootstrap : Using eno1:192.168.1.133<0>
user-SYS-4029GP-TRT:34364:34364 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

user-SYS-4029GP-TRT:34364:34364 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
user-SYS-4029GP-TRT:34364:34364 [2] NCCL INFO NET/Socket : Using [0]eno1:192.168.1.133<0>
user-SYS-4029GP-TRT:34364:34364 [2] NCCL INFO Using network Socket
user-SYS-4029GP-TRT:34310:34310 [1] NCCL INFO Bootstrap : Using eno1:192.168.1.133<0>
user-SYS-4029GP-TRT:34310:34310 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

user-SYS-4029GP-TRT:34310:34310 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
user-SYS-4029GP-TRT:34310:34310 [1] NCCL INFO NET/Socket : Using [0]eno1:192.168.1.133<0>
user-SYS-4029GP-TRT:34310:34310 [1] NCCL INFO Using network Socket
user-SYS-4029GP-TRT:34683:35140 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
user-SYS-4029GP-TRT:34683:35140 [5] NCCL INFO Setting affinity for GPU 5 to 3ff003ff
user-SYS-4029GP-TRT:34770:35139 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 [1] -1/-1/-1->6->5
user-SYS-4029GP-TRT:34770:35139 [6] NCCL INFO Setting affinity for GPU 6 to 3ff003ff
user-SYS-4029GP-TRT:34127:35136 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6
user-SYS-4029GP-TRT:34127:35136 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6
user-SYS-4029GP-TRT:34127:35136 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
user-SYS-4029GP-TRT:34310:35145 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
user-SYS-4029GP-TRT:34127:35136 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff
user-SYS-4029GP-TRT:34310:35145 [1] NCCL INFO Setting affinity for GPU 1 to 3ff003ff
user-SYS-4029GP-TRT:34430:35138 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
user-SYS-4029GP-TRT:34430:35138 [3] NCCL INFO Setting affinity for GPU 3 to 3ff003ff
user-SYS-4029GP-TRT:34364:35141 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
user-SYS-4029GP-TRT:34596:35137 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
user-SYS-4029GP-TRT:34364:35141 [2] NCCL INFO Setting affinity for GPU 2 to 3ff003ff
user-SYS-4029GP-TRT:34596:35137 [4] NCCL INFO Setting affinity for GPU 4 to 3ff003ff
user-SYS-4029GP-TRT:34430:35138 [3] NCCL INFO Channel 00 : 3[1e000] -> 4[3e000] via direct shared memory
user-SYS-4029GP-TRT:34364:35141 [2] NCCL INFO Channel 00 : 2[1d000] -> 3[1e000] via P2P/IPC
user-SYS-4029GP-TRT:34770:35139 [6] NCCL INFO Channel 00 : 6[40000] -> 0[1b000] via direct shared memory
user-SYS-4029GP-TRT:34310:35145 [1] NCCL INFO Channel 00 : 1[1c000] -> 2[1d000] via P2P/IPC
user-SYS-4029GP-TRT:34430:35138 [3] NCCL INFO Channel 01 : 3[1e000] -> 4[3e000] via direct shared memory
user-SYS-4029GP-TRT:34364:35141 [2] NCCL INFO Channel 01 : 2[1d000] -> 3[1e000] via P2P/IPC
user-SYS-4029GP-TRT:34310:35145 [1] NCCL INFO Channel 01 : 1[1c000] -> 2[1d000] via P2P/IPC
user-SYS-4029GP-TRT:34770:35139 [6] NCCL INFO Channel 01 : 6[40000] -> 0[1b000] via direct shared memory
user-SYS-4029GP-TRT:34683:35140 [5] NCCL INFO Channel 00 : 5[3f000] -> 6[40000] via P2P/IPC
user-SYS-4029GP-TRT:34683:35140 [5] NCCL INFO Channel 01 : 5[3f000] -> 6[40000] via P2P/IPC
user-SYS-4029GP-TRT:34364:35141 [2] NCCL INFO Connected all rings
user-SYS-4029GP-TRT:34364:35141 [2] NCCL INFO Channel 00 : 2[1d000] -> 1[1c000] via P2P/IPC
user-SYS-4029GP-TRT:34364:35141 [2] NCCL INFO Channel 01 : 2[1d000] -> 1[1c000] via P2P/IPC
user-SYS-4029GP-TRT:34596:35137 [4] NCCL INFO Channel 00 : 4[3e000] -> 5[3f000] via P2P/IPC
user-SYS-4029GP-TRT:34596:35137 [4] NCCL INFO Channel 01 : 4[3e000] -> 5[3f000] via P2P/IPC
user-SYS-4029GP-TRT:34127:35136 [0] NCCL INFO Channel 00 : 0[1b000] -> 1[1c000] via P2P/IPC
user-SYS-4029GP-TRT:34596:35137 [4] NCCL INFO Connected all rings
user-SYS-4029GP-TRT:34127:35136 [0] NCCL INFO Channel 01 : 0[1b000] -> 1[1c000] via P2P/IPC
user-SYS-4029GP-TRT:34683:35140 [5] NCCL INFO Connected all rings
user-SYS-4029GP-TRT:34596:35137 [4] NCCL INFO Channel 00 : 4[3e000] -> 3[1e000] via direct shared memory
user-SYS-4029GP-TRT:34596:35137 [4] NCCL INFO Channel 01 : 4[3e000] -> 3[1e000] via direct shared memory
user-SYS-4029GP-TRT:34127:35136 [0] NCCL INFO Connected all rings
user-SYS-4029GP-TRT:34430:35138 [3] NCCL INFO Connected all rings
user-SYS-4029GP-TRT:34310:35145 [1] NCCL INFO Connected all rings
user-SYS-4029GP-TRT:34683:35140 [5] NCCL INFO Channel 00 : 5[3f000] -> 4[3e000] via P2P/IPC
user-SYS-4029GP-TRT:34683:35140 [5] NCCL INFO Channel 01 : 5[3f000] -> 4[3e000] via P2P/IPC
user-SYS-4029GP-TRT:34310:35145 [1] NCCL INFO Channel 00 : 1[1c000] -> 0[1b000] via P2P/IPC
user-SYS-4029GP-TRT:34310:35145 [1] NCCL INFO Channel 01 : 1[1c000] -> 0[1b000] via P2P/IPC
user-SYS-4029GP-TRT:34127:35136 [0] NCCL INFO Connected all trees
user-SYS-4029GP-TRT:34127:35136 [0] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 8/8/512
user-SYS-4029GP-TRT:34127:35136 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
user-SYS-4029GP-TRT:34310:35145 [1] NCCL INFO Connected all trees
user-SYS-4029GP-TRT:34310:35145 [1] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 8/8/512
user-SYS-4029GP-TRT:34310:35145 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
user-SYS-4029GP-TRT:34770:35139 [6] NCCL INFO Connected all rings
user-SYS-4029GP-TRT:34770:35139 [6] NCCL INFO Channel 00 : 6[40000] -> 5[3f000] via P2P/IPC
user-SYS-4029GP-TRT:34770:35139 [6] NCCL INFO Channel 01 : 6[40000] -> 5[3f000] via P2P/IPC
user-SYS-4029GP-TRT:34770:35139 [6] NCCL INFO Connected all trees
user-SYS-4029GP-TRT:34770:35139 [6] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 8/8/512
user-SYS-4029GP-TRT:34770:35139 [6] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
user-SYS-4029GP-TRT:34683:35140 [5] NCCL INFO Connected all trees
user-SYS-4029GP-TRT:34683:35140 [5] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 8/8/512
user-SYS-4029GP-TRT:34683:35140 [5] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
user-SYS-4029GP-TRT:34430:35138 [3] NCCL INFO Channel 00 : 3[1e000] -> 2[1d000] via P2P/IPC
user-SYS-4029GP-TRT:34430:35138 [3] NCCL INFO Channel 01 : 3[1e000] -> 2[1d000] via P2P/IPC
user-SYS-4029GP-TRT:34430:35138 [3] NCCL INFO Connected all trees
user-SYS-4029GP-TRT:34364:35141 [2] NCCL INFO Connected all trees
user-SYS-4029GP-TRT:34430:35138 [3] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 8/8/512
user-SYS-4029GP-TRT:34430:35138 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
user-SYS-4029GP-TRT:34364:35141 [2] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 8/8/512
user-SYS-4029GP-TRT:34364:35141 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
user-SYS-4029GP-TRT:34596:35137 [4] NCCL INFO Connected all trees
user-SYS-4029GP-TRT:34596:35137 [4] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 8/8/512
user-SYS-4029GP-TRT:34596:35137 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
user-SYS-4029GP-TRT:34430:35138 [3] NCCL INFO comm 0x7fb9c8003010 rank 3 nranks 7 cudaDev 3 busId 1e000 - Init COMPLETE
user-SYS-4029GP-TRT:34596:35137 [4] NCCL INFO comm 0x7f06a8003010 rank 4 nranks 7 cudaDev 4 busId 3e000 - Init COMPLETE
user-SYS-4029GP-TRT:34364:35141 [2] NCCL INFO comm 0x7f2fd8003010 rank 2 nranks 7 cudaDev 2 busId 1d000 - Init COMPLETE
user-SYS-4029GP-TRT:34310:35145 [1] NCCL INFO comm 0x7f4e68003010 rank 1 nranks 7 cudaDev 1 busId 1c000 - Init COMPLETE
user-SYS-4029GP-TRT:34770:35139 [6] NCCL INFO comm 0x7f4f98003010 rank 6 nranks 7 cudaDev 6 busId 40000 - Init COMPLETE
user-SYS-4029GP-TRT:34683:35140 [5] NCCL INFO comm 0x7f391c003010 rank 5 nranks 7 cudaDev 5 busId 3f000 - Init COMPLETE
user-SYS-4029GP-TRT:34127:35136 [0] NCCL INFO comm 0x7f4a3c003010 rank 0 nranks 7 cudaDev 0 busId 1b000 - Init COMPLETE
user-SYS-4029GP-TRT:34127:34127 [0] NCCL INFO Launch mode Parallel

Environment

Current environment * CUDA: - GPU: - NVIDIA GeForce RTX 3090 - NVIDIA GeForce RTX 3090 - NVIDIA GeForce RTX 3090 - NVIDIA GeForce RTX 3090 - NVIDIA GeForce RTX 3090 - NVIDIA GeForce RTX 3090 - NVIDIA GeForce RTX 3090 - NVIDIA GeForce RTX 3090 - available: True - version: 11.3 * Lightning: - lightning-utilities: 0.10.1 - pytorch-lightning: 1.5.0 - torch: 1.12.1+cu113 - torchaudio: 0.12.1+cu113 - torchinfo: 1.5.3 - torchlibrosa: 0.0.9 - torchmetrics: 1.3.0.post0 - torchvision: 0.13.1+cu113 * Packages: - absl-py: 1.2.0 - aiohttp: 3.8.3 - aiosignal: 1.2.0 - appdirs: 1.4.4 - async-timeout: 4.0.2 - attrs: 22.1.0 - audioread: 3.0.0 - cachetools: 5.2.0 - certifi: 2022.9.24 - cffi: 1.15.1 - charset-normalizer: 2.1.1 - cycler: 0.11.0 - datasets: 2.10.0 - decorator: 5.1.1 - dill: 0.3.6 - einops: 0.3.2 - ffmpeg-python: 0.2.0 - ffprobe: 0.5 - filelock: 3.9.0 - flatbuffers: 23.1.21 - fonttools: 4.37.4 - frozenlist: 1.3.1 - fsspec: 2022.8.2 - future: 0.18.2 - google-auth: 2.12.0 - google-auth-oauthlib: 0.4.6 - grpcio: 1.49.1 - h5py: 3.7.0 - huggingface-hub: 0.12.1 - idna: 3.4 - importlib-metadata: 5.0.0 - importlib-resources: 5.10.0 - joblib: 1.1.0 - jsonschema: 4.16.0 - kiwisolver: 1.4.4 - librosa: 0.9.2 - lightning-utilities: 0.10.1 - llvmlite: 0.40.1 - markdown: 3.4.1 - markupsafe: 2.1.1 - matplotlib: 3.5.0 - mkl-fft: 1.3.1 - mkl-random: 1.2.2 - mkl-service: 2.4.0 - mnn: 2.8.1 - more-itertools: 9.0.0 - mpmath: 1.3.0 - multidict: 6.0.2 - multiprocess: 0.70.14 - musdb: 0.4.0 - museval: 0.4.0 - natsort: 8.0.0 - numba: 0.57.0 - numpy: 1.22.0 - oauthlib: 3.2.1 - olefile: 0.46 - onnx: 1.13.0 - onnxruntime-gpu: 1.10.0 - opencv-python: 4.6.0.66 - packaging: 21.3 - pandas: 1.5.0 - pesq: 0.0.3 - pillow: 9.4.0 - pip: 22.2.2 - pkgutil-resolve-name: 1.3.10 - pooch: 1.6.0 - protobuf: 3.20.3 - pyaml: 21.10.1 - pyarrow: 11.0.0 - pyasn1: 0.4.8 - pyasn1-modules: 0.2.8 - pycparser: 2.21 - pydeprecate: 0.3.1 - pyparsing: 3.0.9 - pyrsistent: 0.18.1 - python-dateutil: 2.8.2 - pytorch-lightning: 1.5.0 - pytz: 2022.4 - pyyaml: 6.0 - regex: 2022.10.31 - requests: 2.28.1 - requests-oauthlib: 1.3.1 - resampy: 0.4.2 - responses: 0.18.0 - rsa: 4.9 - scikit-learn: 1.1.2 - scipy: 1.8.0 - setuptools: 59.8.0 - setuptools-scm: 7.0.5 - simplejson: 3.17.6 - six: 1.16.0 - soundfile: 0.10.3.post1 - stempeg: 0.2.3 - sympy: 1.12 - tensorboard: 2.11.0 - tensorboard-data-server: 0.6.1 - tensorboard-plugin-wit: 1.8.1 - threadpoolctl: 3.1.0 - tokenizers: 0.13.2 - tomli: 2.0.1 - torch: 1.12.1+cu113 - torchaudio: 0.12.1+cu113 - torchinfo: 1.5.3 - torchlibrosa: 0.0.9 - torchmetrics: 1.3.0.post0 - torchvision: 0.13.1+cu113 - tqdm: 4.64.1 - transformers: 4.26.1 - typing-extensions: 4.4.0 - urllib3: 1.26.12 - webrtcvad: 2.0.10 - werkzeug: 2.2.2 - wheel: 0.37.1 - xxhash: 3.2.0 - yarl: 1.8.1 - zipp: 3.9.0 * System: - OS: Linux - architecture: - 64bit - ELF - processor: x86_64 - python: 3.8.18 - release: 5.15.0-92-generic - version: #102~20.04.1-Ubuntu SMP Mon Jan 15 13:09:14 UTC 2024

More info

Note: the version selection above is required, so I picked one at random. My actual Lightning version is 1.5.0.

awaelchli commented 6 months ago

@Blakey-Gavin The project you linked is 3 years old, and the version pinned there is pytorch_lightning==1.2.1. We no longer maintain such an old version, so I won't be able to help with it. Try upgrading to pytorch_lightning==1.2.10; maybe it was fixed there.

awaelchli commented 6 months ago

@Blakey-Gavin Is the code in bytedance/music_source_separation actually made to support multi-GPU? The examples there suggest that you would only run on one device. Would you mind opening an issue there and asking about this?

TidalPaladin commented 6 months ago

I've encountered a similar issue after updating drivers/CUDA. I was trying to add mamba to a working pipeline but was having issues with NVCC/CUDA being out of date. I followed these instructions to update CUDA, which allowed me to build mamba's dependencies. After the update, multi-GPU training with DDP would hang on startup. I initially thought this was a mamba issue, but it affected previously working models and persisted after re-creating the virtual environment.

I can reproduce the issue with BoringModel in my virtual environment:

import pytorch_lightning as L
from pytorch_lightning.demos.boring_classes import BoringModel

# Minimal repro: 2 GPUs with the DDP strategy; hangs right after the processes register.
model = BoringModel()
trainer = L.Trainer(max_epochs=10, devices=2, strategy="ddp", logger=None)
trainer.fit(model)

Hangs with

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/chase/.local/share/pdm/venvs/mammo-density-oNBH7oU9-mammo_density/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Missing logger folder: /home/chase/Documents/mammo-density/lightning_logs
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

Missing logger folder: /home/chase/Documents/mammo-density/lightning_logs

Here are the driver / CUDA versions for that system:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:21:00.0 Off |                  N/A |
| 36%   38C    P8              23W / 350W |     42MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:4C:00.0  On |                  N/A |
|  0%   36C    P8              39W / 390W |     28MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2182      G   /usr/lib/xorg/Xorg                           21MiB |
|    0   N/A  N/A      2703      G   /usr/bin/gnome-shell                          9MiB |
|    1   N/A  N/A      2182      G   /usr/lib/xorg/Xorg                           21MiB |
+---------------------------------------------------------------------------------------+

Unfortunately, I do not know the driver/CUDA versions prior to the update. However, I do have a separate system on which BoringModel runs fine in an identical virtual environment:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:21:00.0 Off |                  N/A |
|  0%   25C    P8              22W / 350W |      2MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:4C:00.0 Off |                  N/A |
|  0%   25C    P8              21W / 350W |      2MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Details for hanging system (Ubuntu 22.04.3):

Current environment * CUDA: - GPU: - NVIDIA GeForce RTX 3090 - NVIDIA GeForce RTX 3090 - available: True - version: 12.1 * Lightning: - lightning-bolts: 0.6.0.post1 - lightning-utilities: 0.10.1 - pytorch-lightning: 2.2.0 - torch: 2.2.0 - torch-dicom: 0.1.dev65+g6e1ffff - torchmetrics: 1.3.1 - torchvision: 0.17.0 * Packages: - aiohttp: 3.9.3 - aiosignal: 1.3.1 - albumentations: 1.3.1 - antlr4-python3-runtime: 4.9.3 - appdirs: 1.4.4 - attrs: 23.2.0 - autoflake: 2.2.1 - autopep8: 2.0.4 - black: 24.1.1 - certifi: 2024.2.2 - charset-normalizer: 3.3.2 - click: 8.1.7 - colorama: 0.4.6 - contourpy: 1.2.0 - coverage: 7.4.1 - cycler: 0.12.1 - deep-helpers: 0.1.dev56+g85ab3ac - dicom-anonymizer: 1.0.7 - dicom-utils: 0.1.dev107+gd0bc243 - docker-pycreds: 0.4.0 - docstring-parser: 0.15 - einops: 0.7.0 - fancycompleter: 0.9.1 - filelock: 3.13.1 - flake8: 7.0.0 - fonttools: 4.48.1 - frozenlist: 1.4.1 - fsspec: 2024.2.0 - gitdb: 4.0.11 - gitpython: 3.1.41 - huggingface-hub: 0.20.3 - idna: 3.6 - imageio: 2.33.1 - importlib-resources: 6.1.1 - iniconfig: 2.0.0 - isort: 5.13.2 - jinja2: 3.1.3 - joblib: 1.3.2 - jsonargparse: 4.27.5 - kiwisolver: 1.4.5 - lazy-loader: 0.3 - lightning-bolts: 0.6.0.post1 - lightning-utilities: 0.10.1 - mammo-density: 0.1.dev12+gca376d7 - markupsafe: 2.1.5 - matplotlib: 3.8.2 - mccabe: 0.7.0 - monai: 1.3.0 - mpmath: 1.3.0 - multidict: 6.0.5 - mypy-extensions: 1.0.0 - networkx: 3.2.1 - numpy: 1.26.4 - nvidia-cublas-cu12: 12.1.3.1 - nvidia-cuda-cupti-cu12: 12.1.105 - nvidia-cuda-nvrtc-cu12: 12.1.105 - nvidia-cuda-runtime-cu12: 12.1.105 - nvidia-cudnn-cu12: 8.9.2.26 - nvidia-cufft-cu12: 11.0.2.54 - nvidia-curand-cu12: 10.3.2.106 - nvidia-cusolver-cu12: 11.4.5.107 - nvidia-cusparse-cu12: 12.1.0.106 - nvidia-nccl-cu12: 2.19.3 - nvidia-nvjitlink-cu12: 12.3.101 - nvidia-nvtx-cu12: 12.1.105 - omegaconf: 2.3.0 - opencv-python: 4.9.0.80 - opencv-python-headless: 4.9.0.80 - packaging: 23.2 - pandas: 2.2.0 - pathspec: 0.12.1 - pdbpp: 0.10.3 - pillow: 10.2.0 - platformdirs: 4.2.0 - pluggy: 1.4.0 - protobuf: 4.25.2 - psutil: 5.9.8 - pycodestyle: 2.11.1 - pydicom: 2.4.4 - pyflakes: 3.2.0 - pygments: 2.17.2 - pylibjpeg: 2.0.0 - pylibjpeg-libjpeg: 2.0.2 - pylibjpeg-openjpeg: 2.1.1 - pyparsing: 3.1.1 - pyrepl: 0.9.0 - pytest: 8.0.0 - pytest-cov: 4.1.0 - pytest-mock: 3.12.0 - python-dateutil: 2.8.2 - pytorch-lightning: 2.2.0 - pytz: 2024.1 - pyyaml: 6.0.1 - qudida: 0.0.4 - registry: 0.1.1.dev15+g9eed0e5 - requests: 2.31.0 - safetensors: 0.4.2 - scikit-image: 0.22.0 - scikit-learn: 1.4.0 - scipy: 1.12.0 - sentry-sdk: 1.40.2 - setproctitle: 1.3.3 - setuptools: 69.0.3 - six: 1.16.0 - smmap: 5.0.1 - strenum: 0.4.15 - sympy: 1.12 - threadpoolctl: 3.2.0 - tifffile: 2024.1.30 - timm: 0.9.12 - torch: 2.2.0 - torch-dicom: 0.1.dev65+g6e1ffff - torchmetrics: 1.3.1 - torchvision: 0.17.0 - tqdm: 4.66.1 - tqdm-multiprocessing: 0.1.0 - triton: 2.2.0 - typeshed-client: 2.4.0 - typing-extensions: 4.9.0 - tzdata: 2023.4 - urllib3: 2.2.0 - wandb: 0.16.3 - wmctrl: 0.5 - yarl: 1.9.4 * System: - OS: Linux - architecture: - 64bit - ELF - processor: x86_64 - python: 3.11.7 - release: 5.15.0-94-generic - version: #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024

Details for working system:

Current environment * CUDA: - GPU: - NVIDIA GeForce RTX 3090 - NVIDIA GeForce RTX 3090 - available: True - version: 12.1 * Lightning: - lightning-bolts: 0.6.0.post1 - lightning-utilities: 0.10.1 - pytorch-lightning: 2.2.0 - torch: 2.2.0 - torch-dicom: 0.1.dev65+g6e1ffff - torchmetrics: 1.3.1 - torchvision: 0.17.0 * Packages: - aiohttp: 3.9.3 - aiosignal: 1.3.1 - albumentations: 1.3.1 - antlr4-python3-runtime: 4.9.3 - appdirs: 1.4.4 - attrs: 23.2.0 - autoflake: 2.2.1 - autopep8: 2.0.4 - black: 24.1.1 - certifi: 2024.2.2 - charset-normalizer: 3.3.2 - click: 8.1.7 - colorama: 0.4.6 - contourpy: 1.2.0 - coverage: 7.4.1 - cycler: 0.12.1 - deep-helpers: 0.1.dev56+g85ab3ac - dicom-anonymizer: 1.0.7 - dicom-utils: 0.1.dev107+gd0bc243 - docker-pycreds: 0.4.0 - docstring-parser: 0.15 - einops: 0.7.0 - fancycompleter: 0.9.1 - filelock: 3.13.1 - flake8: 7.0.0 - fonttools: 4.48.1 - frozenlist: 1.4.1 - fsspec: 2024.2.0 - gitdb: 4.0.11 - gitpython: 3.1.41 - huggingface-hub: 0.20.3 - idna: 3.6 - imageio: 2.33.1 - importlib-resources: 6.1.1 - iniconfig: 2.0.0 - isort: 5.13.2 - jinja2: 3.1.3 - joblib: 1.3.2 - jsonargparse: 4.27.5 - kiwisolver: 1.4.5 - lazy-loader: 0.3 - lightning-bolts: 0.6.0.post1 - lightning-utilities: 0.10.1 - mammo-density: 0.1.dev13+g8760e3d.d20240214 - markupsafe: 2.1.5 - matplotlib: 3.8.2 - mccabe: 0.7.0 - monai: 1.3.0 - mpmath: 1.3.0 - multidict: 6.0.5 - mypy-extensions: 1.0.0 - networkx: 3.2.1 - numpy: 1.26.4 - nvidia-cublas-cu12: 12.1.3.1 - nvidia-cuda-cupti-cu12: 12.1.105 - nvidia-cuda-nvrtc-cu12: 12.1.105 - nvidia-cuda-runtime-cu12: 12.1.105 - nvidia-cudnn-cu12: 8.9.2.26 - nvidia-cufft-cu12: 11.0.2.54 - nvidia-curand-cu12: 10.3.2.106 - nvidia-cusolver-cu12: 11.4.5.107 - nvidia-cusparse-cu12: 12.1.0.106 - nvidia-nccl-cu12: 2.19.3 - nvidia-nvjitlink-cu12: 12.3.101 - nvidia-nvtx-cu12: 12.1.105 - omegaconf: 2.3.0 - opencv-python: 4.9.0.80 - opencv-python-headless: 4.9.0.80 - packaging: 23.2 - pandas: 2.2.0 - pathspec: 0.12.1 - pdbpp: 0.10.3 - pillow: 10.2.0 - platformdirs: 4.2.0 - pluggy: 1.4.0 - protobuf: 4.25.2 - psutil: 5.9.8 - pycodestyle: 2.11.1 - pydicom: 2.4.4 - pyflakes: 3.2.0 - pygments: 2.17.2 - pylibjpeg: 2.0.0 - pylibjpeg-libjpeg: 2.0.2 - pylibjpeg-openjpeg: 2.1.1 - pyparsing: 3.1.1 - pyrepl: 0.9.0 - pytest: 8.0.0 - pytest-cov: 4.1.0 - pytest-mock: 3.12.0 - python-dateutil: 2.8.2 - pytorch-lightning: 2.2.0 - pytz: 2024.1 - pyyaml: 6.0.1 - qudida: 0.0.4 - registry: 0.1.1.dev15+g9eed0e5 - requests: 2.31.0 - safetensors: 0.4.2 - scikit-image: 0.22.0 - scikit-learn: 1.4.0 - scipy: 1.12.0 - sentry-sdk: 1.40.2 - setproctitle: 1.3.3 - setuptools: 69.0.3 - six: 1.16.0 - smmap: 5.0.1 - strenum: 0.4.15 - sympy: 1.12 - threadpoolctl: 3.2.0 - tifffile: 2024.1.30 - timm: 0.9.12 - torch: 2.2.0 - torch-dicom: 0.1.dev65+g6e1ffff - torchmetrics: 1.3.1 - torchvision: 0.17.0 - tqdm: 4.66.1 - tqdm-multiprocessing: 0.1.0 - triton: 2.2.0 - typeshed-client: 2.4.0 - typing-extensions: 4.9.0 - tzdata: 2023.4 - urllib3: 2.2.0 - wandb: 0.16.3 - wmctrl: 0.5 - yarl: 1.9.4 * System: - OS: Linux - architecture: - 64bit - ELF - processor: - python: 3.11.5 - release: 6.5.7-artix1-1 - version: #1 SMP PREEMPT_DYNAMIC Sun, 15 Oct 2023 22:13:26 +0000

awaelchli commented 6 months ago

@TidalPaladin The fact that it hangs with such a simple BoringModel must mean it will also hang with a plain PyTorch distributed example.

Try one of the PyTorch multi-GPU examples, for example https://github.com/pytorch/examples/tree/main/distributed/ddp (I think for that one you can just run main.py).
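
If it is easier, the same thing can be checked without the example repo. The sketch below is not from that repository; it is a minimal NCCL sanity check using only torch.distributed (the file name ddp_check.py is a placeholder):

# ddp_check.py - minimal NCCL sanity check, launched with: torchrun --nproc_per_node=2 ddp_check.py
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    # Same backend that Lightning's DDP strategy uses on GPUs
    dist.init_process_group(backend="nccl")
    x = torch.ones(1, device=f"cuda:{local_rank}")
    # If this all_reduce hangs, the problem is NCCL / the driver, not Lightning
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}: all_reduce ok, value = {x.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()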

When you run these scripts, also enable NCCL_DEBUG=INFO so NCCL prints what it is doing.

You will most likely see the same hang, which would show that the problem is with your system rather than with Lightning. A common cause is that GPU peer-to-peer (P2P) transfers are broken. To confirm this, set NCCL_P2P_DISABLE=1: if training then runs, it will be much slower, but it confirms that P2P is the problem.
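
For reference, a sketch of one way to set both variables; exporting them in the shell before launching works equally well, the only requirement is that they are set before the NCCL process group is created:

import os

# Must be set before torch.distributed / NCCL initializes (i.e. before trainer.fit is called)
os.environ["NCCL_DEBUG"] = "INFO"      # verbose NCCL logging
os.environ["NCCL_P2P_DISABLE"] = "1"   # diagnostic only: avoid the P2P transport entirely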

awaelchli commented 6 months ago

Also, I don't know why you needed a new driver, but since you said 535.113.01 works and 545.23.08 doesn't, try a version in between that is still new enough for your needs. The fact that you were able to isolate it to the driver is good. If none of the above helps, I suggest posting in the PyTorch forums.

TidalPaladin commented 6 months ago

@awaelchli Thank you for your help. Setting NCCL_P2P_DISABLE=1 did indeed resolve the problem. I was able to revert to 535.154.05 + CUDA 12.2 and now everything works normally.
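
For anyone else hitting this, a rough sketch of how P2P reachability could be checked from PyTorch itself (the check_p2p helper is made up for illustration; nvidia-smi topo -m shows the corresponding topology from the system side):

import torch

# Print whether each GPU pair reports peer-to-peer access.
# If pairs report True here but NCCL still hangs, the P2P path itself is likely broken.
def check_p2p():
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'unavailable'}")

if __name__ == "__main__":
    check_p2p()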

awaelchli commented 6 months ago

@Blakey-Gavin Can you try this suggestion as well?

Blakey-Gavin commented 6 months ago

@Blakey-Gavin The project you linked is 3 years old, and the version pinned there is pytorch_lightning==1.2.1. We no longer maintain such an old version, so I won't be able to help with it. Try upgrading to pytorch_lightning==1.2.10; maybe it was fixed there.

Sorry for the late reply, I'll try your suggestions, thanks.

Blakey-Gavin commented 6 months ago

@Blakey-Gavin Is the code in bytedance/music_source_separation actually made to support multi-GPU? The examples there suggest that you would only run on one device. Would you mind opening an issue there and asking about this?

Yes, I have used this project before. Before the driver upgrade, if I configured the environment according to the "requirements.txt" it provides, multi-GPU training worked without problems. After the driver upgrade, however, training got stuck.

Blakey-Gavin commented 6 months ago

@Blakey-Gavin Can you try this suggestion as well?

I tried this (NCCL_P2P_DISABLE=1) before opening this issue, and it does temporarily work around the problem. However, besides the slowdown, the main GPU also uses more memory than the other GPUs, so it effectively trades one problem for another.
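
One common cause of that kind of imbalance, independent of NCCL_P2P_DISABLE and only a guess without seeing the training code, is non-zero ranks creating an extra CUDA context on GPU 0, for example when a checkpoint is loaded without a map_location. A minimal sketch of the usual mitigation ("checkpoint.pt" is a placeholder path):

import os

import torch

# Pin each process to its own GPU before any other CUDA call,
# so stray allocations do not all land on GPU 0.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# When loading weights manually, map them to the local device instead of the default cuda:0.
state = torch.load("checkpoint.pt", map_location=f"cuda:{local_rank}")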