I am trying to pretrain the model with train_pubchem_light.py on multiple GPUs across multiple nodes. I run the code with some modifications to adapt the launch setup from LSF to SLURM. For a multi-node, multi-GPU run I get the following log:
fused_weight_gradient_mlp_cuda module not found. gradient accumulation fusion with weight gradient computation disabled.
Using custom data configuration default-aa670d5a2f66e342
Reusing dataset pub_chem (/tmp/silabrata.93/pubchem/pub_chem/default-aa670d5a2f66e342/0.0.0/a6cf7324273df2f2ba223646cb9ecbb36c120aab97c28e8378a3afe9dca94289)
['g073.cluster', 'g074.cluster']
g073.cluster
['g073.cluster']
g073.cluster MASTER_ADDR: g073.cluster
g073.cluster MASTER_PORT: 54966
g073.cluster NODE_RANK 0
g073.cluster NCCL_SOCKET_IFNAME: ib0
g073.cluster NCCL_DEBUG: INFO
g073.cluster NCCL_IB_CUDA_SUPPORT: 1
Using 2 Nodes---------------------------------------------------------------------
Using 2 GPUs---------------------------------------------------------------------
{'batch_size': 500, 'num_workers': 4, 'pin_memory': True}
../data/ZINC/CCEA.smi
../data/ZINC/BIBD.smi
../data/ZINC/DDCD.smi
../data/ZINC/DAEA.smi
../data/ZINC/CBCB.smi
../data/ZINC/BICA.smi
../data/ZINC/CKCA.smi
../data/ZINC/CACA.smi
../data/ZINC/AGAB.smi
../data/ZINC/BAEA.smi
../data/ZINC/BECB.smi
../data/ZINC/CEAA.smi
../data/ZINC/BJEA.smi
../data/ZINC/BKEB.smi
../data/ZINC/AKEA.smi
Using custom data configuration default-8780d85d721ae806
Reusing dataset zinc (/tmp/silabrata.93/zinc/zinc/default-8780d85d721ae806/0.0.0/81866737128724210ed5dd6e7ce0e6a97d5dca6bdcee21f0d8ab6d3e1fa68ac5)
Global seed set to 12345
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Using environment variable NODE_RANK for node rank (0).
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
fused_weight_gradient_mlp_cuda module not found. gradient accumulation fusion with weight gradient computation disabled.
Using custom data configuration default-aa670d5a2f66e342
Reusing dataset pub_chem (/tmp/silabrata.93/pubchem/pub_chem/default-aa670d5a2f66e342/0.0.0/a6cf7324273df2f2ba223646cb9ecbb36c120aab97c28e8378a3afe9dca94289)
['g073.cluster', 'g074.cluster']
g073.cluster
['g073.cluster']
g073.cluster MASTER_ADDR: g073.cluster
g073.cluster MASTER_PORT: 54966
g073.cluster NODE_RANK 0
g073.cluster NCCL_SOCKET_IFNAME: ib0
g073.cluster NCCL_DEBUG: INFO
g073.cluster NCCL_IB_CUDA_SUPPORT: 1
Using 2 Nodes---------------------------------------------------------------------
Using 2 GPUs---------------------------------------------------------------------
{'batch_size': 500, 'num_workers': 4, 'pin_memory': True}
../data/ZINC/CCEA.smi
../data/ZINC/BIBD.smi
../data/ZINC/DDCD.smi
../data/ZINC/DAEA.smi
../data/ZINC/CBCB.smi
../data/ZINC/BICA.smi
Using custom data configuration default-8780d85d721ae806
Reusing dataset zinc (/tmp/silabrata.93/zinc/zinc/default-8780d85d721ae806/0.0.0/81866737128724210ed5dd6e7ce0e6a97d5dca6bdcee21f0d8ab6d3e1fa68ac5)
Global seed set to 12345
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Global seed set to 12345
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
Global seed set to 12345
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
g073:190341:190341 [0] NCCL INFO Bootstrap : Using [0]ib0:10.73.133.90<0>
g073:190341:190341 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
g073:190341:190341 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.73.133.90<0>
g073:190341:190341 [0] NCCL INFO Using network IB
NCCL version 2.7.8+cuda11.0
g073:190541:190541 [1] NCCL INFO Bootstrap : Using [0]ib0:10.73.133.90<0>
g073:190541:190541 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
g073:190541:190541 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.73.133.90<0>
g073:190541:190541 [1] NCCL INFO Using network IB
However, training does not start and the code hangs at this point: the processes keep running but never make forward progress or report any losses. Can someone help me figure out why it gets stuck? The master address, nodes, and GPUs all appear to be detected correctly.
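For context, this is roughly what my LSF-to-SLURM modification does before the trainer is launched (a sketch, assuming the standard SLURM environment variables; the actual script just sets the same variables that appear in the log above):

import os
import subprocess

# Sketch of the LSF -> SLURM adaptation (assumed, not the exact script):
# expand the allocated node list from SLURM_JOB_NODELIST and use the first
# node as the rendezvous master, instead of reading the LSF host list.
hostnames = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]]
).decode().split()
print(hostnames)  # e.g. ['g073.cluster', 'g074.cluster']

os.environ["MASTER_ADDR"] = hostnames[0]                       # g073.cluster
os.environ["MASTER_PORT"] = "54966"                            # free port on the master node
os.environ["NODE_RANK"] = os.environ.get("SLURM_NODEID", "0")
os.environ["NCCL_SOCKET_IFNAME"] = "ib0"                       # InfiniBand interface
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_IB_CUDA_SUPPORT"] = "1"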