IBM / molformer

Repository for MolFormer
Apache License 2.0

Issues with pretraining the model in multiple GPU multiple nodes #16

Open silabrata opened 10 months ago

silabrata commented 10 months ago

I am trying to pretrain the model with train_pubchem_light.py on multiple GPUs across multiple nodes, after modifying the launch scripts from LSF to SLURM (the SLURM-side environment setup I use is sketched after the log). For a multi-node, multi-GPU run I get the following log:

```
fused_weight_gradient_mlp_cuda module not found. gradient accumulation fusion with weight gradient computation disabled.
Using custom data configuration default-aa670d5a2f66e342
Reusing dataset pub_chem (/tmp/silabrata.93/pubchem/pub_chem/default-aa670d5a2f66e342/0.0.0/a6cf7324273df2f2ba223646cb9ecbb36c120aab97c28e8378a3afe9dca94289)
['g073.cluster', 'g074.cluster']
g073.cluster
['g073.cluster']
g073.cluster MASTER_ADDR: g073.cluster
g073.cluster MASTER_PORT: 54966
g073.cluster NODE_RANK 0
g073.cluster NCCL_SOCKET_IFNAME: ib0
g073.cluster NCCL_DEBUG: INFO
g073.cluster NCCL_IB_CUDA_SUPPORT: 1
Using 2 Nodes---------------------------------------------------------------------
Using 2 GPUs---------------------------------------------------------------------
{'batch_size': 500, 'num_workers': 4, 'pin_memory': True}
../data/ZINC/CCEA.smi ../data/ZINC/BIBD.smi ../data/ZINC/DDCD.smi ../data/ZINC/DAEA.smi ../data/ZINC/CBCB.smi
../data/ZINC/BICA.smi ../data/ZINC/CKCA.smi ../data/ZINC/CACA.smi ../data/ZINC/AGAB.smi ../data/ZINC/BAEA.smi
../data/ZINC/BECB.smi ../data/ZINC/CEAA.smi ../data/ZINC/BJEA.smi ../data/ZINC/BKEB.smi ../data/ZINC/AKEA.smi
Using custom data configuration default-8780d85d721ae806
Reusing dataset zinc (/tmp/silabrata.93/zinc/zinc/default-8780d85d721ae806/0.0.0/81866737128724210ed5dd6e7ce0e6a97d5dca6bdcee21f0d8ab6d3e1fa68ac5)
Global seed set to 12345
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Using environment variable NODE_RANK for node rank (0).
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
fused_weight_gradient_mlp_cuda module not found. gradient accumulation fusion with weight gradient computation disabled.
Using custom data configuration default-aa670d5a2f66e342
Reusing dataset pub_chem (/tmp/silabrata.93/pubchem/pub_chem/default-aa670d5a2f66e342/0.0.0/a6cf7324273df2f2ba223646cb9ecbb36c120aab97c28e8378a3afe9dca94289)
['g073.cluster', 'g074.cluster']
g073.cluster
['g073.cluster']
g073.cluster MASTER_ADDR: g073.cluster
g073.cluster MASTER_PORT: 54966
g073.cluster NODE_RANK 0
g073.cluster NCCL_SOCKET_IFNAME: ib0
g073.cluster NCCL_DEBUG: INFO
g073.cluster NCCL_IB_CUDA_SUPPORT: 1
Using 2 Nodes---------------------------------------------------------------------
Using 2 GPUs---------------------------------------------------------------------
{'batch_size': 500, 'num_workers': 4, 'pin_memory': True}
../data/ZINC/CCEA.smi ../data/ZINC/BIBD.smi ../data/ZINC/DDCD.smi ../data/ZINC/DAEA.smi ../data/ZINC/CBCB.smi ../data/ZINC/BICA.smi
Using custom data configuration default-8780d85d721ae806
Reusing dataset zinc (/tmp/silabrata.93/zinc/zinc/default-8780d85d721ae806/0.0.0/81866737128724210ed5dd6e7ce0e6a97d5dca6bdcee21f0d8ab6d3e1fa68ac5)
Global seed set to 12345
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Global seed set to 12345
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
Global seed set to 12345
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
g073:190341:190341 [0] NCCL INFO Bootstrap : Using [0]ib0:10.73.133.90<0>
g073:190341:190341 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
g073:190341:190341 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.73.133.90<0>
g073:190341:190341 [0] NCCL INFO Using network IB
NCCL version 2.7.8+cuda11.0
g073:190541:190541 [1] NCCL INFO Bootstrap : Using [0]ib0:10.73.133.90<0>
g073:190541:190541 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
g073:190541:190541 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.73.133.90<0>
g073:190541:190541 [1] NCCL INFO Using network IB
```
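
For reference, this is roughly how I translate the SLURM environment into the variables the LSF launch script normally exports (MASTER_ADDR, MASTER_PORT, NODE_RANK, and the NCCL settings that appear in the log). This is only a sketch of my adaptation: the helper name set_ddp_env_from_slurm is mine, not part of the repo, and it assumes the training script is started once per node via srun.

```python
# Sketch only: map SLURM variables to the env vars that show up in the log above.
# Assumes one launch per node, e.g. `srun --ntasks-per-node=1 python train_pubchem_light.py ...`
import os
import subprocess

def set_ddp_env_from_slurm(master_port: str = "54966") -> None:
    # Expand the compact SLURM node list (e.g. "g[073-074]") into hostnames.
    hostnames = subprocess.run(
        ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    os.environ["MASTER_ADDR"] = hostnames[0]              # first node hosts the rendezvous
    os.environ["MASTER_PORT"] = master_port
    os.environ["NODE_RANK"] = os.environ.get("SLURM_NODEID", "0")
    # NCCL settings carried over from the original LSF setup
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "ib0")
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    os.environ.setdefault("NCCL_IB_CUDA_SUPPORT", "1")
```

I call this before the PyTorch Lightning Trainer is constructed, so the environment variables shown in the log are in place when DDP initializes.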

However, the training does not start: the code hangs at this point. The processes keep running, but DDP never finishes initializing and no losses are ever printed. Can someone help me figure out why it gets stuck? The master address, nodes, and GPUs all seem to be detected correctly.
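
To help narrow this down, here is a minimal torch.distributed sanity check, independent of the MolFormer code, that I can run on the same two nodes (my own test script, not from the repo). If this also hangs at the all_reduce, the problem is in the NCCL/InfiniBand setup rather than in train_pubchem_light.py.

```python
# nccl_check.py -- minimal multi-node NCCL sanity check (sketch, launched with torchrun).
import os
import torch
import torch.distributed as dist

def main() -> None:
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun for each worker
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")      # reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the env
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)                           # sums across all ranks; with 2 nodes x 2 GPUs this prints 4.0
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} all_reduce ok: {x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched on each node with something like `torchrun --nnodes=2 --nproc_per_node=2 --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT nccl_check.py`.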