Problem description: When running distributed training on multiple GPUs on a single machine, the process hangs at the very beginning. Initialization completes without any errors, but the code stalls before the training loop ever starts.
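For context, a minimal CPU-only sketch of the kind of setup described above (the actual app/main.py is not shown, so the structure here is assumed): a hang "before training starts" is typically one rank never reaching the first collective, which leaves every other rank blocked.

```python
# Minimal sketch of a multi-process torch.distributed setup (assumed layout --
# the real app/main.py is not shown in the report).  The real run would use
# the nccl backend with one GPU per rank; gloo is used here so the sketch
# runs on any machine without GPUs.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    t = torch.ones(1) * rank
    # This all_reduce blocks until every rank arrives; if any rank stalls
    # during setup, all the others hang right here with no error message.
    dist.all_reduce(t)
    dist.destroy_process_group()

def run(world_size: int = 2) -> bool:
    # fork avoids re-importing this module in the child processes
    mp.start_processes(worker, args=(world_size,), nprocs=world_size,
                       start_method="fork")
    return True
```

If this minimal version completes but the real script hangs, the problem is likely in the rendezvous or NCCL transport setup rather than in the model code.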
Command used to run training with app/main.py:
Output
Environment:
Operating System: Ubuntu 24.04 LTS x86_64
Python version: 3.9
PyTorch version: 2.4.1
CUDA version: 12.1
NCCL version: 2.20.5
GPUs: 4 x NVIDIA RTX A5000
What I've Tried:
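One standard first diagnostic for NCCL hangs like this (listed here as a suggestion, since the attempts above are not detailed) is enabling the debug output that both NCCL and torch.distributed provide; the launch line is hypothetical, as the actual command is not shown:

```shell
export NCCL_DEBUG=INFO                  # per-rank NCCL init/transport logs
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # extra checks from torch.distributed
# Hypothetical launch line -- the actual command from the report is not shown:
# torchrun --nproc_per_node=4 app/main.py
```

With these variables set, the last log line each rank prints before the hang usually identifies which stage of initialization (rendezvous, NCCL ring setup, or the first collective) is stalling.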