facebookresearch / jepa

PyTorch code and models for V-JEPA self-supervised learning from video.

Training Stops at Initialization with Multi-GPU Setup on Local Machine #83

Open qwertymert opened 3 weeks ago

qwertymert commented 3 weeks ago

Problem description: When running distributed training on multiple GPUs on a single machine, the process stalls at the very start. Initialization completes without any errors, but the code hangs before the training loop begins.

Command used to run training with app/main.py:

python main.py --fname=configs/pretrain/vitl16.yaml --devices cuda:0 cuda:1 cuda:2 cuda:3
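
For reference, the output below comes from the plain command above. Re-running with extra diagnostics enabled may show where the process stalls; NCCL_DEBUG and TORCH_DISTRIBUTED_DEBUG are standard NCCL/PyTorch environment variables, and NCCL_P2P_DISABLE=1 is a common workaround for single-node multi-GPU hangs on some systems (whether it applies here is an assumption):

NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL python main.py --fname=configs/pretrain/vitl16.yaml --devices cuda:0 cuda:1 cuda:2 cuda:3
NCCL_P2P_DISABLE=1 python main.py --fname=configs/pretrain/vitl16.yaml --devices cuda:0 cuda:1 cuda:2 cuda:3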

Output

[INFO    ][2024-10-24 17:10:42][process_main             ] called-params configs/pretrain/vitl16.yaml
[INFO    ][2024-10-24 17:10:42][process_main             ] loaded params...
{   'app': 'vjepa',
    'data': {   'batch_size': 4,
                'clip_duration': None,
                'crop_size': 224,
                'dataset_type': 'XXXDataset',
                'datasets': [   '/home/xxx/xxx/jepa/src/datasets/xxx.csv'],
                'decode_one_clip': True,
                'filter_short_videos': False,
                'num_clips': 1,
                'num_frames': 16,
                'num_workers': 0,
                'patch_size': 16,
                'pin_mem': True,
                'sampling_rate': 1,
                'tubelet_size': 2},
    'data_aug': {   'auto_augment': False,
                    'motion_shift': False,
                    'random_resize_aspect_ratio': [0.75, 1.35],
                    'random_resize_scale': [0.3, 1.0],
                    'reprob': 0.0},
    'logging': {   'folder': '/home/xxx/xxx/jepa/evals/',
                   'write_tag': 'jepa'},
    'loss': {'loss_exp': 1.0, 'reg_coeff': 0.0},
    'mask': [   {   'aspect_ratio': [0.75, 1.5],
                    'max_keep': None,
                    'max_temporal_keep': 1.0,
                    'num_blocks': 8,
                    'spatial_scale': [0.15, 0.15],
                    'temporal_scale': [1.0, 1.0]},
                {   'aspect_ratio': [0.75, 1.5],
                    'max_keep': None,
                    'max_temporal_keep': 1.0,
                    'num_blocks': 2,
                    'spatial_scale': [0.7, 0.7],
                    'temporal_scale': [1.0, 1.0]}],
    'meta': {   'dtype': 'bfloat16',
                'eval_freq': 100,
                'load_checkpoint': True,
                'read_checkpoint': 'vitl16.pth.tar',
                'seed': 234,
                'use_sdpa': True},
    'model': {   'model_name': 'vit_large',
                 'pred_depth': 12,
                 'pred_embed_dim': 384,
                 'uniform_power': True,
                 'use_mask_tokens': True,
                 'zero_init_mask_tokens': True},
    'nodes': 1,
    'optimization': {   'clip_grad': 10.0,
                        'ema': [0.998, 1.0],
                        'epochs': 300,
                        'final_lr': 1e-06,
                        'final_weight_decay': 0.4,
                        'ipe': 300,
                        'ipe_scale': 1.25,
                        'lr': 0.000625,
                        'start_lr': 0.0002,
                        'warmup': 40,
                        'weight_decay': 0.04},
    'tasks_per_node': 4}
[INFO    ][2024-10-24 17:10:44][process_main             ] Running... (rank: 0/4)
[INFO    ][2024-10-24 17:10:44][main                     ] Running pre-training of app: vjepa
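
Since the log ends right after rank 0 reports that it is running, a minimal check that is independent of the V-JEPA code can isolate whether NCCL process-group initialization and a basic collective complete at all on this machine. This is only a sketch; the NCCL backend, world size of 4, and port 29500 are assumptions:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Rendezvous settings for a single machine (29500 is an arbitrary free port).
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # One all-reduce across the GPUs; if this hangs, the problem is in
    # NCCL / GPU communication rather than in the training script itself.
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce ok, value={t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # matches tasks_per_node in the config above
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)

If this script completes on all four ranks, the hang is more likely inside the training setup (e.g. data loading) than in the distributed backend.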

Environment:
Operating System: Ubuntu 24.04 LTS x86_64
Python version: 3.9
PyTorch version: 2.4.1
CUDA version: 12.1
NCCL version: 2.20.5
GPUs: 4 x NVIDIA RTX A5000

What I've Tried: