facebookresearch / jepa

PyTorch code and models for V-JEPA self-supervised learning from video.

Training Stops at Initialization with Multi-GPU Setup on Local Machine #83

Open qwertymert opened 3 weeks ago

qwertymert commented 3 weeks ago

Problem description: When running distributed training on multiple GPUs on a single machine, the process stalls at the very start. Initialization completes without any errors, but the code hangs before the training loop begins.

Command used to run training with app/main.py:

python main.py --fname=configs/pretrain/vitl16.yaml --devices cuda:0 cuda:1 cuda:2 cuda:3
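
For reference, the output below comes from the plain command above. Re-running with extra diagnostics enabled may show where the process stalls; NCCL_DEBUG and TORCH_DISTRIBUTED_DEBUG are standard NCCL/PyTorch environment variables, and NCCL_P2P_DISABLE=1 is a common workaround for single-node multi-GPU hangs on some systems (whether it applies here is an assumption):

NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL python main.py --fname=configs/pretrain/vitl16.yaml --devices cuda:0 cuda:1 cuda:2 cuda:3
NCCL_P2P_DISABLE=1 python main.py --fname=configs/pretrain/vitl16.yaml --devices cuda:0 cuda:1 cuda:2 cuda:3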

Output

[INFO    ][2024-10-24 17:10:42][process_main             ] called-params configs/pretrain/vitl16.yaml
[INFO    ][2024-10-24 17:10:42][process_main             ] loaded params...
{   'app': 'vjepa',
    'data': {   'batch_size': 4,
                'clip_duration': None,
                'crop_size': 224,
                'dataset_type': 'XXXDataset',
                'datasets': [   '/home/xxx/xxx/jepa/src/datasets/xxx.csv'],
                'decode_one_clip': True,
                'filter_short_videos': False,
                'num_clips': 1,
                'num_frames': 16,
                'num_workers': 0,
                'patch_size': 16,
                'pin_mem': True,
                'sampling_rate': 1,
                'tubelet_size': 2},
    'data_aug': {   'auto_augment': False,
                    'motion_shift': False,
                    'random_resize_aspect_ratio': [0.75, 1.35],
                    'random_resize_scale': [0.3, 1.0],
                    'reprob': 0.0},
    'logging': {   'folder': '/home/xxx/xxx/jepa/evals/',
                   'write_tag': 'jepa'},
    'loss': {'loss_exp': 1.0, 'reg_coeff': 0.0},
    'mask': [   {   'aspect_ratio': [0.75, 1.5],
                    'max_keep': None,
                    'max_temporal_keep': 1.0,
                    'num_blocks': 8,
                    'spatial_scale': [0.15, 0.15],
                    'temporal_scale': [1.0, 1.0]},
                {   'aspect_ratio': [0.75, 1.5],
                    'max_keep': None,
                    'max_temporal_keep': 1.0,
                    'num_blocks': 2,
                    'spatial_scale': [0.7, 0.7],
                    'temporal_scale': [1.0, 1.0]}],
    'meta': {   'dtype': 'bfloat16',
                'eval_freq': 100,
                'load_checkpoint': True,
                'read_checkpoint': 'vitl16.pth.tar',
                'seed': 234,
                'use_sdpa': True},
    'model': {   'model_name': 'vit_large',
                 'pred_depth': 12,
                 'pred_embed_dim': 384,
                 'uniform_power': True,
                 'use_mask_tokens': True,
                 'zero_init_mask_tokens': True},
    'nodes': 1,
    'optimization': {   'clip_grad': 10.0,
                        'ema': [0.998, 1.0],
                        'epochs': 300,
                        'final_lr': 1e-06,
                        'final_weight_decay': 0.4,
                        'ipe': 300,
                        'ipe_scale': 1.25,
                        'lr': 0.000625,
                        'start_lr': 0.0002,
                        'warmup': 40,
                        'weight_decay': 0.04},
    'tasks_per_node': 4}
[INFO    ][2024-10-24 17:10:44][process_main             ] Running... (rank: 0/4)
[INFO    ][2024-10-24 17:10:44][main                     ] Running pre-training of app: vjepa
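
Since the log ends right after rank 0 reports that it is running, a minimal check that is independent of the V-JEPA code can isolate whether NCCL process-group initialization and a basic collective complete at all on this machine. This is only a sketch; the NCCL backend, world size of 4, and port 29500 are assumptions:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Rendezvous settings for a single machine (29500 is an arbitrary free port).
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # One all-reduce across the GPUs; if this hangs, the problem is in
    # NCCL / GPU communication rather than in the training script itself.
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce ok, value={t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # matches tasks_per_node in the config above
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)

If this script completes on all four ranks, the hang is more likely inside the training setup (e.g. data loading) than in the distributed backend.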

Environment:
Operating System: Ubuntu 24.04 LTS x86_64
Python version: 3.9
PyTorch version: 2.4.1
CUDA version: 12.1
NCCL version: 2.20.5
GPUs: 4 x NVIDIA RTX A5000

What I've Tried: