facebookresearch / jepa

PyTorch code and models for V-JEPA self-supervised learning from video.

Crashes after first epoch because of leaked semaphores #36

Open jackhawa opened 8 months ago

jackhawa commented 8 months ago

Hi, I am running an evaluation on a small dataset (a train set of 22 labeled videos and a val set of 2 labeled videos). It crashes after the first epoch, once my RAM is maxed out.

Error received:

/opt/conda/envs/jepa-p10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 88 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
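The resource_tracker warning is printed at shutdown, so it is often a symptom of the process dying (for example, being killed when memory runs out) rather than the cause. To check whether memory really grows from epoch to epoch, a minimal sketch like the one below could be wrapped around the training loop. It uses only the Python standard library and works on Unix only. The names `peak_rss_mib`, `run_epochs`, and `fake_epoch` are hypothetical helpers for illustration and are not part of the jepa repository.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_epochs(num_epochs, run_one_epoch):
    """Call run_one_epoch(epoch) and record peak RSS after each epoch."""
    history = []
    for epoch in range(num_epochs):
        run_one_epoch(epoch)
        history.append(peak_rss_mib())
        print(f"epoch {epoch}: peak RSS {history[-1]:.1f} MiB")
    return history


if __name__ == "__main__":
    retained = []  # stand-in for state that is accidentally kept alive

    def fake_epoch(epoch):
        # Hypothetical workload: keep about 20 MiB alive per epoch to mimic a leak.
        retained.append(b"x" * (20 * 1024 * 1024))

    run_epochs(3, fake_epoch)
```

Note that this measures only the main process. If the data loader uses worker processes, their memory is not included, so a tool such as `top` or `htop` is still needed to see the total.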

Here is the config file:

{   'data': {   'dataset_train': '/home/ubuntu/dev/jepa/val_dataset.csv',
                'dataset_type': 'VideoDataset',
                'dataset_val': '/home/ubuntu/dev/jepa/train_dataset.csv',
                'frame_step': 4,
                'frames_per_clip': 16,
                'num_classes': 2,
                'num_segments': 2,
                'num_views_per_segment': 3},
    'eval_name': 'video_classification_frozen',
    'nodes': 1,
    'optimization': {   'attend_across_segments': True,
                        'batch_size': 1,
                        'final_lr': 0.0,
                        'lr': 0.001,
                        'num_epochs': 20,
                        'resolution': 224,
                        'start_lr': 0.001,
                        'use_bfloat16': True,
                        'warmup': 0.0,
                        'weight_decay': 0.01},
    'pretrain': {   'checkpoint': 'vitl16.pth.tar',
                    'checkpoint_key': 'target_encoder',
                    'clip_duration': None,
                    'folder': './',
                    'frames_per_clip': 16,
                    'model_name': 'vit_large',
                    'patch_size': 16,
                    'tight_silu': False,
                    'tubelet_size': 2,
                    'uniform_power': True,
                    'use_sdpa': True,
                    'use_silu': False,
                    'write_tag': 'jepa'},
    'resume_checkpoint': False,
    'tag': 'ssv2-16x2x3',
    'tasks_per_node': 1}
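As a rough sanity check on the numbers in this config, the sketch below estimates how much memory the decoded clips for a single sample would occupy. It assumes that each sample yields `num_segments * num_views_per_segment` clips, that each clip is a 3-channel tensor of `frames_per_clip` frames at `resolution` x `resolution`, and that values are stored as 4-byte floats. These are assumptions about the data pipeline, not something confirmed in this thread, and `estimate_sample_bytes` is a hypothetical helper.

```python
def estimate_sample_bytes(num_segments, num_views_per_segment,
                          frames_per_clip, resolution,
                          channels=3, bytes_per_value=4):
    """Estimate the bytes needed to hold all clips for one sample.

    Assumes one clip per (segment, view) pair and float32 storage.
    """
    clips = num_segments * num_views_per_segment
    values_per_clip = channels * frames_per_clip * resolution * resolution
    return clips * values_per_clip * bytes_per_value


# Values taken from the config above.
sample_bytes = estimate_sample_bytes(
    num_segments=2,
    num_views_per_segment=3,
    frames_per_clip=16,
    resolution=224,
)
print(f"{sample_bytes} bytes = {sample_bytes / 2**20:.3f} MiB per sample")
```

If these assumptions hold, one sample at `batch_size` 1 comes to roughly 55 MiB of input data, so RAM usage in the tens of GB would have to come from elsewhere, such as video decoding buffers, data loader worker processes, or state retained across iterations.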

Please assist.

tomarvimal commented 8 months ago

Got the same error today after 1 epoch, and the memory consumption during training is outrageous: it takes nearly 25 GB of RAM!