facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

CPU OOM when resuming training from a given checkpoint #470

Closed CharlieCheckpt closed 2 years ago

CharlieCheckpt commented 2 years ago

Hi,

I tried to resume training of a MoCo v2 experiment from a checkpoint, but the job ran out of CPU memory (CPU OOM). This is surprising because the initial run did not hit this issue and I am using exactly the same resources. It seems related to #235, but I couldn't find a way to make it work. Do you have any idea what could cause this, and how to solve it?
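
One way to narrow this down is to record how much the peak resident memory of a single process grows across the checkpoint-loading step. The following is a minimal sketch using only the Python standard library (Unix only); `peak_rss_mb` and `measure_peak_growth` are hypothetical helper names, and the allocation at the bottom merely stands in for the real load call.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure_peak_growth(fn):
    """Run fn() and return (result, growth of the peak RSS in MiB)."""
    before = peak_rss_mb()
    result = fn()
    return result, peak_rss_mb() - before


# Stand-in for the real load step, for example:
#   lambda: torch.load(checkpoint_path, map_location="cpu")
_, growth = measure_peak_growth(lambda: b"x" * (64 * 1024 * 1024))
print(f"peak RSS grew by about {growth:.0f} MiB")
```

Because the value is a per-process peak, it would have to be logged in each training process on a node to compare the total against the node's memory limit.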

Instructions To Reproduce the 🐛 Bug:

  1. What changes you made: none.

  2. The exact command you ran:

export PYTHONPATH="$EXP_ROOT_DIR/:$PYTHONPATH"
python -u "$EXP_ROOT_DIR/run_distributed_engines_2.py" \
  "${CFG[@]}" \
  hydra.run.dir="$EXP_ROOT_DIR" \
  config.DISTRIBUTED.NUM_NODES=4 \
  config.DISTRIBUTED.NUM_PROC_PER_NODE=4 \
  config.SLURM.USE_SLURM=true \
  config.SLURM.PARTITION="" \
  config.SLURM.CONSTRAINT="v100-32g" \
  config.SLURM.MEM_GB=160 \
  config.SLURM.NUM_CPU_PER_PROC=10 \
  config.DATA.NUM_DATALOADER_WORKERS=10 \
  config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=128 \
  config.SLURM.TIME_HOURS=20 \
  config.SLURM.LOG_FOLDER="$EXP_ROOT_DIR" \
  config.CHECKPOINT.DIR="$CHECKPOINT_DIR" \
  config.DATA.TRAIN.DATASET_NAMES=[ssl_coad_dataset] \
  config.DATA.TRAIN.DATA_SOURCES=[disk_folder] \
  config.DATA.TRAIN.DATA_PATHS=["xxx"]
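
For reference, the per-node resource budget implied by the overrides above can be worked out with a few lines of arithmetic. This is only a sketch: it assumes `config.SLURM.MEM_GB` is a per-node request and that `NUM_DATALOADER_WORKERS` applies to each training process separately; `per_node_budget` is a hypothetical helper, not part of VISSL.

```python
def per_node_budget(num_nodes, num_proc_per_node, mem_gb_per_node,
                    num_cpu_per_proc, num_dataloader_workers):
    """Derive per-node and per-process figures from the overrides.

    Assumptions (not confirmed by the issue): memory is requested per
    node, and every training process starts its own dataloader workers.
    """
    return {
        "world_size": num_nodes * num_proc_per_node,
        "cpus_per_node": num_proc_per_node * num_cpu_per_proc,
        "dataloader_workers_per_node": num_proc_per_node * num_dataloader_workers,
        "mem_gb_per_proc": mem_gb_per_node / num_proc_per_node,
    }


# Values copied from the command above.
budget = per_node_budget(
    num_nodes=4,
    num_proc_per_node=4,
    mem_gb_per_node=160,
    num_cpu_per_proc=10,
    num_dataloader_workers=10,
)
for name, value in budget.items():
    print(f"{name}: {value}")
```

Under these assumptions each node runs 4 training processes plus 40 dataloader workers within 160 GB, which is roughly 40 GB of host memory per training process including its workers.
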
  3. What you observed (including full logs): the .out log of one of the nodes:
submitit INFO (2021-11-14 20:41:39,734) - Starting with JobEnvironment(job_id=1998652, hostname=r9i6n7, local_rank=0(1), node=3(4), global_rank=3(4))
submitit INFO (2021-11-14 20:41:39,735) - Loading pickle: /gpfsscratch/rech/htc/xxx/vissl_runs/checkpoint/xxx/vissl/2021-11-14-20-41-19/1998652_submitted.pkl
INFO 2021-11-14 20:41:42,408 slurm.py:  21: SLURM job: node_name: r9i6n7, node_id: 3
INFO 2021-11-14 20:41:42,579 checkpoint.py: 593: checkpoint_resume_num: 0
INFO 2021-11-14 20:41:42,580 checkpoint.py: 630: Resume from file: /gpfsscratch/rech/htc/xxx/vissl_runs/checkpoint/xxx/vissl/2021-11-14-20-41-19/checkpoints//model_iteration59000.torch
INFO 2021-11-14 20:41:44,467 train.py:  94: Env set for rank: 1, dist_rank: 13
INFO 2021-11-14 20:41:44,467 misc.py: 161: Set start method of multiprocessing to forkserver
INFO 2021-11-14 20:41:44,467 train.py: 105: Setting seed....
INFO 2021-11-14 20:41:44,467 misc.py: 173: MACHINE SEED: 2600
INFO 2021-11-14 20:41:44,468 train.py:  94: Env set for rank: 0, dist_rank: 12
INFO 2021-11-14 20:41:44,469 env.py:  50: ALL_CCFRSCRATCH:  /gpfsscratch/rech/htc/commun
INFO 2021-11-14 20:41:44,469 env.py:  50: ALL_CCFRSTORE:    /gpfsstore/rech/htc/commun
INFO 2021-11-14 20:41:44,469 env.py:  50: ALL_CCFRWORK: /gpfswork/rech/htc/commun
INFO 2021-11-14 20:41:44,469 env.py:  50: BASH_ENV: /gpfslocalsup/spack_soft/environment-modules/4.3.1/gcc-4.8.5-ism7cdy4xverxywj27jvjstqwk5oxe2v/init/bash

#### removed env variables print for github just in case

INFO 2021-11-14 20:41:44,480 misc.py: 161: Set start method of multiprocessing to forkserver
INFO 2021-11-14 20:41:44,480 train.py: 105: Setting seed....
INFO 2021-11-14 20:41:44,480 misc.py: 173: MACHINE SEED: 2400
WARNING 2021-11-14 20:41:44,509 moco_hooks.py:  45: Batch shuffling: True
INFO 2021-11-14 20:41:44,509 tensorboard.py:  49: Tensorboard dir: /gpfsscratch/rech/htc/xxx/vissl_runs/checkpoint/xxx/vissl/2021-11-14-20-41-19/checkpoints//tb_logs
INFO 2021-11-14 20:41:44,514 tensorboard_hook.py:  90: Setting up SSL Tensorboard Hook...
INFO 2021-11-14 20:41:44,514 tensorboard_hook.py: 103: Tensorboard config: log_params: False, log_params_freq: 310, log_params_gradients: True, log_activation_statistics: 0
INFO 2021-11-14 20:41:44,516 trainer_main.py: 113: Using Distributed init method: tcp://r7i0n8:40050, world_size: 16, rank: 13
INFO 2021-11-14 20:41:44,523 hydra_config.py: 132: Training with config:
INFO 2021-11-14 20:41:44,530 hydra_config.py: 141: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False,
                'AUTO_RESUME': True,
                'BACKEND': 'disk',
                'CHECKPOINT_FREQUENCY': 1,
                'CHECKPOINT_ITER_FREQUENCY': 1000,
                'DIR': '/gpfsscratch/rech/htc/xxx/vissl_runs/checkpoint/xxx/vissl/2021-11-14-20-41-19/checkpoints/',
                'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1,
                'OVERWRITE_EXISTING': False,
                'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False},
 'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss',
                'DATA_LIMIT': -1,
                'DATA_LIMIT_SAMPLING': {'SEED': 0},
                'FEATURES': {'DATASET_NAME': '',
                             'DATA_PARTITION': 'TRAIN',
                             'DIMENSIONALITY_REDUCTION': 0,
                             'EXTRACT': False,
                             'LAYER_NAME': '',
                             'PATH': '.',
                             'TEST_PARTITION': 'TEST'},
                'NUM_CLUSTERS': 16000,
                'NUM_ITER': 50,
                'OUTPUT_DIR': '.'},
 'DATA': {'DDP_BUCKET_CAP_MB': 25,
          'ENABLE_ASYNC_GPU_COPY': True,
          'NUM_DATALOADER_WORKERS': 10,
          'PIN_MEMORY': True,
          'TEST': {'BASE_DATASET': 'generic_ssl',
                   'BATCHSIZE_PER_REPLICA': 256,
                   'COLLATE_FUNCTION': 'default_collate',
                   'COLLATE_FUNCTION_PARAMS': {},
                   'COPY_DESTINATION_DIR': '',
                   'COPY_TO_LOCAL_DISK': False,
                   'DATASET_NAMES': ['imagenet1k_folder'],
                   'DATA_LIMIT': -1,
                   'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False,
                                           'SEED': 0,
                                           'SKIP_NUM_SAMPLES': 0},
                   'DATA_PATHS': [],
                   'DATA_SOURCES': [],
                   'DEFAULT_GRAY_IMG_SIZE': 224,
                   'DROP_LAST': False,
                   'ENABLE_QUEUE_DATASET': False,
                   'INPUT_KEY_NAMES': ['data'],
                   'LABEL_PATHS': [],
                   'LABEL_SOURCES': [],
                   'LABEL_TYPE': 'sample_index',
                   'MMAP_MODE': True,
                   'NEW_IMG_PATH_PREFIX': '',
                   'RANDOM_SYNTHETIC_IMAGES': False,
                   'REMOVE_IMG_PATH_PREFIX': '',
                   'TARGET_KEY_NAMES': ['label'],
                   'TRANSFORMS': [],
                   'USE_DEBUGGING_SAMPLER': False,
                   'USE_STATEFUL_DISTRIBUTED_SAMPLER': False},
          'TRAIN': {'BASE_DATASET': 'generic_ssl',
                    'BATCHSIZE_PER_REPLICA': 128,
                    'COLLATE_FUNCTION': 'moco_collator',
                    'COLLATE_FUNCTION_PARAMS': {},
                    'COPY_DESTINATION_DIR': '/tmp/imagenet1k/',
                    'COPY_TO_LOCAL_DISK': False,
                    'DATASET_NAMES': ['ssl_xxx_dataset'],
                    'DATA_LIMIT': -1,
                    'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False,
                                            'SEED': 0,
                                            'SKIP_NUM_SAMPLES': 0},
                    'DATA_PATHS': ['/gpfsscratch/rech/htc/xxx/new_new_rl_dataset/TCGA_xxx/'],
                    'DATA_SOURCES': ['disk_folder'],
                    'DEFAULT_GRAY_IMG_SIZE': 224,
                    'DROP_LAST': True,
                    'ENABLE_QUEUE_DATASET': False,
                    'INPUT_KEY_NAMES': ['data'],
                    'LABEL_PATHS': [],
                    'LABEL_SOURCES': [],
                    'LABEL_TYPE': 'sample_index',
                    'MMAP_MODE': True,
                    'NEW_IMG_PATH_PREFIX': '',
                    'RANDOM_SYNTHETIC_IMAGES': False,
                    'REMOVE_IMG_PATH_PREFIX': '',
                    'TARGET_KEY_NAMES': ['label'],
                    'TRANSFORMS': [{'name': 'ImgReplicatePil', 'num_times': 2},
                                   {'name': 'RandomResizedCrop', 'size': 224},
                                   {'name': 'ImgPilColorDistortion',
                                    'strength': 0.5},
                                   {'name': 'ImgPilGaussianBlur',
                                    'p': 0.5,
                                    'radius_max': 2.0,
                                    'radius_min': 0.1},
                                   {'name': 'RandomHorizontalFlip', 'p': 0.5},
                                   {'name': 'ToTensor'},
                                   {'mean': [0.485, 0.456, 0.406],
                                    'name': 'Normalize',
                                    'std': [0.229, 0.224, 0.225]}],
                    'USE_DEBUGGING_SAMPLER': False,
                    'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}},
 'DISTRIBUTED': {'BACKEND': 'nccl',
                 'BROADCAST_BUFFERS': True,
                 'INIT_METHOD': 'tcp',
                 'MANUAL_GRADIENT_REDUCTION': False,
                 'NCCL_DEBUG': False,
                 'NCCL_SOCKET_NTHREADS': '',
                 'NUM_NODES': 4,
                 'NUM_PROC_PER_NODE': 4,
                 'RUN_ID': 'r7i0n8:40050'},
 'EXTRACT_FEATURES': {'CHUNK_THRESHOLD': 0, 'OUTPUT_DIR': ''},
 'HOOKS': {'CHECK_NAN': True,
           'LOG_GPU_STATS': True,
           'MEMORY_SUMMARY': {'DUMP_MEMORY_ON_EXCEPTION': False,
                              'LOG_ITERATION_NUM': 0,
                              'PRINT_MEMORY_SUMMARY': True},
           'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False,
                                'INPUT_SHAPE': [3, 224, 224]},
           'PERF_STATS': {'MONITOR_PERF_STATS': True,
                          'PERF_STAT_FREQUENCY': -1,
                          'ROLLING_BTIME_FREQ': 313},
           'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'moco_v2_reference',
                                 'FLUSH_EVERY_N_MIN': 20,
                                 'LOG_DIR': '.',
                                 'LOG_PARAMS': False,
                                 'LOG_PARAMS_EVERY_N_ITERS': 310,
                                 'LOG_PARAMS_GRADIENTS': True,
                                 'USE_TENSORBOARD': True}},
 'IMG_RETRIEVAL': {'CROP_QUERY_ROI': False,
                   'DATASET_PATH': '',
                   'DEBUG_MODE': False,
                   'EVAL_BINARY_PATH': '',
                   'EVAL_DATASET_NAME': 'Paris',
                   'FEATS_PROCESSING_TYPE': '',
                   'GEM_POOL_POWER': 4.0,
                   'IMG_SCALINGS': [1],
                   'NORMALIZE_FEATURES': True,
                   'NUM_DATABASE_SAMPLES': -1,
                   'NUM_QUERY_SAMPLES': -1,
                   'NUM_TRAINING_SAMPLES': -1,
                   'N_PCA': 512,
                   'RESIZE_IMG': 1024,
                   'SAVE_FEATURES': False,
                   'SAVE_RETRIEVAL_RANKINGS_SCORES': True,
                   'SIMILARITY_MEASURE': 'cosine_similarity',
                   'SPATIAL_LEVELS': 3,
                   'TRAIN_DATASET_NAME': 'Oxford',
                   'TRAIN_PCA_WHITENING': True,
                   'USE_DISTRACTORS': False,
                   'WHITEN_IMG_LIST': ''},
 'LOG_FREQUENCY': 200,
 'LOSS': {'CrossEntropyLoss': {'ignore_index': -1},
          'barlow_twins_loss': {'embedding_dim': 8192,
                                'lambda_': 0.0051,
                                'scale_loss': 0.024},
          'bce_logits_multiple_output_single_target': {'normalize_output': False,
                                                       'reduction': 'none',
                                                       'world_size': 1},
          'cross_entropy_multiple_output_single_target': {'ignore_index': -1,
                                                          'normalize_output': False,
                                                          'reduction': 'mean',
                                                          'temperature': 1.0,
                                                          'weight': None},
          'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 256,
                                 'DROP_LAST': True,
                                 'kmeans_iters': 10,
                                 'memory_params': {'crops_for_mb': [0],
                                                   'embedding_dim': 128},
                                 'num_clusters': [3000, 3000, 3000],
                                 'num_crops': 2,
                                 'num_train_samples': -1,
                                 'temperature': 0.1},
          'dino_loss': {'crops_for_teacher': [0, 1],
                        'ema_center': 0.9,
                        'momentum': 0.996,
                        'normalize_last_layer': True,
                        'output_dim': 65536,
                        'student_temp': 0.1,
                        'teacher_temp_max': 0.07,
                        'teacher_temp_min': 0.04,
                        'teacher_temp_warmup_iters': 37500},
          'moco_loss': {'embedding_dim': 128,
                        'momentum': 0.999,
                        'queue_size': 65536,
                        'temperature': 0.2},
          'multicrop_simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096,
                                                               'embedding_dim': 128,
                                                               'world_size': 64},
                                             'num_crops': 2,
                                             'temperature': 0.1},
          'name': 'moco_loss',
          'nce_loss_with_memory': {'loss_type': 'nce',
                                   'loss_weights': [1.0],
                                   'memory_params': {'embedding_dim': 128,
                                                     'memory_size': -1,
                                                     'momentum': 0.5,
                                                     'norm_init': True,
                                                     'update_mem_on_forward': True},
                                   'negative_sampling_params': {'num_negatives': 16000,
                                                                'type': 'random'},
                                   'norm_constant': -1,
                                   'norm_embedding': True,
                                   'num_train_samples': -1,
                                   'temperature': 0.07,
                                   'update_mem_with_emb_index': -100},
          'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096,
                                                     'embedding_dim': 128,
                                                     'world_size': 64},
                                   'temperature': 0.1},
          'swav_loss': {'crops_for_assign': [0, 1],
                        'embedding_dim': 128,
                        'epsilon': 0.05,
                        'normalize_last_layer': True,
                        'num_crops': 2,
                        'num_iters': 3,
                        'num_prototypes': [3000],
                        'output_dir': '.',
                        'queue': {'local_queue_length': 0,
                                  'queue_length': 0,
                                  'start_iter': 0},
                        'temp_hard_assignment_iters': 0,
                        'temperature': 0.1,
                        'use_double_precision': False},
          'swav_momentum_loss': {'crops_for_assign': [0, 1],
                                 'embedding_dim': 128,
                                 'epsilon': 0.05,
                                 'momentum': 0.99,
                                 'momentum_eval_mode_iter_start': 0,
                                 'normalize_last_layer': True,
                                 'num_crops': 2,
                                 'num_iters': 3,
                                 'num_prototypes': [3000],
                                 'queue': {'local_queue_length': 0,
                                           'queue_length': 0,
                                           'start_iter': 0},
                                 'temperature': 0.1,
                                 'use_double_precision': False}},
 'MACHINE': {'DEVICE': 'gpu'},
 'METERS': {'accuracy_list_meter': {'meter_names': [],
                                    'num_meters': 1,
                                    'topk_values': [1]},
            'enable_training_meter': True,
            'mean_ap_list_meter': {'max_cpu_capacity': -1,
                                   'meter_names': [],
                                   'num_classes': 9605,
                                   'num_meters': 1},
            'model_output_mask': False,
            'name': '',
            'names': [],
            'precision_at_k_list_meter': {'meter_names': [],
                                          'num_meters': 1,
                                          'topk_values': [1]},
            'recall_at_k_list_meter': {'meter_names': [],
                                       'num_meters': 1,
                                       'topk_values': [1]}},
 'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2,
                                        'USE_ACTIVATION_CHECKPOINTING': False},
           'AMP_PARAMS': {'AMP_ARGS': {'opt_level': 'O1'},
                          'AMP_TYPE': 'apex',
                          'USE_AMP': False},
           'BASE_MODEL_NAME': 'multi_input_output_model',
           'CUDA_CACHE': {'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100},
           'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False,
                                     'EVAL_TRUNK_AND_HEAD': False,
                                     'EXTRACT_TRUNK_FEATURES_ONLY': False,
                                     'FREEZE_TRUNK_AND_HEAD': False,
                                     'FREEZE_TRUNK_ONLY': False,
                                     'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [],
                                     'SHOULD_FLATTEN_FEATS': True},
           'FSDP_CONFIG': {'AUTO_WRAP_THRESHOLD': 0,
                           'bucket_cap_mb': 0,
                           'clear_autocast_cache': True,
                           'compute_dtype': torch.float32,
                           'flatten_parameters': True,
                           'fp32_reduce_scatter': False,
                           'mixed_precision': True,
                           'verbose': True},
           'GRAD_CLIP': {'MAX_NORM': 1, 'NORM_TYPE': 2, 'USE_GRAD_CLIP': False},
           'HEAD': {'BATCHNORM_EPS': 1e-05,
                    'BATCHNORM_MOMENTUM': 0.1,
                    'PARAMS': [['mlp',
                                {'dims': [2048, 2048],
                                 'skip_last_layer_relu_bn': False,
                                 'use_relu': True}],
                               ['mlp', {'dims': [2048, 128]}]],
                    'PARAMS_MULTIPLIER': 1.0},
           'INPUT_TYPE': 'rgb',
           'MULTI_INPUT_HEAD_MAPPING': [],
           'NON_TRAINABLE_PARAMS': [],
           'SHARDED_DDP_SETUP': {'USE_SDP': False, 'reduce_buffer_size': -1},
           'SINGLE_PASS_EVERY_CROP': False,
           'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': False,
                              'GROUP_SIZE': -1,
                              'SYNC_BN_TYPE': 'pytorch'},
           'TEMP_FROZEN_PARAMS_ITER_MAP': [],
           'TRUNK': {'CONVIT': {'CLASS_TOKEN_IN_LOCAL_LAYERS': False,
                                'LOCALITY_DIM': 10,
                                'LOCALITY_STRENGTH': 1.0,
                                'N_GPSA_LAYERS': 10,
                                'USE_LOCAL_INIT': True},
                     'EFFICIENT_NETS': {},
                     'NAME': 'resnet',
                     'REGNET': {},
                     'RESNETS': {'DEPTH': 50,
                                 'GROUPNORM_GROUPS': 32,
                                 'GROUPS': 1,
                                 'LAYER4_STRIDE': 2,
                                 'NORM': 'BatchNorm',
                                 'STANDARDIZE_CONVOLUTIONS': False,
                                 'WIDTH_MULTIPLIER': 2,
                                 'WIDTH_PER_GROUP': 64,
                                 'ZERO_INIT_RESIDUAL': True},
                     'VISION_TRANSFORMERS': {'ATTENTION_DROPOUT_RATE': 0,
                                             'CLASSIFIER': 'token',
                                             'DROPOUT_RATE': 0,
                                             'DROP_PATH_RATE': 0,
                                             'HIDDEN_DIM': 768,
                                             'IMAGE_SIZE': 224,
                                             'MLP_DIM': 3072,
                                             'NUM_HEADS': 12,
                                             'NUM_LAYERS': 12,
                                             'PATCH_SIZE': 16,
                                             'QKV_BIAS': False,
                                             'QK_SCALE': False,
                                             'name': None},
                     'XCIT': {'ATTENTION_DROPOUT_RATE': 0,
                              'DROPOUT_RATE': 0,
                              'DROP_PATH_RATE': 0.05,
                              'ETA': 1,
                              'HIDDEN_DIM': 384,
                              'IMAGE_SIZE': 224,
                              'NUM_HEADS': 8,
                              'NUM_LAYERS': 12,
                              'PATCH_SIZE': 16,
                              'QKV_BIAS': True,
                              'QK_SCALE': False,
                              'TOKENS_NORM': True,
                              'name': None}},
           'WEIGHTS_INIT': {'APPEND_PREFIX': '',
                            'PARAMS_FILE': '',
                            'REMOVE_PREFIX': '',
                            'SKIP_LAYERS': ['num_batches_tracked'],
                            'STATE_DICT_KEY_NAME': 'classy_state_dict'},
           '_MODEL_INIT_SEED': 0},
 'MONITORING': {'MONITOR_ACTIVATION_STATISTICS': 0},
 'MULTI_PROCESSING_METHOD': 'forkserver',
 'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200},
 'OPTIMIZER': {'betas': [0.9, 0.999],
               'construct_single_param_group_only': False,
               'head_optimizer_params': {'use_different_lr': False,
                                         'use_different_wd': False,
                                         'weight_decay': 0.0001},
               'larc_config': {'clip': False,
                               'eps': 1e-08,
                               'trust_coefficient': 0.001},
               'momentum': 0.9,
               'name': 'sgd',
               'nesterov': True,
               'non_regularized_parameters': [],
               'num_epochs': 200,
               'param_schedulers': {'lr': {'auto_lr_scaling': {'auto_scale': False,
                                                               'base_lr_batch_size': 256,
                                                               'base_value': 0.1,
                                                               'scaling_type': 'linear'},
                                           'end_value': 0.0,
                                           'interval_scaling': [],
                                           'lengths': [],
                                           'milestones': [120, 160],
                                           'name': 'multistep',
                                           'schedulers': [],
                                           'start_value': 0.1,
                                           'update_interval': 'epoch',
                                           'value': 0.1,
                                           'values': [0.03, 0.003, 0.0003]},
                                    'lr_head': {'auto_lr_scaling': {'auto_scale': False,
                                                                    'base_lr_batch_size': 256,
                                                                    'base_value': 0.1,
                                                                    'scaling_type': 'linear'},
                                                'end_value': 0.0,
                                                'interval_scaling': [],
                                                'lengths': [],
                                                'milestones': [120, 160],
                                                'name': 'multistep',
                                                'schedulers': [],
                                                'start_value': 0.1,
                                                'update_interval': 'epoch',
                                                'value': 0.1,
                                                'values': [0.03,
                                                           0.003,
                                                           0.0003]}},
               'regularize_bias': True,
               'regularize_bn': True,
               'use_larc': False,
               'use_zero': False,
               'weight_decay': 0.0001},
 'PROFILING': {'MEMORY_PROFILING': {'TRACK_BY_LAYER_MEMORY': False},
               'NUM_ITERATIONS': 10,
               'OUTPUT_FOLDER': '.',
               'PROFILED_RANKS': [0, 1],
               'RUNTIME_PROFILING': {'LEGACY_PROFILER': False,
                                     'PROFILE_CPU': True,
                                     'PROFILE_GPU': True,
                                     'USE_PROFILER': False},
               'START_ITERATION': 0,
               'STOP_TRAINING_AFTER_PROFILING': False,
               'WARMUP_ITERATIONS': 0},
 'REPRODUCIBILITY': {'CUDDN_DETERMINISTIC': False},
 'SEED_VALUE': 0,
 'SLURM': {'ADDITIONAL_PARAMETERS': {'hint': 'nomultithread',
                                     'qos': 'qos_gpu-t3'},
           'COMMENT': 'vissl job',
           'CONSTRAINT': 'v100-32g',
           'LOG_FOLDER': '/gpfsscratch/rech/htc/xxx/vissl_runs/checkpoint/xxx/vissl/2021-11-14-20-41-19',
           'MEM_GB': 160,
           'NAME': 'vissl',
           'NUM_CPU_PER_PROC': 10,
           'PARTITION': '',
           'PORT_ID': 40050,
           'TIME_HOURS': 20,
           'TIME_MINUTES': 0,
           'USE_SLURM': True},
 'SVM': {'cls_list': [],
         'costs': {'base': -1.0,
                   'costs_list': [0.1, 0.01],
                   'power_range': [4, 20]},
         'cross_val_folds': 3,
         'dual': True,
         'force_retrain': False,
         'loss': 'squared_hinge',
         'low_shot': {'dataset_name': 'voc',
                      'k_values': [1, 2, 4, 8, 16, 32, 64, 96],
                      'sample_inds': [1, 2, 3, 4, 5]},
         'max_iter': 2000,
         'normalize': True,
         'penalty': 'l2'},
 'TEST_EVERY_NUM_EPOCH': 1,
 'TEST_MODEL': False,
 'TEST_ONLY': False,
 'TRAINER': {'TASK_NAME': 'self_supervision_task',
             'TRAIN_STEP_NAME': 'standard_train_step'},
 'VERBOSE': False}
INFO 2021-11-14 20:41:44,560 train.py:  94: Env set for rank: 2, dist_rank: 14
INFO 2021-11-14 20:41:44,560 misc.py: 161: Set start method of multiprocessing to forkserver
INFO 2021-11-14 20:41:44,560 train.py: 105: Setting seed....
INFO 2021-11-14 20:41:44,560 misc.py: 173: MACHINE SEED: 2800
INFO 2021-11-14 20:41:44,574 train.py:  94: Env set for rank: 3, dist_rank: 15
INFO 2021-11-14 20:41:44,574 misc.py: 161: Set start method of multiprocessing to forkserver
INFO 2021-11-14 20:41:44,574 train.py: 105: Setting seed....
INFO 2021-11-14 20:41:44,574 misc.py: 173: MACHINE SEED: 3000
WARNING 2021-11-14 20:41:44,593 moco_hooks.py:  45: Batch shuffling: True
INFO 2021-11-14 20:41:44,593 tensorboard.py:  49: Tensorboard dir: /gpfsscratch/rech/htc/xxx/vissl_runs/checkpoint/xxx/vissl/2021-11-14-20-41-19/checkpoints//tb_logs
INFO 2021-11-14 20:41:44,595 tensorboard_hook.py:  90: Setting up SSL Tensorboard Hook...
INFO 2021-11-14 20:41:44,596 tensorboard_hook.py: 103: Tensorboard config: log_params: False, log_params_freq: 310, log_params_gradients: True, log_activation_statistics: 0
INFO 2021-11-14 20:41:44,596 trainer_main.py: 113: Using Distributed init method: tcp://r7i0n8:40050, world_size: 16, rank: 14
WARNING 2021-11-14 20:41:44,611 moco_hooks.py:  45: Batch shuffling: True
INFO 2021-11-14 20:41:44,617 tensorboard.py:  49: Tensorboard dir: /gpfsscratch/rech/htc/xxx/vissl_runs/checkpoint/xxx/vissl/2021-11-14-20-41-19/checkpoints//tb_logs
INFO 2021-11-14 20:41:44,620 tensorboard_hook.py:  90: Setting up SSL Tensorboard Hook...
INFO 2021-11-14 20:41:44,620 tensorboard_hook.py: 103: Tensorboard config: log_params: False, log_params_freq: 310, log_params_gradients: True, log_activation_statistics: 0
INFO 2021-11-14 20:41:44,620 trainer_main.py: 113: Using Distributed init method: tcp://r7i0n8:40050, world_size: 16, rank: 15
INFO 2021-11-14 20:41:46,506 train.py: 117: System config:
-------------------  ------------------------------------------------------------------------------------------
sys.platform         linux
Python               3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
numpy                1.19.5
Pillow               8.4.0
vissl                0.1.6 @/gpfsdswork/projects/rech/htc/xxx/workspace/vissl/vissl
GPU available        True
GPU 0,1,2,3          Tesla V100-SXM2-32GB
CUDA_HOME            /gpfslocalsys/cuda/10.2
torchvision          0.11.1+cu102 @/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/site-packages/torchvision
hydra                1.0.7 @/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/site-packages/hydra
classy_vision        0.7.0.dev @/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/site-packages/classy_vision
tensorboard          2.7.0
apex                 0.1 @/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/site-packages/apex
cv2                  4.5.4-dev
PyTorch              1.10.0+cu102 @/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/site-packages/torch
PyTorch debug build  False
-------------------  ------------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

CPU info:
-------------------  ----------------------------------------
Architecture         x86_64
CPU op-mode(s)       32-bit, 64-bit
Byte Order           Little Endian
CPU(s)               80
On-line CPU(s) list  0-79
Thread(s) per core   2
Core(s) per socket   20
Socket(s)            2
NUMA node(s)         2
Vendor ID            GenuineIntel
CPU family           6
Model                85
Model name           Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Stepping             7
CPU MHz              2500.006
CPU max MHz          3900.0000
CPU min MHz          1000.0000
BogoMIPS             5000.00
Virtualization       VT-x
L1d cache            32K
L1i cache            32K
L2 cache             1024K
L3 cache             28160K
NUMA node0 CPU(s)    0-19,40-59
NUMA node1 CPU(s)    20-39,60-79
-------------------  ----------------------------------------
WARNING 2021-11-14 20:41:46,507 moco_hooks.py:  45: Batch shuffling: True
INFO 2021-11-14 20:41:46,507 tensorboard.py:  49: Tensorboard dir: /gpfsscratch/rech/htc/xxx/vissl_runs/checkpoint/xxx/vissl/2021-11-14-20-41-19/checkpoints//tb_logs
INFO 2021-11-14 20:41:46,511 tensorboard_hook.py:  90: Setting up SSL Tensorboard Hook...
INFO 2021-11-14 20:41:46,512 tensorboard_hook.py: 103: Tensorboard config: log_params: False, log_params_freq: 310, log_params_gradients: True, log_activation_statistics: 0
INFO 2021-11-14 20:41:46,512 trainer_main.py: 113: Using Distributed init method: tcp://r7i0n8:40050, world_size: 16, rank: 12
INFO 2021-11-14 20:41:47,538 trainer_main.py: 134: | initialized host r9i6n7 as rank 14 (14)
INFO 2021-11-14 20:41:47,539 trainer_main.py: 134: | initialized host r9i6n7 as rank 13 (13)
INFO 2021-11-14 20:41:47,541 trainer_main.py: 134: | initialized host r9i6n7 as rank 15 (15)
INFO 2021-11-14 20:41:47,544 trainer_main.py: 134: | initialized host r9i6n7 as rank 12 (12)
INFO 2021-11-14 20:41:52,804 train_task.py: 181: Not using Automatic Mixed Precision
INFO 2021-11-14 20:41:52,805 train_task.py: 181: Not using Automatic Mixed Precision
INFO 2021-11-14 20:41:52,805 train_task.py: 181: Not using Automatic Mixed Precision
INFO 2021-11-14 20:41:52,805 train_task.py: 181: Not using Automatic Mixed Precision
INFO 2021-11-14 20:41:52,805 train_task.py: 455: Building model....
INFO 2021-11-14 20:41:52,805 train_task.py: 455: Building model....
INFO 2021-11-14 20:41:52,805 train_task.py: 455: Building model....
INFO 2021-11-14 20:41:52,805 resnext.py:  68: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2021-11-14 20:41:52,805 train_task.py: 455: Building model....
INFO 2021-11-14 20:41:52,806 resnext.py:  88: Building model: ResNeXt50-1x64d-w2-BatchNorm2d
INFO 2021-11-14 20:41:52,806 resnext.py:  68: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2021-11-14 20:41:52,806 resnext.py:  68: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2021-11-14 20:41:52,806 resnext.py:  88: Building model: ResNeXt50-1x64d-w2-BatchNorm2d
INFO 2021-11-14 20:41:52,806 resnext.py:  68: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2021-11-14 20:41:52,806 resnext.py:  88: Building model: ResNeXt50-1x64d-w2-BatchNorm2d
INFO 2021-11-14 20:41:52,806 resnext.py:  88: Building model: ResNeXt50-1x64d-w2-BatchNorm2d
INFO 2021-11-14 20:41:54,503 train_task.py: 657: Broadcast model BN buffers from primary on every forward pass
INFO 2021-11-14 20:41:54,504 classification_task.py: 387: Synchronized Batch Normalization is disabled
INFO 2021-11-14 20:41:54,549 train_task.py: 657: Broadcast model BN buffers from primary on every forward pass
INFO 2021-11-14 20:41:54,549 train_task.py: 657: Broadcast model BN buffers from primary on every forward pass
INFO 2021-11-14 20:41:54,550 classification_task.py: 387: Synchronized Batch Normalization is disabled
INFO 2021-11-14 20:41:54,550 classification_task.py: 387: Synchronized Batch Normalization is disabled
INFO 2021-11-14 20:41:54,565 optimizer_helper.py: 294:
Trainable params: 163,
Non-Trainable params: 0,
Trunk Regularized Parameters: 159,
Trunk Unregularized Parameters 0,
Head Regularized Parameters: 4,
Head Unregularized Parameters: 0
Remaining Regularized Parameters: 0
Remaining Unregularized Parameters: 0
INFO 2021-11-14 20:41:54,565 util.py: 240: Broadcasting checkpoint loaded from /gpfsscratch/rech/htc/xxx/vissl_runs/checkpoint/xxx/vissl/2021-11-14-20-41-19/checkpoints//model_iteration59000.torch
INFO 2021-11-14 20:41:54,573 train_task.py: 657: Broadcast model BN buffers from primary on every forward pass
INFO 2021-11-14 20:41:54,573 classification_task.py: 387: Synchronized Batch Normalization is disabled
INFO 2021-11-14 20:41:54,615 optimizer_helper.py: 294:
Trainable params: 163,
Non-Trainable params: 0,
Trunk Regularized Parameters: 159,
Trunk Unregularized Parameters 0,
Head Regularized Parameters: 4,
Head Unregularized Parameters: 0
Remaining Regularized Parameters: 0
Remaining Unregularized Parameters: 0
INFO 2021-11-14 20:41:54,616 util.py: 240: Broadcasting checkpoint loaded from /gpfsscratch/rech/htc/xxx/vissl_runs/checkpoint/xxx/vissl/2021-11-14-20-41-19/checkpoints//model_iteration59000.torch
INFO 2021-11-14 20:41:54,622 optimizer_helper.py: 294:
Trainable params: 163,
Non-Trainable params: 0,
Trunk Regularized Parameters: 159,
Trunk Unregularized Parameters 0,
Head Regularized Parameters: 4,
Head Unregularized Parameters: 0
Remaining Regularized Parameters: 0
Remaining Unregularized Parameters: 0
INFO 2021-11-14 20:41:54,622 util.py: 240: Broadcasting checkpoint loaded from /gpfsscratch/rech/htc/xxx/vissl_runs/checkpoint/xxx/vissl/2021-11-14-20-41-19/checkpoints//model_iteration59000.torch
INFO 2021-11-14 20:41:54,661 optimizer_helper.py: 294:
Trainable params: 163,
Non-Trainable params: 0,
Trunk Regularized Parameters: 159,
Trunk Unregularized Parameters 0,
Head Regularized Parameters: 4,
Head Unregularized Parameters: 0
Remaining Regularized Parameters: 0
Remaining Unregularized Parameters: 0
INFO 2021-11-14 20:41:54,662 util.py: 240: Broadcasting checkpoint loaded from /gpfsscratch/rech/htc/xxx/vissl_runs/checkpoint/xxx/vissl/2021-11-14-20-41-19/checkpoints//model_iteration59000.torch
INFO 2021-11-14 20:42:29,613 img_replicate_pil.py:  52: ImgReplicatePil | Using num_times: 2
INFO 2021-11-14 20:42:29,613 img_pil_color_distortion.py:  56: ImgPilColorDistortion | Using strength: 0.5
INFO 2021-11-14 20:42:29,614 ssl_dataset.py: 157: Rank: 1 split: TRAIN Data files:
['/gpfsscratch/rech/htc/xxx/new_new_rl_dataset/TCGA_xxx/']
INFO 2021-11-14 20:42:29,614 ssl_dataset.py: 160: Rank: 1 split: TRAIN Label files:
[]
INFO 2021-11-14 20:42:29,711 img_replicate_pil.py:  52: ImgReplicatePil | Using num_times: 2
INFO 2021-11-14 20:42:29,712 img_pil_color_distortion.py:  56: ImgPilColorDistortion | Using strength: 0.5
INFO 2021-11-14 20:42:29,712 ssl_dataset.py: 157: Rank: 2 split: TRAIN Data files:
['/gpfsscratch/rech/htc/xxx/new_new_rl_dataset/TCGA_xxx/']
INFO 2021-11-14 20:42:29,712 ssl_dataset.py: 160: Rank: 2 split: TRAIN Label files:
[]
INFO 2021-11-14 20:42:29,758 img_replicate_pil.py:  52: ImgReplicatePil | Using num_times: 2
INFO 2021-11-14 20:42:29,758 img_pil_color_distortion.py:  56: ImgPilColorDistortion | Using strength: 0.5
INFO 2021-11-14 20:42:29,759 ssl_dataset.py: 157: Rank: 3 split: TRAIN Data files:
['/gpfsscratch/rech/htc/xxx/new_new_rl_dataset/TCGA_xxx/']
INFO 2021-11-14 20:42:29,759 ssl_dataset.py: 160: Rank: 3 split: TRAIN Label files:
[]
INFO 2021-11-14 20:42:29,773 img_replicate_pil.py:  52: ImgReplicatePil | Using num_times: 2
INFO 2021-11-14 20:42:29,774 img_pil_color_distortion.py:  56: ImgPilColorDistortion | Using strength: 0.5
INFO 2021-11-14 20:42:29,774 ssl_dataset.py: 157: Rank: 0 split: TRAIN Data files:
['/gpfsscratch/rech/htc/xxx/new_new_rl_dataset/TCGA_xxx/']
INFO 2021-11-14 20:42:29,774 ssl_dataset.py: 160: Rank: 0 split: TRAIN Label files:
[]
INFO 2021-11-14 20:43:00,895 disk_dataset.py:  86: Loaded 4231831 samples from folder /gpfsscratch/rech/htc/xxx/new_new_rl_dataset/TCGA_xxx/
INFO 2021-11-14 20:43:00,895 misc.py: 161: Set start method of multiprocessing to forkserver
INFO 2021-11-14 20:43:00,916 disk_dataset.py:  86: Loaded 4231831 samples from folder /gpfsscratch/rech/htc/xxx/new_new_rl_dataset/TCGA_xxx/
INFO 2021-11-14 20:43:00,916 misc.py: 161: Set start method of multiprocessing to forkserver
INFO 2021-11-14 20:43:00,917 __init__.py: 126: Created the Distributed Sampler....
INFO 2021-11-14 20:43:00,917 __init__.py: 101: Distributed Sampler config:
{'num_replicas': 16, 'rank': 15, 'epoch': 0, 'num_samples': 264490, 'total_size': 4231840, 'shuffle': True, 'seed': 0}
INFO 2021-11-14 20:43:00,917 __init__.py: 215: Wrapping the dataloader to async device copies
INFO 2021-11-14 20:43:00,923 train_task.py: 384: Building loss...
INFO 2021-11-14 20:43:00,924 disk_dataset.py:  86: Loaded 4231831 samples from folder /gpfsscratch/rech/htc/xxx/new_new_rl_dataset/TCGA_xxx/
INFO 2021-11-14 20:43:00,924 misc.py: 161: Set start method of multiprocessing to forkserver
INFO 2021-11-14 20:43:00,932 disk_dataset.py:  86: Loaded 4231831 samples from folder /gpfsscratch/rech/htc/xxx/new_new_rl_dataset/TCGA_xxx/
INFO 2021-11-14 20:43:00,932 misc.py: 161: Set start method of multiprocessing to forkserver
INFO 2021-11-14 20:43:00,938 __init__.py: 126: Created the Distributed Sampler....
INFO 2021-11-14 20:43:00,938 __init__.py: 101: Distributed Sampler config:
{'num_replicas': 16, 'rank': 13, 'epoch': 0, 'num_samples': 264490, 'total_size': 4231840, 'shuffle': True, 'seed': 0}
INFO 2021-11-14 20:43:00,939 __init__.py: 215: Wrapping the dataloader to async device copies
INFO 2021-11-14 20:43:00,944 train_task.py: 384: Building loss...
INFO 2021-11-14 20:43:00,946 __init__.py: 126: Created the Distributed Sampler....
INFO 2021-11-14 20:43:00,946 __init__.py: 101: Distributed Sampler config:
{'num_replicas': 16, 'rank': 14, 'epoch': 0, 'num_samples': 264490, 'total_size': 4231840, 'shuffle': True, 'seed': 0}
INFO 2021-11-14 20:43:00,946 __init__.py: 215: Wrapping the dataloader to async device copies
INFO 2021-11-14 20:43:00,952 train_task.py: 384: Building loss...
INFO 2021-11-14 20:43:00,954 __init__.py: 126: Created the Distributed Sampler....
INFO 2021-11-14 20:43:00,954 __init__.py: 101: Distributed Sampler config:
{'num_replicas': 16, 'rank': 12, 'epoch': 0, 'num_samples': 264490, 'total_size': 4231840, 'shuffle': True, 'seed': 0}
INFO 2021-11-14 20:43:00,955 __init__.py: 215: Wrapping the dataloader to async device copies
INFO 2021-11-14 20:43:00,962 train_task.py: 384: Building loss...
INFO 2021-11-14 20:43:01,104 moco_loss.py: 173: Storing the checkpoint for later use
INFO 2021-11-14 20:43:01,105 train_task.py: 759: ======Loaded loss state from checkpoint======
INFO 2021-11-14 20:43:01,105 train_task.py: 576: =======Updating classy state_dict from checkpoint=======
INFO 2021-11-14 20:43:01,105 base_ssl_model.py: 446: Rank 3: Loading Trunk state dict....
INFO 2021-11-14 20:43:01,131 moco_loss.py: 173: Storing the checkpoint for later use
INFO 2021-11-14 20:43:01,131 train_task.py: 759: ======Loaded loss state from checkpoint======
INFO 2021-11-14 20:43:01,132 train_task.py: 576: =======Updating classy state_dict from checkpoint=======
INFO 2021-11-14 20:43:01,132 base_ssl_model.py: 446: Rank 1: Loading Trunk state dict....
INFO 2021-11-14 20:43:01,138 moco_loss.py: 173: Storing the checkpoint for later use
INFO 2021-11-14 20:43:01,138 train_task.py: 759: ======Loaded loss state from checkpoint======
INFO 2021-11-14 20:43:01,138 train_task.py: 576: =======Updating classy state_dict from checkpoint=======
INFO 2021-11-14 20:43:01,138 base_ssl_model.py: 446: Rank 0: Loading Trunk state dict....
INFO 2021-11-14 20:43:01,146 moco_loss.py: 173: Storing the checkpoint for later use
INFO 2021-11-14 20:43:01,146 train_task.py: 759: ======Loaded loss state from checkpoint======
INFO 2021-11-14 20:43:01,146 train_task.py: 576: =======Updating classy state_dict from checkpoint=======
INFO 2021-11-14 20:43:01,146 base_ssl_model.py: 446: Rank 2: Loading Trunk state dict....
INFO 2021-11-14 20:43:01,199 base_ssl_model.py: 459: Rank 3: Loading Heads state dict....
INFO 2021-11-14 20:43:01,203 base_ssl_model.py: 459: Rank 2: Loading Heads state dict....
INFO 2021-11-14 20:43:01,204 base_ssl_model.py: 474: Rank 3: Model state dict loaded!
INFO 2021-11-14 20:43:01,206 base_ssl_model.py: 474: Rank 2: Model state dict loaded!
INFO 2021-11-14 20:43:01,207 base_ssl_model.py: 459: Rank 0: Loading Heads state dict....
INFO 2021-11-14 20:43:01,211 base_ssl_model.py: 474: Rank 0: Model state dict loaded!
INFO 2021-11-14 20:43:01,212 base_ssl_model.py: 459: Rank 1: Loading Heads state dict....
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.conv1.weight                              of shape: torch.Size([64, 3, 7, 7]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.bn1.weight                                of shape: torch.Size([64]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.bn1.bias                                  of shape: torch.Size([64]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.bn1.running_mean                          of shape: torch.Size([64]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.bn1.running_var                           of shape: torch.Size([64]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 662: Ignored layer: _feature_blocks.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.conv1.weight                     of shape: torch.Size([128, 64, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.bn1.weight                       of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.bn1.bias                         of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.bn1.running_mean                 of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.bn1.running_var                  of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 662: Ignored layer: _feature_blocks.layer1.0.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.conv2.weight                     of shape: torch.Size([128, 128, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.bn2.weight                       of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.bn2.bias                         of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.bn2.running_mean                 of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.bn2.running_var                  of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,213 checkpoint.py: 662: Ignored layer: _feature_blocks.layer1.0.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,213 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.conv3.weight                     of shape: torch.Size([256, 128, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.bn3.weight                       of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.bn3.bias                         of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.bn3.running_mean                 of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.bn3.running_var                  of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 662: Ignored layer: _feature_blocks.layer1.0.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.downsample.0.weight              of shape: torch.Size([256, 64, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.downsample.1.weight              of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.downsample.1.bias                of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.downsample.1.running_mean        of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.0.downsample.1.running_var         of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 662: Ignored layer: _feature_blocks.layer1.0.downsample.1.num_batches_tracked
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.conv1.weight                     of shape: torch.Size([128, 256, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.bn1.weight                       of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.bn1.bias                         of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.bn1.running_mean                 of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.bn1.running_var                  of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 662: Ignored layer: _feature_blocks.layer1.1.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.conv2.weight                     of shape: torch.Size([128, 128, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.bn2.weight                       of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.bn2.bias                         of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.bn2.running_mean                 of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.bn2.running_var                  of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,214 checkpoint.py: 662: Ignored layer: _feature_blocks.layer1.1.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.conv3.weight                     of shape: torch.Size([256, 128, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.bn3.weight                       of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.bn3.bias                         of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.bn3.running_mean                 of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.1.bn3.running_var                  of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 662: Ignored layer: _feature_blocks.layer1.1.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.conv1.weight                     of shape: torch.Size([128, 256, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.bn1.weight                       of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.bn1.bias                         of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.bn1.running_mean                 of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.bn1.running_var                  of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 662: Ignored layer: _feature_blocks.layer1.2.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.conv2.weight                     of shape: torch.Size([128, 128, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.bn2.weight                       of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.bn2.bias                         of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.bn2.running_mean                 of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.bn2.running_var                  of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 662: Ignored layer: _feature_blocks.layer1.2.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.conv3.weight                     of shape: torch.Size([256, 128, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.bn3.weight                       of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.bn3.bias                         of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,215 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.bn3.running_mean                 of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer1.2.bn3.running_var                  of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 662: Ignored layer: _feature_blocks.layer1.2.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.conv1.weight                     of shape: torch.Size([256, 256, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.bn1.weight                       of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.bn1.bias                         of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.bn1.running_mean                 of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.bn1.running_var                  of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 662: Ignored layer: _feature_blocks.layer2.0.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.conv2.weight                     of shape: torch.Size([256, 256, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.bn2.weight                       of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.bn2.bias                         of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.bn2.running_mean                 of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.bn2.running_var                  of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 662: Ignored layer: _feature_blocks.layer2.0.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.conv3.weight                     of shape: torch.Size([512, 256, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.bn3.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.bn3.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.bn3.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.bn3.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 662: Ignored layer: _feature_blocks.layer2.0.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.downsample.0.weight              of shape: torch.Size([512, 256, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.downsample.1.weight              of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,216 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.downsample.1.bias                of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.downsample.1.running_mean        of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.0.downsample.1.running_var         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 662: Ignored layer: _feature_blocks.layer2.0.downsample.1.num_batches_tracked
INFO 2021-11-14 20:43:01,217 base_ssl_model.py: 474: Rank 1: Model state dict loaded!
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.conv1.weight                     of shape: torch.Size([256, 512, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.bn1.weight                       of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.bn1.bias                         of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.bn1.running_mean                 of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.bn1.running_var                  of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 662: Ignored layer: _feature_blocks.layer2.1.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.conv2.weight                     of shape: torch.Size([256, 256, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.bn2.weight                       of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.bn2.bias                         of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.bn2.running_mean                 of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.bn2.running_var                  of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 662: Ignored layer: _feature_blocks.layer2.1.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.conv3.weight                     of shape: torch.Size([512, 256, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.bn3.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.bn3.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.bn3.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.1.bn3.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 662: Ignored layer: _feature_blocks.layer2.1.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.conv1.weight                     of shape: torch.Size([256, 512, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,217 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.bn1.weight                       of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.bn1.bias                         of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.bn1.running_mean                 of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.bn1.running_var                  of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 662: Ignored layer: _feature_blocks.layer2.2.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.conv2.weight                     of shape: torch.Size([256, 256, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.bn2.weight                       of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.bn2.bias                         of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.bn2.running_mean                 of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.bn2.running_var                  of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 662: Ignored layer: _feature_blocks.layer2.2.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.conv3.weight                     of shape: torch.Size([512, 256, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.bn3.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.bn3.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.bn3.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.2.bn3.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 662: Ignored layer: _feature_blocks.layer2.2.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.conv1.weight                     of shape: torch.Size([256, 512, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.bn1.weight                       of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.bn1.bias                         of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.bn1.running_mean                 of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.bn1.running_var                  of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,218 checkpoint.py: 662: Ignored layer: _feature_blocks.layer2.3.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,218 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.conv2.weight                     of shape: torch.Size([256, 256, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.bn2.weight                       of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.bn2.bias                         of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.bn2.running_mean                 of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.bn2.running_var                  of shape: torch.Size([256]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 662: Ignored layer: _feature_blocks.layer2.3.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.conv3.weight                     of shape: torch.Size([512, 256, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.bn3.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.bn3.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.bn3.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer2.3.bn3.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 662: Ignored layer: _feature_blocks.layer2.3.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.conv1.weight                     of shape: torch.Size([512, 512, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.bn1.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.bn1.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.bn1.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.bn1.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.0.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.conv2.weight                     of shape: torch.Size([512, 512, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.bn2.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.bn2.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.bn2.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.bn2.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,219 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.0.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.conv3.weight                     of shape: torch.Size([1024, 512, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.bn3.weight                       of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.bn3.bias                         of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.bn3.running_mean                 of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.bn3.running_var                  of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.0.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.downsample.0.weight              of shape: torch.Size([1024, 512, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.downsample.1.weight              of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.downsample.1.bias                of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.downsample.1.running_mean        of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.0.downsample.1.running_var         of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.0.downsample.1.num_batches_tracked
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.conv1.weight                     of shape: torch.Size([512, 1024, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.bn1.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.bn1.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.bn1.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.bn1.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.1.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.conv2.weight                     of shape: torch.Size([512, 512, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.bn2.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.bn2.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,220 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.bn2.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.bn2.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.1.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.conv3.weight                     of shape: torch.Size([1024, 512, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.bn3.weight                       of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.bn3.bias                         of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.bn3.running_mean                 of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.1.bn3.running_var                  of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.1.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.conv1.weight                     of shape: torch.Size([512, 1024, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.bn1.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.bn1.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.bn1.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.bn1.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.2.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.conv2.weight                     of shape: torch.Size([512, 512, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.bn2.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.bn2.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.bn2.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.bn2.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.2.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.conv3.weight                     of shape: torch.Size([1024, 512, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.bn3.weight                       of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,221 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.bn3.bias                         of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.bn3.running_mean                 of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.2.bn3.running_var                  of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.2.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.conv1.weight                     of shape: torch.Size([512, 1024, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.bn1.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.bn1.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.bn1.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.bn1.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.3.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.conv2.weight                     of shape: torch.Size([512, 512, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.bn2.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.bn2.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.bn2.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.bn2.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.3.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.conv3.weight                     of shape: torch.Size([1024, 512, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.bn3.weight                       of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.bn3.bias                         of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.bn3.running_mean                 of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.3.bn3.running_var                  of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.3.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.conv1.weight                     of shape: torch.Size([512, 1024, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,222 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.bn1.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.bn1.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.bn1.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.bn1.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.4.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.conv2.weight                     of shape: torch.Size([512, 512, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.bn2.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.bn2.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.bn2.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.bn2.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.4.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.conv3.weight                     of shape: torch.Size([1024, 512, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.bn3.weight                       of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.bn3.bias                         of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.bn3.running_mean                 of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.4.bn3.running_var                  of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.4.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.conv1.weight                     of shape: torch.Size([512, 1024, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.bn1.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.bn1.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.bn1.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.bn1.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,223 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.5.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,223 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.conv2.weight                     of shape: torch.Size([512, 512, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.bn2.weight                       of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.bn2.bias                         of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.bn2.running_mean                 of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.bn2.running_var                  of shape: torch.Size([512]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.5.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.conv3.weight                     of shape: torch.Size([1024, 512, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.bn3.weight                       of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.bn3.bias                         of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.bn3.running_mean                 of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer3.5.bn3.running_var                  of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 662: Ignored layer: _feature_blocks.layer3.5.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.conv1.weight                     of shape: torch.Size([1024, 1024, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.bn1.weight                       of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.bn1.bias                         of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.bn1.running_mean                 of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.bn1.running_var                  of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 662: Ignored layer: _feature_blocks.layer4.0.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.conv2.weight                     of shape: torch.Size([1024, 1024, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.bn2.weight                       of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.bn2.bias                         of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.bn2.running_mean                 of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.bn2.running_var                  of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,224 checkpoint.py: 662: Ignored layer: _feature_blocks.layer4.0.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.conv3.weight                     of shape: torch.Size([2048, 1024, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.bn3.weight                       of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.bn3.bias                         of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.bn3.running_mean                 of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.bn3.running_var                  of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 662: Ignored layer: _feature_blocks.layer4.0.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.downsample.0.weight              of shape: torch.Size([2048, 1024, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.downsample.1.weight              of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.downsample.1.bias                of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.downsample.1.running_mean        of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.0.downsample.1.running_var         of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 662: Ignored layer: _feature_blocks.layer4.0.downsample.1.num_batches_tracked
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.conv1.weight                     of shape: torch.Size([1024, 2048, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.bn1.weight                       of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.bn1.bias                         of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.bn1.running_mean                 of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.bn1.running_var                  of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 662: Ignored layer: _feature_blocks.layer4.1.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.conv2.weight                     of shape: torch.Size([1024, 1024, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.bn2.weight                       of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.bn2.bias                         of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,225 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.bn2.running_mean                 of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.bn2.running_var                  of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 662: Ignored layer: _feature_blocks.layer4.1.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.conv3.weight                     of shape: torch.Size([2048, 1024, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.bn3.weight                       of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.bn3.bias                         of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.bn3.running_mean                 of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.1.bn3.running_var                  of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 662: Ignored layer: _feature_blocks.layer4.1.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.conv1.weight                     of shape: torch.Size([1024, 2048, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.bn1.weight                       of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.bn1.bias                         of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.bn1.running_mean                 of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.bn1.running_var                  of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 662: Ignored layer: _feature_blocks.layer4.2.bn1.num_batches_tracked
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.conv2.weight                     of shape: torch.Size([1024, 1024, 3, 3]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.bn2.weight                       of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.bn2.bias                         of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.bn2.running_mean                 of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.bn2.running_var                  of shape: torch.Size([1024]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 662: Ignored layer: _feature_blocks.layer4.2.bn2.num_batches_tracked
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.conv3.weight                     of shape: torch.Size([2048, 1024, 1, 1]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.bn3.weight                       of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,226 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.bn3.bias                         of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,227 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.bn3.running_mean                 of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,227 checkpoint.py: 678: Loaded: _feature_blocks.layer4.2.bn3.running_var                  of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,227 checkpoint.py: 662: Ignored layer: _feature_blocks.layer4.2.bn3.num_batches_tracked
INFO 2021-11-14 20:43:01,227 checkpoint.py: 678: Loaded: 0.clf.0.weight                                            of shape: torch.Size([2048, 2048]) from checkpoint
INFO 2021-11-14 20:43:01,227 checkpoint.py: 678: Loaded: 0.clf.0.bias                                              of shape: torch.Size([2048]) from checkpoint
INFO 2021-11-14 20:43:01,227 checkpoint.py: 678: Loaded: 1.clf.0.weight                                            of shape: torch.Size([128, 2048]) from checkpoint
INFO 2021-11-14 20:43:01,227 checkpoint.py: 678: Loaded: 1.clf.0.bias                                              of shape: torch.Size([128]) from checkpoint
INFO 2021-11-14 20:43:01,227 checkpoint.py: 690: Extra layers not loaded from checkpoint: []
INFO 2021-11-14 20:43:01,389 __init__.py: 101: Distributed Sampler config:
{'num_replicas': 16, 'rank': 14, 'epoch': 28, 'num_samples': 264490, 'total_size': 4231840, 'shuffle': True, 'seed': 0}
INFO 2021-11-14 20:43:01,411 __init__.py: 101: Distributed Sampler config:
{'num_replicas': 16, 'rank': 12, 'epoch': 28, 'num_samples': 264490, 'total_size': 4231840, 'shuffle': True, 'seed': 0}
INFO 2021-11-14 20:43:01,414 __init__.py: 101: Distributed Sampler config:
{'num_replicas': 16, 'rank': 15, 'epoch': 28, 'num_samples': 264490, 'total_size': 4231840, 'shuffle': True, 'seed': 0}
INFO 2021-11-14 20:43:01,424 __init__.py: 101: Distributed Sampler config:
{'num_replicas': 16, 'rank': 13, 'epoch': 28, 'num_samples': 264490, 'total_size': 4231840, 'shuffle': True, 'seed': 0}
INFO 2021-11-14 20:44:23,304 trainer_main.py: 268: Training 200 epochs
INFO 2021-11-14 20:44:23,306 trainer_main.py: 268: Training 200 epochs
INFO 2021-11-14 20:44:23,306 trainer_main.py: 269: One epoch = 2066 iterations.
INFO 2021-11-14 20:44:23,306 trainer_main.py: 269: One epoch = 2066 iterations.
INFO 2021-11-14 20:44:23,307 trainer_main.py: 270: Total 4231831 samples in one epoch
INFO 2021-11-14 20:44:23,307 trainer_main.py: 276: Total 413200 iterations for training
INFO 2021-11-14 20:44:23,307 trainer_main.py: 270: Total 4231831 samples in one epoch
INFO 2021-11-14 20:44:23,307 trainer_main.py: 276: Total 413200 iterations for training
INFO 2021-11-14 20:44:23,307 trainer_main.py: 175: Starting training....
INFO 2021-11-14 20:44:23,307 trainer_main.py: 175: Starting training....
INFO 2021-11-14 20:44:23,307 __init__.py: 101: Distributed Sampler config:
{'num_replicas': 16, 'rank': 13, 'epoch': 28, 'num_samples': 264490, 'total_size': 4231840, 'shuffle': True, 'seed': 0}
INFO 2021-11-14 20:44:23,307 __init__.py: 101: Distributed Sampler config:
{'num_replicas': 16, 'rank': 12, 'epoch': 28, 'num_samples': 264490, 'total_size': 4231840, 'shuffle': True, 'seed': 0}
INFO 2021-11-14 20:44:23,349 trainer_main.py: 268: Training 200 epochs
INFO 2021-11-14 20:44:23,349 trainer_main.py: 269: One epoch = 2066 iterations.
INFO 2021-11-14 20:44:23,349 trainer_main.py: 270: Total 4231831 samples in one epoch
INFO 2021-11-14 20:44:23,349 trainer_main.py: 276: Total 413200 iterations for training
INFO 2021-11-14 20:44:23,350 trainer_main.py: 175: Starting training....
INFO 2021-11-14 20:44:23,350 __init__.py: 101: Distributed Sampler config:
{'num_replicas': 16, 'rank': 15, 'epoch': 28, 'num_samples': 264490, 'total_size': 4231840, 'shuffle': True, 'seed': 0}
INFO 2021-11-14 20:44:23,375 trainer_main.py: 268: Training 200 epochs
INFO 2021-11-14 20:44:23,375 trainer_main.py: 269: One epoch = 2066 iterations.
INFO 2021-11-14 20:44:23,375 trainer_main.py: 270: Total 4231831 samples in one epoch
INFO 2021-11-14 20:44:23,375 trainer_main.py: 276: Total 413200 iterations for training
INFO 2021-11-14 20:44:23,375 trainer_main.py: 175: Starting training....
INFO 2021-11-14 20:44:23,375 __init__.py: 101: Distributed Sampler config:
{'num_replicas': 16, 'rank': 14, 'epoch': 28, 'num_samples': 264490, 'total_size': 4231840, 'shuffle': True, 'seed': 0}
submitit WARNING (2021-11-14 20:45:12,691) - Bypassing signal SIGCONT
submitit WARNING (2021-11-14 20:45:12,692) - Bypassing signal SIGTERM
submitit ERROR (2021-11-14 20:45:14,016) - Submitted job triggered an exception

And the .err log of one of the nodes, which is probably the most important:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/gpfsdswork/projects/rech/htc/xxx/workspace/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/gpfsdswork/projects/rech/htc/xxx/workspace/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/gpfsdswork/projects/rech/htc/xxx/workspace/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/gpfsdswork/projects/rech/htc/xxx/workspace/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/gpfsdswork/projects/rech/htc/xxx/workspace/vissl/vissl/trainer/trainer_main.py", line 178, in train
    self._advance_phase(task)  # advances task.phase_idx
  File "/gpfsdswork/projects/rech/htc/xxx/workspace/vissl/vissl/trainer/trainer_main.py", line 323, in _advance_phase
    train_phase_idx=task.train_phase_idx,
  File "/gpfsdswork/projects/rech/htc/xxx/workspace/vissl/vissl/trainer/train_task.py", line 564, in recreate_data_iterator
    self.data_iterator = iter(self.dataloaders[phase_type])
  File "/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/site-packages/classy_vision/dataset/dataloader_async_gpu_wrapper.py", line 39, in __iter__
    self._iter = iter(self.dataloader)
  File "/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
    w.start()
  File "/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/multiprocessing/context.py", line 291, in _Popen
    return Popen(process_obj)
  File "/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/multiprocessing/popen_forkserver.py", line 47, in _launch
    reduction.dump(process_obj, buf)
  File "/gpfswork/rech/htc/xxx/vissl_env/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
MemoryError
  1. please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.

Train a model such as resnet50 with mocov2 and get a checkpoint. Resume training from the checkpoint and look at the CPU memory usage.

Expected behavior:

I would expect resuming the training to take the same amount of memory as the initial training.

Environment:

4 GPUs of 32G, on 4 nodes with 180G of RAM. I did not run the environment command because I am using SLURM, which distributes the training across other machines.

iseessel commented 2 years ago

@CharlieCheckpt I will try to repro this and get back to you. As I'm sure you know, these memory issues are hard to debug, so as a short-term solution I would recommend requesting more RAM if possible.

Can you please send full environment information?

CUDA version, package version list, Python version, OS, etc.

prigoyal commented 2 years ago

You can also try 1) lowering the number of dataloader workers and 2) using MMAP when loading data.

CharlieCheckpt commented 2 years ago

Hello, thank you for your answers !

@prigoyal indeed, there were too many dataloader workers specified in the config used to resume the experiment. NUM_DATALOADER_WORKERS was 5 in my original experiment, but 10 when resuming it. Reducing the number of workers solved the problem.

Can you explain why this parameter has an impact on RAM usage ?

CharlieCheckpt commented 2 years ago

Actually, what I said above is wrong.

For another experiment, I still encounter the issue even though I had NUM_DATALOADER_WORKERS=3 both during the original script launch and when resuming the training. I tried to resume with NUM_DATALOADER_WORKERS=2 but still encountered the issue.

EDIT: I also looked at mmap, and MMAP_MODE is already set to True in my config.

EDIT 2: For this new experiment I have more images and more RAM: 43 million images, and 720GB of RAM on each machine (2 machines with 8 GPUs). I can't increase the memory any further. The large number of images is probably causing the OOM. Do you see any other way to reduce the memory consumption when resuming the training?

CharlieCheckpt commented 2 years ago

Hello! If this can be of any help, the line where the RAM increases dramatically (+500GB or more) is:

self.data_iterator = iter(self.dataloaders[phase_type])

located here.

I am still trying to understand what is going on with this iter(), and why there is no issue during the first training run, but there is one when resuming from a checkpoint.
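For context on what `iter()` does here: the traceback above ends in `popen_forkserver.py` calling `reduction.dump(process_obj, buf)`, which means each dataloader worker is started by pickling the worker process object, dataset included, into an in-memory buffer. The sketch below is only an illustration of that mechanism, not a confirmed diagnosis of this issue. It measures how the serialized size grows with the number of samples and with the number of workers. The path list and the helper names (`make_path_list`, `pickled_size_bytes`, `serialized_bytes_for_workers`) are hypothetical and are not part of VISSL or PyTorch.

```python
import io
import pickle
from multiprocessing.reduction import ForkingPickler


def make_path_list(num_images):
    # Hypothetical stand-in for the per-sample state a dataset object may hold
    # (for example, one file path per image).
    return [f"/data/tiles/img_{i:08d}.png" for i in range(num_images)]


def pickled_size_bytes(obj):
    # Serialize the object the same way the traceback does: a ForkingPickler
    # writing into an in-memory buffer.
    buf = io.BytesIO()
    ForkingPickler(buf, pickle.HIGHEST_PROTOCOL).dump(obj)
    return len(buf.getvalue())


def serialized_bytes_for_workers(obj, num_workers):
    # Each worker start pickles its own copy, so the serialized bytes scale
    # linearly with the worker count. The unpickled Python objects inside each
    # worker usually take more memory than this figure.
    return pickled_size_bytes(obj) * num_workers


if __name__ == "__main__":
    paths = make_path_list(100_000)
    for workers in (2, 5, 10):
        total_mb = serialized_bytes_for_workers(paths, workers) / 1e6
        print(f"{workers} workers: about {total_mb:.1f} MB serialized")
```

Under these assumptions the cost grows with both the dataset's in-memory state and the worker count, and it is paid once per training process on a node. This may be part of why lowering NUM_DATALOADER_WORKERS helped in one run, but it does not explain why the first run and the resumed run would differ.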

iseessel commented 2 years ago

@CharlieCheckpt Sorry for the delay here. Have you made any additional progress here?

Can you please send full environment information from:

wget -nc -q https://github.com/facebookresearch/vissl/raw/main/vissl/utils/collect_env.py && python collect_env.py

I've never had this problem before -- so I'm a bit worried that it may be a PyTorch issue specific to your env. Let's try to bisect the problem.

Can you also send the full configs from both the train and eval environment?

CharlieCheckpt commented 2 years ago

Hi @iseessel , thanks for following-up.

It is also difficult for me to debug these costly experiments, because I'm working on a server with limited credits.

What I can tell you for now is that I couldn't reproduce this RAM increase on a personal server (with fewer images: 1M instead of 43M). I ran other experiments recently and did not notice this behaviour anymore, so it is difficult to know what is going on.

Maybe we can close this issue and re-open if it happens again ?

iseessel commented 2 years ago

Yeah sounds good -- there are lots of potential complicated dynamics at play here and unless necessary, I wouldn't recommend going down a rabbit hole on this.