Error when using profiling: "RuntimeError: Expected debug info of type 2"

Instructions To Reproduce the 🐛 Bug:

what changes you made (git diff) or what code you wrote I opened this tutorial and in the cell that executes run_distributed_engines.py I added config.PROFILING.RUNTIME_PROFILING.USE_PROFILER=true to enable the profiler.
what exact command you run: After this change, I ran the notebook and got an error

what you observed (including full logs): You can open the colab notebook here and see all the outputs


** fvcore version of PathManager will be deprecated soon. **
** Please migrate to the version in iopath repo. **
https://github.com/facebookresearch/iopath

####### overrides: ['hydra.verbose=true', 'config=test/integration_test/quick_simclr.yaml', 'config.DATA.TRAIN.DATA_SOURCES=[synthetic]', 'config.CHECKPOINT.DIR=/content/checkpoints', 'config.HOOKS.TENSORBOARD_SETUP.USE_TENSORBOARD=true', 'config.PROFILING.RUNTIME_PROFILING.USE_PROFILER=true'] INFO 2022-02-02 10:40:14,840 distributed_launcher.py: 184: Spawning process for node_id: 0, local_rank: 0, dist_rank: 0, dist_run_id: localhost:43839 INFO 2022-02-02 10:40:14,841 train.py: 94: Env set for rank: 0, dist_rank: 0 INFO 2022-02-02 10:40:14,841 env.py: 50: CLICOLOR: 1 INFO 2022-02-02 10:40:14,841 env.py: 50: CLOUDSDK_CONFIG: /content/.config INFO 2022-02-02 10:40:14,841 env.py: 50: CLOUDSDK_PYTHON: python3 INFO 2022-02-02 10:40:14,841 env.py: 50: COLAB_GPU: 1 INFO 2022-02-02 10:40:14,841 env.py: 50: CUDA_VERSION: 11.1.1 INFO 2022-02-02 10:40:14,842 env.py: 50: CUDNN_VERSION: 8.0.5.39 INFO 2022-02-02 10:40:14,842 env.py: 50: DATALAB_SETTINGS_OVERRIDES: {"kernelManagerProxyPort":6000,"kernelManagerProxyHost":"172.28.0.3","jupyterArgs":["--ip=\"172.28.0.2\""],"debugAdapterMultiplexerPath":"/usr/local/bin/dap_multiplexer","enableLsp":true} INFO 2022-02-02 10:40:14,842 env.py: 50: DEBIAN_FRONTEND: noninteractive INFO 2022-02-02 10:40:14,842 env.py: 50: ENABLE_DIRECTORYPREFETCHER: 1 INFO 2022-02-02 10:40:14,842 env.py: 50: ENV: /root/.bashrc INFO 2022-02-02 10:40:14,842 env.py: 50: GCE_METADATA_TIMEOUT: 0 INFO 2022-02-02 10:40:14,842 env.py: 50: GCS_READ_CACHE_BLOCK_SIZE_MB: 16 INFO 2022-02-02 10:40:14,842 env.py: 50: GIT_PAGER: cat INFO 2022-02-02 10:40:14,842 env.py: 50: GLIBCPP_FORCE_NEW: 1 INFO 2022-02-02 10:40:14,843 env.py: 50: GLIBCXX_FORCE_NEW: 1 INFO 2022-02-02 10:40:14,843 env.py: 50: HOME: /root INFO 2022-02-02 10:40:14,843 env.py: 50: HOSTNAME: 315a685d3ad9 INFO 2022-02-02 10:40:14,843 env.py: 50: JPY_PARENT_PID: 61 INFO 2022-02-02 10:40:14,843 env.py: 50: LANG: en_US.UTF-8 INFO 2022-02-02 10:40:14,843 env.py: 50: LAST_FORCED_REBUILD: 20220106 INFO 2022-02-02 10:40:14,843 env.py: 50: LD_LIBRARY_PATH: /usr/lib64-nvidia INFO 2022-02-02 10:40:14,843 env.py: 50: LD_PRELOAD: /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 INFO 2022-02-02 10:40:14,843 env.py: 50: LIBRARY_PATH: /usr/local/cuda/lib64/stubs INFO 2022-02-02 10:40:14,843 env.py: 50: LOCAL_RANK: 0 INFO 2022-02-02 10:40:14,844 env.py: 50: MPLBACKEND: module://ipykernel.pylab.backend_inline INFO 2022-02-02 10:40:14,844 env.py: 50: NCCL_VERSION: 2.7.8 INFO 2022-02-02 10:40:14,844 env.py: 50: NO_GCE_CHECK: True INFO 2022-02-02 10:40:14,844 env.py: 50: NVIDIA_DRIVER_CAPABILITIES: compute,utility INFO 2022-02-02 10:40:14,844 env.py: 50: NVIDIA_REQUIRE_CUDA: cuda>=11.1 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 brand=tesla,driver>=450,driver<451 INFO 2022-02-02 10:40:14,844 env.py: 50: NVIDIA_VISIBLE_DEVICES: all INFO 2022-02-02 10:40:14,844 env.py: 50: OLDPWD: / INFO 2022-02-02 10:40:14,844 env.py: 50: PAGER: cat INFO 2022-02-02 10:40:14,844 env.py: 50: PATH: /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/opt/bin INFO 2022-02-02 10:40:14,845 env.py: 50: PWD: /content/vissl INFO 2022-02-02 10:40:14,845 env.py: 50: PYDEVD_USE_FRAME_EVAL: NO INFO 2022-02-02 10:40:14,845 env.py: 50: PYTHONPATH: /env/python INFO 2022-02-02 10:40:14,845 env.py: 50: PYTHONWARNINGS: ignore:::pip._internal.cli.base_command INFO 2022-02-02 10:40:14,845 env.py: 50: RANK: 0 INFO 2022-02-02 10:40:14,845 env.py: 50: SHELL: /bin/bash INFO 2022-02-02 10:40:14,845 env.py: 50: SHLVL: 1 INFO 2022-02-02 10:40:14,845 env.py: 50: TBE_CREDS_ADDR: 172.28.0.1:8008 INFO 2022-02-02 10:40:14,845 env.py: 50: TBE_EPHEM_CREDS_ADDR: 172.28.0.1:8009 INFO 2022-02-02 10:40:14,846 env.py: 50: TERM: xterm-color INFO 2022-02-02 10:40:14,846 env.py: 50: TF_FORCE_GPU_ALLOW_GROWTH: true INFO 2022-02-02 10:40:14,846 env.py: 50: USE_EPHEM: 1 INFO 2022-02-02 10:40:14,846 env.py: 50: WORLDSIZE: 1 INFO 2022-02-02 10:40:14,846 env.py: 50: : /usr/bin/python3 INFO 2022-02-02 10:40:14,846 env.py: 50: __EGL_VENDOR_LIBRARY_DIRS: /usr/lib64-nvidia:/usr/share/glvnd/egl_vendor.d/ INFO 2022-02-02 10:40:14,846 misc.py: 161: Set start method of multiprocessing to forkserver INFO 2022-02-02 10:40:14,846 train.py: 105: Setting seed.... INFO 2022-02-02 10:40:14,846 misc.py: 173: MACHINE SEED: 0 INFO 2022-02-02 10:40:14,852 hydra_config.py: 132: Training with config: INFO 2022-02-02 10:40:14,863 hydra_config.py: 141: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False, 'AUTO_RESUME': True, 'BACKEND': 'disk', 'CHECKPOINT_FREQUENCY': 1, 'CHECKPOINT_ITER_FREQUENCY': -1, 'DIR': '/content/checkpoints', 'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1, 'OVERWRITE_EXISTING': True, 'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False}, 'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss', 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'SEED': 0}, 'FEATURES': {'DATASET_NAME': '', 'DATA_PARTITION': 'TRAIN', 'DIMENSIONALITY_REDUCTION': 0, 'EXTRACT': False, 'LAYER_NAME': '', 'PATH': '.', 'TEST_PARTITION': 'TEST'}, 'NUM_CLUSTERS': 16000, 'NUM_ITER': 50, 'OUTPUT_DIR': '.'}, 'DATA': {'DDP_BUCKET_CAP_MB': 25, 'ENABLE_ASYNC_GPU_COPY': True, 'NUM_DATALOADER_WORKERS': 5, 'PIN_MEMORY': True, 'TEST': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 256, 'COLLATE_FUNCTION': 'default_collate', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['imagenet1k_folder'], 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': [], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}, 'TRAIN': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 32, 'COLLATE_FUNCTION': 'simclr_collator', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '/tmp/imagenet1k', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['imagenet1k_filelist'], 'DATA_LIMIT': 500, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': ['synthetic'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': True, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [{'name': 'ImgReplicatePil', 'num_times': 2}, {'name': 'RandomResizedCrop', 'size': 224}, {'name': 'RandomHorizontalFlip', 'p': 0.5}, {'name': 'ImgPilColorDistortion', 'strength': 1.0}, {'name': 'ImgPilGaussianBlur', 'p': 0.5, 'radius_max': 2.0, 'radius_min': 0.1}, {'name': 'ToTensor'}, {'mean': [0.485, 0.456, 0.406], 'name': 'Normalize', 'std': [0.229, 0.224, 0.225]}], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}}, 'DISTRIBUTED': {'BACKEND': 'nccl', 'BROADCAST_BUFFERS': True, 'INIT_METHOD': 'tcp', 'MANUAL_GRADIENT_REDUCTION': False, 'NCCL_DEBUG': False, 'NCCL_SOCKET_NTHREADS': '', 'NUM_NODES': 1, 'NUM_PROC_PER_NODE': 1, 'RUN_ID': 'auto'}, 'EXTRACT_FEATURES': {'CHUNK_THRESHOLD': 0, 'OUTPUT_DIR': ''}, 'HOOKS': {'CHECK_NAN': True, 'LOG_GPU_STATS': True, 'MEMORY_SUMMARY': {'DUMP_MEMORY_ON_EXCEPTION': False, 'LOG_ITERATION_NUM': 0, 'PRINT_MEMORY_SUMMARY': True}, 'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False, 'INPUT_SHAPE': [3, 224, 224]}, 'PERF_STATS': {'MONITOR_PERF_STATS': True, 'PERF_STAT_FREQUENCY': 10, 'ROLLING_BTIME_FREQ': 5}, 'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'tensorboard', 'FLUSH_EVERY_N_MIN': 5, 'LOG_DIR': '.', 'LOG_PARAMS': True, 'LOG_PARAMS_EVERY_N_ITERS': 310, 'LOG_PARAMS_GRADIENTS': True, 'USE_TENSORBOARD': True}}, 'IMG_RETRIEVAL': {'CROP_QUERY_ROI': False, 'DATASET_PATH': '', 'DEBUG_MODE': False, 'EVAL_BINARY_PATH': '', 'EVAL_DATASET_NAME': 'Paris', 'FEATS_PROCESSING_TYPE': '', 'GEM_POOL_POWER': 4.0, 'IMG_SCALINGS': [1], 'NORMALIZE_FEATURES': True, 'NUM_DATABASE_SAMPLES': -1, 'NUM_QUERY_SAMPLES': -1, 'NUM_TRAINING_SAMPLES': -1, 'N_PCA': 512, 'RESIZE_IMG': 1024, 'SAVE_FEATURES': False, 'SAVE_RETRIEVAL_RANKINGS_SCORES': True, 'SIMILARITY_MEASURE': 'cosine_similarity', 'SPATIAL_LEVELS': 3, 'TRAIN_DATASET_NAME': 'Oxford', 'TRAIN_PCA_WHITENING': True, 'USE_DISTRACTORS': False, 'WHITEN_IMG_LIST': ''}, 'LOG_FREQUENCY': 1, 'LOSS': {'CrossEntropyLoss': {'ignore_index': -1}, 'barlow_twins_loss': {'embeddingdim': 8192, 'lambda': 0.0051, 'scale_loss': 0.024}, 'bce_logits_multiple_output_single_target': {'normalize_output': False, 'reduction': 'none', 'world_size': 1}, 'cross_entropy_multiple_output_single_target': {'ignore_index': -1, 'normalize_output': False, 'reduction': 'mean', 'temperature': 1.0, 'weight': None}, 'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 256, 'DROP_LAST': True, 'kmeans_iters': 10, 'memory_params': {'crops_for_mb': [0], 'embedding_dim': 128}, 'num_clusters': [3000, 3000, 3000], 'num_crops': 2, 'num_train_samples': -1, 'temperature': 0.1}, 'dino_loss': {'crops_for_teacher': [0, 1], 'ema_center': 0.9, 'momentum': 0.996, 'normalize_last_layer': True, 'output_dim': 65536, 'student_temp': 0.1, 'teacher_temp_max': 0.07, 'teacher_temp_min': 0.04, 'teacher_temp_warmup_iters': 37500}, 'moco_loss': {'embedding_dim': 128, 'momentum': 0.999, 'queue_size': 65536, 'temperature': 0.2}, 'multicrop_simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'num_crops': 2, 'temperature': 0.1}, 'name': 'simclr_info_nce_loss', 'nce_loss_with_memory': {'loss_type': 'nce', 'loss_weights': [1.0], 'memory_params': {'embedding_dim': 128, 'memory_size': -1, 'momentum': 0.5, 'norm_init': True, 'update_mem_on_forward': True}, 'negative_sampling_params': {'num_negatives': 16000, 'type': 'random'}, 'norm_constant': -1, 'norm_embedding': True, 'num_train_samples': -1, 'temperature': 0.07, 'update_mem_with_emb_index': -100}, 'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 64, 'embedding_dim': 128, 'world_size': 1}, 'temperature': 0.1}, 'swav_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'output_dir': '.', 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temp_hard_assignment_iters': 0, 'temperature': 0.1, 'use_double_precision': False}, 'swav_momentum_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'momentum': 0.99, 'momentum_eval_mode_iter_start': 0, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temperature': 0.1, 'use_double_precision': False}}, 'MACHINE': {'DEVICE': 'gpu'}, 'METERS': {'accuracy_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'enable_training_meter': True, 'mean_ap_list_meter': {'max_cpu_capacity': -1, 'meter_names': [], 'num_classes': 9605, 'num_meters': 1}, 'model_output_mask': False, 'name': '', 'names': [], 'precision_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'recall_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}}, 'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2, 'USE_ACTIVATION_CHECKPOINTING': False}, 'AMP_PARAMS': {'AMP_ARGS': {'keep_batchnorm_fp32': True, 'loss_scale': 'dynamic', 'master_weights': True, 'opt_level': 'O3'}, 'AMP_TYPE': 'apex', 'USE_AMP': False}, 'BASE_MODEL_NAME': 'multi_input_output_model', 'CUDA_CACHE': {'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100}, 'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False, 'EVAL_TRUNK_AND_HEAD': False, 'EXTRACT_TRUNK_FEATURES_ONLY': False, 'FREEZE_TRUNK_AND_HEAD': False, 'FREEZE_TRUNK_ONLY': False, 'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [], 'SHOULD_FLATTEN_FEATS': True}, 'FSDP_CONFIG': {'AUTO_WRAP_THRESHOLD': 0, 'bucket_cap_mb': 0, 'clear_autocast_cache': True, 'compute_dtype': torch.float32, 'flatten_parameters': True, 'fp32_reduce_scatter': False, 'mixed_precision': True, 'verbose': True}, 'GRAD_CLIP': {'MAX_NORM': 1, 'NORM_TYPE': 2, 'USE_GRAD_CLIP': False}, 'HEAD': {'BATCHNORM_EPS': 1e-05, 'BATCHNORM_MOMENTUM': 0.1, 'PARAMS': [['mlp', {'dims': [2048, 2048], 'skip_last_layer_relu_bn': False, 'use_relu': True}], ['mlp', {'dims': [2048, 128]}]], 'PARAMS_MULTIPLIER': 1.0}, 'INPUT_TYPE': 'rgb', 'MULTI_INPUT_HEAD_MAPPING': [], 'NON_TRAINABLE_PARAMS': [], 'SHARDED_DDP_SETUP': {'USE_SDP': False, 'reduce_buffer_size': -1}, 'SINGLE_PASS_EVERY_CROP': False, 'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': True, 'GROUP_SIZE': -1, 'SYNC_BN_TYPE': 'pytorch'}, 'TEMP_FROZEN_PARAMS_ITER_MAP': [], 'TRUNK': {'CONVIT': {'CLASS_TOKEN_IN_LOCAL_LAYERS': False, 'LOCALITY_DIM': 10, 'LOCALITY_STRENGTH': 1.0, 'N_GPSA_LAYERS': 10, 'USE_LOCAL_INIT': True}, 'EFFICIENT_NETS': {}, 'NAME': 'resnet', 'REGNET': {}, 'RESNETS': {'DEPTH': 50, 'GROUPNORM_GROUPS': 32, 'GROUPS': 1, 'LAYER4_STRIDE': 2, 'NORM': 'BatchNorm', 'STANDARDIZE_CONVOLUTIONS': False, 'WIDTH_MULTIPLIER': 1, 'WIDTH_PER_GROUP': 64, 'ZERO_INIT_RESIDUAL': False}, 'VISION_TRANSFORMERS': {'ATTENTION_DROPOUT_RATE': 0, 'CLASSIFIER': 'token', 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0, 'HIDDEN_DIM': 768, 'IMAGE_SIZE': 224, 'MLP_DIM': 3072, 'NUM_HEADS': 12, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': False, 'QK_SCALE': False, 'name': None}, 'XCIT': {'ATTENTION_DROPOUT_RATE': 0, 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0.05, 'ETA': 1, 'HIDDEN_DIM': 384, 'IMAGE_SIZE': 224, 'NUM_HEADS': 8, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': True, 'QK_SCALE': False, 'TOKENS_NORM': True, 'name': None}}, 'WEIGHTS_INIT': {'APPEND_PREFIX': '', 'PARAMS_FILE': '', 'REMOVE_PREFIX': '', 'SKIP_LAYERS': ['num_batches_tracked'], 'STATE_DICT_KEY_NAME': 'classy_state_dict'}, '_MODEL_INIT_SEED': 0}, 'MONITORING': {'MONITOR_ACTIVATION_STATISTICS': 0}, 'MULTI_PROCESSING_METHOD': 'forkserver', 'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200}, 'OPTIMIZER': {'betas': [0.9, 0.999], 'construct_single_param_group_only': False, 'head_optimizer_params': {'use_different_lr': False, 'use_different_wd': False, 'weight_decay': 1e-06}, 'larc_config': {'clip': False, 'eps': 1e-08, 'trust_coefficient': 0.001}, 'momentum': 0.9, 'name': 'sgd', 'nesterov': False, 'non_regularized_parameters': [], 'num_epochs': 2, 'param_schedulers': {'lr': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': ['rescaled', 'rescaled'], 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'composite', 'schedulers': [{'end_value': 4.8, 'name': 'linear', 'start_value': 0.6}, {'end_value': 0.0048, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 4.8, 'wave_type': 'full'}], 'start_value': 0.1, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}, 'lr_head': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': ['rescaled', 'rescaled'], 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'composite', 'schedulers': [{'end_value': 4.8, 'name': 'linear', 'start_value': 0.6}, {'end_value': 0.0048, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 4.8, 'wave_type': 'full'}], 'start_value': 0.1, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}}, 'regularize_bias': True, 'regularize_bn': False, 'use_larc': True, 'use_zero': False, 'weight_decay': 1e-06}, 'PROFILING': {'MEMORY_PROFILING': {'TRACK_BY_LAYER_MEMORY': False}, 'NUM_ITERATIONS': 10, 'OUTPUT_FOLDER': '.', 'PROFILED_RANKS': [0, 1], 'RUNTIME_PROFILING': {'LEGACY_PROFILER': False, 'PROFILE_CPU': True, 'PROFILE_GPU': True, 'USE_PROFILER': True}, 'START_ITERATION': 0, 'STOP_TRAINING_AFTER_PROFILING': False, 'WARMUP_ITERATIONS': 0}, 'REPRODUCIBILITY': {'CUDDN_DETERMINISTIC': False}, 'SEED_VALUE': 0, 'SLURM': {'ADDITIONAL_PARAMETERS': {}, 'COMMENT': 'vissl job', 'CONSTRAINT': '', 'LOG_FOLDER': '.', 'MEM_GB': 250, 'NAME': 'vissl', 'NUM_CPU_PER_PROC': 8, 'PARTITION': '', 'PORT_ID': 40050, 'TIME_HOURS': 72, 'TIME_MINUTES': 0, 'USE_SLURM': False}, 'SVM': {'cls_list': [], 'costs': {'base': -1.0, 'costs_list': [0.1, 0.01], 'power_range': [4, 20]}, 'cross_val_folds': 3, 'dual': True, 'force_retrain': False, 'loss': 'squared_hinge', 'low_shot': {'dataset_name': 'voc', 'k_values': [1, 2, 4, 8, 16, 32, 64, 96], 'sample_inds': [1, 2, 3, 4, 5]}, 'max_iter': 2000, 'normalize': True, 'penalty': 'l2'}, 'TEST_EVERY_NUM_EPOCH': 1, 'TEST_MODEL': False, 'TEST_ONLY': False, 'TRAINER': {'TASK_NAME': 'self_supervision_task', 'TRAIN_STEP_NAME': 'standard_train_step'}, 'VERBOSE': False} INFO 2022-02-02 10:40:15,878 train.py: 117: System config:

sys.platform linux Python 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0] numpy 1.19.5 Pillow 7.1.2 vissl 0.1.6 @/content/vissl/vissl GPU available True GPU 0 Tesla K80 CUDA_HOME /usr/local/cuda torchvision 0.9.0+cu101 @/usr/local/lib/python3.7/dist-packages/torchvision hydra 1.0.7 @/usr/local/lib/python3.7/dist-packages/hydra classy_vision 0.7.0.dev @/usr/local/lib/python3.7/dist-packages/classy_vision tensorboard 2.7.0 apex 0.1 @/usr/local/lib/python3.7/dist-packages/apex cv2 4.1.2 PyTorch 1.8.0+cu101 @/usr/local/lib/python3.7/dist-packages/torch PyTorch debug build False

PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 10.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
CuDNN 7.6.3
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.1, CUDNN_VERSION=7.6.3, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

CPU info:

Architecture x86_64 CPU op-mode(s) 32-bit, 64-bit Byte Order Little Endian CPU(s) 2 On-line CPU(s) list 0,1 Thread(s) per core 2 Core(s) per socket 1 Socket(s) 1 NUMA node(s) 1 Vendor ID GenuineIntel CPU family 6 Model 79 Model name Intel(R) Xeon(R) CPU @ 2.20GHz Stepping 0 CPU MHz 2200.000 BogoMIPS 4400.00 Hypervisor vendor KVM Virtualization type full L1d cache 32K L1i cache 32K L2 cache 256K L3 cache 56320K NUMA node0 CPU(s) 0,1

INFO 2022-02-02 10:40:15,878 tensorboard.py: 49: Tensorboard dir: /content/checkpoints/tb_logs INFO 2022-02-02 10:40:20,647 tensorboard_hook.py: 90: Setting up SSL Tensorboard Hook... INFO 2022-02-02 10:40:20,648 tensorboard_hook.py: 103: Tensorboard config: log_params: True, log_params_freq: 310, log_params_gradients: True, log_activation_statistics: 0 /usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py:44: UserWarning: Profiler won't be using warmup, this can skew profiler results warn("Profiler won't be using warmup, this can skew profiler results") INFO 2022-02-02 10:40:20,648 trainer_main.py: 113: Using Distributed init method: tcp://localhost:43839, world_size: 1, rank: 0 INFO 2022-02-02 10:40:20,650 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 0 INFO 2022-02-02 10:40:20,650 trainer_main.py: 134: | initialized host 315a685d3ad9 as rank 0 (0) INFO 2022-02-02 10:40:22,885 train_task.py: 181: Not using Automatic Mixed Precision INFO 2022-02-02 10:40:22,886 train_task.py: 455: Building model.... INFO 2022-02-02 10:40:22,886 resnext.py: 68: ResNeXT trunk, supports activation checkpointing. Deactivated INFO 2022-02-02 10:40:22,886 resnext.py: 88: Building model: ResNeXt50-1x64d-w1-BatchNorm2d INFO 2022-02-02 10:40:23,693 model_helpers.py: 177: Using SyncBN group size: 1 INFO 2022-02-02 10:40:23,693 model_helpers.py: 192: Converting BN layers to PyTorch SyncBN INFO 2022-02-02 10:40:23,694 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 0 INFO 2022-02-02 10:40:23,703 train_task.py: 657: Broadcast model BN buffers from primary on every forward pass INFO 2022-02-02 10:40:23,703 classification_task.py: 387: Synchronized Batch Normalization is disabled INFO 2022-02-02 10:40:23,750 optimizer_helper.py: 294: Trainable params: 163, Non-Trainable params: 0, Trunk Regularized Parameters: 53, Trunk Unregularized Parameters 106, Head Regularized Parameters: 4, Head Unregularized Parameters: 0 Remaining Regularized Parameters: 0 Remaining Unregularized Parameters: 0 INFO 2022-02-02 10:40:23,751 img_replicate_pil.py: 52: ImgReplicatePil | Using num_times: 2 INFO 2022-02-02 10:40:23,751 img_pil_color_distortion.py: 56: ImgPilColorDistortion | Using strength: 1.0 INFO 2022-02-02 10:40:23,752 ssl_dataset.py: 157: Rank: 0 split: TRAIN Data files: [''] INFO 2022-02-02 10:40:23,752 ssl_dataset.py: 160: Rank: 0 split: TRAIN Label files: [] INFO 2022-02-02 10:40:23,752 misc.py: 161: Set start method of multiprocessing to forkserver INFO 2022-02-02 10:40:23,752 init.py: 126: Created the Distributed Sampler.... INFO 2022-02-02 10:40:23,752 init.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 500, 'total_size': 500, 'shuffle': True, 'seed': 0} /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 5 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary. cpuset_checked)) INFO 2022-02-02 10:40:23,753 init.py: 215: Wrapping the dataloader to async device copies INFO 2022-02-02 10:40:23,753 train_task.py: 384: Building loss... INFO 2022-02-02 10:40:23,764 simclr_info_nce_loss.py: 91: Creating Info-NCE loss on Rank: 0 INFO 2022-02-02 10:40:23,764 trainer_main.py: 268: Training 2 epochs INFO 2022-02-02 10:40:23,764 trainer_main.py: 269: One epoch = 15 iterations. INFO 2022-02-02 10:40:23,765 trainer_main.py: 270: Total 500 samples in one epoch INFO 2022-02-02 10:40:23,765 trainer_main.py: 276: Total 30 iterations for training INFO 2022-02-02 10:40:23,863 logger.py: 84: Wed Feb 2 10:40:23 2022
+-----------------------------------------------------------------------------+ | NVIDIA-SMI 495.46 Driver Version: 460.32.03 CUDA Version: 11.2 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 | | N/A 57C P0 73W / 149W | 564MiB / 11441MiB | 9% Default | | | | N/A | +-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+

INFO 2022-02-02 10:40:23,863 profiler.py: 58: Entering profiler... INFO 2022-02-02 10:40:24,203 trainer_main.py: 173: Model is: Classy <class 'vissl.models.base_ssl_model.BaseSSLMultiInputOutputModel'>: BaseSSLMultiInputOutputModel( (_heads): ModuleDict() (trunk): ResNeXt( (_feature_blocks): ModuleDict( (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv1_relu): ReLU(inplace=True) (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False) (layer1): Sequential( (0): Bottleneck( (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer2): Sequential( (0): Bottleneck( (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (3): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer3): Sequential( (0): Bottleneck( (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (3): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (4): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (5): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer4): Sequential( (0): Bottleneck( (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(<SUPPORTED_L4_STRIDE.two: 2>, <SUPPORTED_L4_STRIDE.two: 2>), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(<SUPPORTED_L4_STRIDE.two: 2>, <SUPPORTED_L4_STRIDE.two: 2>), bias=False) (1): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (avgpool): AdaptiveAvgPool2d(output_size=(1, 1)) (flatten): Flatten() ) ) (heads): ModuleList( (0): MLP( (clf): Sequential( (0): Linear(in_features=2048, out_features=2048, bias=True) (1): ReLU(inplace=True) ) ) (1): MLP( (clf): Sequential( (0): Linear(in_features=2048, out_features=128, bias=True) ) ) ) ) INFO 2022-02-02 10:40:24,206 trainer_main.py: 174: Loss is: { 'info_average': { 'dist_rank': 0, 'name': 'SimclrInfoNCECriterion', 'num_negatives': 62, 'num_pos': 2, 'temperature': 0.1}, 'name': 'SimclrInfoNCELoss'} INFO 2022-02-02 10:40:24,206 trainer_main.py: 175: Starting training.... INFO 2022-02-02 10:40:24,206 init.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 500, 'total_size': 500, 'shuffle': True, 'seed': 0} /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 5 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary. cpuset_checked)) fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath

fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath

INFO 2022-02-02 10:40:31,986 trainer_main.py: 333: Phase advanced. Rank: 0 INFO 2022-02-02 10:40:31,988 log_hooks.py: 77: ========= Memory Summary at on_phase_start =======	===========================================================================		PyTorch CUDA memory summary, device ID 0
CUDA OOMs: 0	cudaMalloc retries: 0
===========================================================================
Metric	Cur Usage	Peak Usage	Tot Alloc	Tot Freed
---------------------------------------------------------------------------
Allocated memory	147151 KB	147151 KB	147152 KB	512 B
from large pool	128256 KB	128256 KB	128256 KB	0 B
from small pool	18895 KB	18895 KB	18896 KB	512 B
---------------------------------------------------------------------------
Active memory	147151 KB	147151 KB	147152 KB	512 B
from large pool	128256 KB	128256 KB	128256 KB	0 B
from small pool	18895 KB	18895 KB	18896 KB	512 B
---------------------------------------------------------------------------
GPU reserved memory	163840 KB	163840 KB	163840 KB	0 B
from large pool	141312 KB	141312 KB	141312 KB	0 B
from small pool	22528 KB	22528 KB	22528 KB	0 B
---------------------------------------------------------------------------
Non-releasable memory	16688 KB	28795 KB	92570 KB	75881 KB
from large pool	13056 KB	28160 KB	75776 KB	62720 KB
from small pool	3632 KB	3633 KB	16794 KB	13161 KB
---------------------------------------------------------------------------
Allocations	328	328	329	1
from large pool	19	19	19	0
from small pool	309	309	310	1
---------------------------------------------------------------------------
Active allocs	328	328	329	1
from large pool	19	19	19	0
from small pool	309	309	310	1
---------------------------------------------------------------------------
GPU reserved segments	17	17	17	0
from large pool	6	6	6	0
from small pool	11	11	11	0
---------------------------------------------------------------------------
Non-releasable allocs	9	9	18	9
from large pool	5	5	6	1
from small pool	4	5	12	8
===========================================================================

INFO 2022-02-02 10:40:37,293 state_update_hooks.py: 115: Starting phase 0 [train] INFO 2022-02-02 10:40:42,863 log_hooks.py: 77: ========= Memory Summary at on_forward =======	===========================================================================		PyTorch CUDA memory summary, device ID 0
CUDA OOMs: 0	cudaMalloc retries: 0
===========================================================================
Metric	Cur Usage	Peak Usage	Tot Alloc	Tot Freed
---------------------------------------------------------------------------
Allocated memory	5390 MB	7844 MB	25886 MB	20496 MB
from large pool	5370 MB	7825 MB	25866 MB	20496 MB
from small pool	19 MB	19 MB	19 MB	0 MB
---------------------------------------------------------------------------
Active memory	5390 MB	7844 MB	25886 MB	20496 MB
from large pool	5370 MB	7825 MB	25866 MB	20496 MB
from small pool	19 MB	19 MB	19 MB	0 MB
---------------------------------------------------------------------------
GPU reserved memory	8184 MB	9330 MB	15628 MB	7444 MB
from large pool	8162 MB	9308 MB	15606 MB	7444 MB
from small pool	22 MB	22 MB	22 MB	0 MB
---------------------------------------------------------------------------
Non-releasable memory	202300 KB	1563 MB	10613 MB	10415 MB
from large pool	199936 KB	1559 MB	10597 MB	10401 MB
from small pool	2364 KB	3 MB	16 MB	14 MB
---------------------------------------------------------------------------
Allocations	545	545	662	117
from large pool	124	124	164	40
from small pool	421	421	498	77
---------------------------------------------------------------------------
Active allocs	545	545	662	117
from large pool	124	124	164	40
from small pool	421	421	498	77
---------------------------------------------------------------------------
GPU reserved segments	32	33	37	5
from large pool	21	22	26	5
from small pool	11	11	11	0
---------------------------------------------------------------------------
Non-releasable allocs	18	20	96	78
from large pool	14	15	30	16
from small pool	4	5	66	62
===========================================================================

INFO 2022-02-02 10:40:52,003 log_hooks.py: 77: ========= Memory Summary at on_backward =======	===========================================================================		PyTorch CUDA memory summary, device ID 0
CUDA OOMs: 0	cudaMalloc retries: 0
===========================================================================
Metric	Cur Usage	Peak Usage	Tot Alloc	Tot Freed
---------------------------------------------------------------------------
Allocated memory	299186 KB	8008 MB	69912 MB	69620 MB
from large pool	261632 KB	7988 MB	69872 MB	69616 MB
from small pool	37554 KB	36 MB	40 MB	3 MB
---------------------------------------------------------------------------
Active memory	299186 KB	8008 MB	69912 MB	69620 MB
from large pool	261632 KB	7988 MB	69872 MB	69616 MB
from small pool	37554 KB	36 MB	40 MB	3 MB
---------------------------------------------------------------------------
GPU reserved memory	3226 MB	9330 MB	21396 MB	18170 MB
from large pool	3186 MB	9308 MB	21356 MB	18170 MB
from small pool	40 MB	40 MB	40 MB	0 MB
---------------------------------------------------------------------------
Non-releasable memory	2933 MB	3791 MB	40967 MB	38033 MB
from large pool	2930 MB	3787 MB	40936 MB	38006 MB
from small pool	3 MB	4 MB	30 MB	27 MB
---------------------------------------------------------------------------
Allocations	498	560	1146	648
from large pool	38	128	417	379
from small pool	460	462	729	269
---------------------------------------------------------------------------
Active allocs	498	560	1146	648
from large pool	38	128	417	379
from small pool	460	462	729	269
---------------------------------------------------------------------------
GPU reserved segments	31	38	51	20
from large pool	11	22	31	20
from small pool	20	20	20	0
---------------------------------------------------------------------------
Non-releasable allocs	26	29	344	318
from large pool	13	20	162	149
from small pool	13	13	182	169
===========================================================================

INFO 2022-02-02 10:40:52,216 log_hooks.py: 77: ========= Memory Summary at on_update =======	===========================================================================		PyTorch CUDA memory summary, device ID 0
CUDA OOMs: 0	cudaMalloc retries: 0
===========================================================================
Metric	Cur Usage	Peak Usage	Tot Alloc	Tot Freed
---------------------------------------------------------------------------
Allocated memory	410227 KB	8008 MB	70128 MB	69727 MB
from large pool	354048 KB	7988 MB	70051 MB	69705 MB
from small pool	56179 KB	54 MB	77 MB	22 MB
---------------------------------------------------------------------------
Active memory	410227 KB	8008 MB	70128 MB	69727 MB
from large pool	354048 KB	7988 MB	70051 MB	69705 MB
from small pool	56179 KB	54 MB	77 MB	22 MB
---------------------------------------------------------------------------
GPU reserved memory	3244 MB	9330 MB	21414 MB	18170 MB
from large pool	3186 MB	9308 MB	21356 MB	18170 MB
from small pool	58 MB	58 MB	58 MB	0 MB
---------------------------------------------------------------------------
Non-releasable memory	2843 MB	3791 MB	41085 MB	38241 MB
from large pool	2840 MB	3787 MB	41025 MB	38184 MB
from small pool	3 MB	4 MB	60 MB	57 MB
---------------------------------------------------------------------------
Allocations	661	664	3038	2377
from large pool	56	128	453	397
from small pool	605	608	2585	1980
---------------------------------------------------------------------------
Active allocs	661	664	3038	2377
from large pool	56	128	453	397
from small pool	605	608	2585	1980
---------------------------------------------------------------------------
GPU reserved segments	40	40	60	20
from large pool	11	22	31	20
from small pool	29	29	29	0
---------------------------------------------------------------------------
Non-releasable allocs	17	29	1794	1777
from large pool	11	20	162	151
from small pool	6	16	1632	1626
===========================================================================

INFO 2022-02-02 10:40:52,216 tensorboard_hook.py: 236: Logging Parameter gradients. Iteration 0 INFO 2022-02-02 10:40:55,701 tensorboard_hook.py: 255: Logging metrics. Iteration 0 INFO 2022-02-02 10:40:55,706 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 0; lr: 0.6; loss: 4.14865; btime(ms): 0; eta: 0:00:00; peak_mem(M): 8008; INFO 2022-02-02 10:40:57,472 tensorboard_hook.py: 255: Logging metrics. Iteration 1 INFO 2022-02-02 10:40:57,475 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 1; lr: 2.0; loss: 4.28503; btime(ms): 31942; eta: 0:15:26; peak_mem(M): 8008; max_iterations: 30; INFO 2022-02-02 10:40:59,249 tensorboard_hook.py: 255: Logging metrics. Iteration 2 INFO 2022-02-02 10:40:59,251 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 2; lr: 3.4; loss: 4.20582; btime(ms): 16855; eta: 0:07:51; peak_mem(M): 8008; INFO 2022-02-02 10:41:01,009 tensorboard_hook.py: 255: Logging metrics. Iteration 3 INFO 2022-02-02 10:41:01,011 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 3; lr: 3.59685; loss: 4.2537; btime(ms): 11829; eta: 0:05:19; peak_mem(M): 8008; INFO 2022-02-02 10:41:02,793 tensorboard_hook.py: 255: Logging metrics. Iteration 4 INFO 2022-02-02 10:41:02,795 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 4; lr: 3.48896; loss: 4.14138; btime(ms): 9311; eta: 0:04:02; peak_mem(M): 8008; INFO 2022-02-02 10:41:04,538 tensorboard_hook.py: 255: Logging metrics. Iteration 5 INFO 2022-02-02 10:41:04,540 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 5; lr: 3.17827; loss: 4.15546; btime(ms): 7806; eta: 0:03:15; peak_mem(M): 8008; btime(5iters)(ms): 7806; rolling_eta: 0:03:15; INFO 2022-02-02 10:41:06,290 tensorboard_hook.py: 255: Logging metrics. Iteration 6 INFO 2022-02-02 10:41:06,293 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 6; lr: 2.70209; loss: 4.15095; btime(ms): 6796; eta: 0:02:43; peak_mem(M): 8008; btime(5iters)(ms): 1766; rolling_eta: 0:00:42; INFO 2022-02-02 10:41:08,043 tensorboard_hook.py: 255: Logging metrics. Iteration 7 INFO 2022-02-02 10:41:08,047 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 7; lr: 2.11763; loss: 4.14471; btime(ms): 6075; eta: 0:02:19; peak_mem(M): 8008; btime(5iters)(ms): 1763; rolling_eta: 0:00:40; INFO 2022-02-02 10:41:09,842 tensorboard_hook.py: 255: Logging metrics. Iteration 8 INFO 2022-02-02 10:41:09,844 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 8; lr: 1.49511; loss: 4.14161; btime(ms): 5535; eta: 0:02:01; peak_mem(M): 8008; btime(5iters)(ms): 1759; rolling_eta: 0:00:38; INFO 2022-02-02 10:41:10,360 log_hooks.py: 568: Average train batch time (ms) for 10 batches: 3837 INFO 2022-02-02 10:41:10,361 log_hooks.py: 577: Train step time breakdown (rank 0): Timer Host CudaEvent read_sample: 0.87 ms 0.90 ms forward: 590.73 ms 1017.52 ms loss_compute: 1.87 ms 1.85 ms loss_all_reduce: 0.18 ms 0.19 ms backward: 709.24 ms 1630.28 ms optimizer_step: 1080.57 ms 160.52 ms train_step_total: 3616.21 ms 3616.25 ms INFO 2022-02-02 10:41:11,583 tensorboard_hook.py: 255: Logging metrics. Iteration 9 INFO 2022-02-02 10:45:47,389 profiler.py: 63: Exiting profiler... INFO 2022-02-02 10:45:47,390 profiler.py: 47: Exporting cpu_time to: /content/vissl/cpu_time_rank0.txt INFO 2022-02-02 10:45:47,391 profiler.py: 47: Exporting cuda_time to: /content/vissl/cuda_time_rank0.txt INFO 2022-02-02 10:45:47,391 profiler.py: 47: Exporting cuda_memory_usage to: /content/vissl/cuda_memory_usage_rank0.txt INFO 2022-02-02 10:45:47,392 profiler.py: 54: Exporting profiling chrome traces to: /content/vissl/profiler_chrome_trace_rank0.json INFO 2022-02-02 10:45:47,397 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 9; lr: 0.90932; loss: 4.14621; btime(ms): 5120; eta: 0:01:47; peak_mem(M): 8008; btime(5iters)(ms): 1766; rolling_eta: 0:00:37; INFO 2022-02-02 10:45:50,226 tensorboard_hook.py: 255: Logging metrics. Iteration 10 INFO 2022-02-02 10:45:50,232 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 10; lr: 0.43064; loss: 4.14273; btime(ms): 32363; eta: 0:10:47; peak_mem(M): 8008; btime(5iters)(ms): 56920; rolling_eta: 0:18:58; INFO 2022-02-02 10:45:51,921 tensorboard_hook.py: 255: Logging metrics. Iteration 11 INFO 2022-02-02 10:45:51,925 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 11; lr: 0.11656; loss: 4.13945; btime(ms): 29678; eta: 0:09:23; peak_mem(M): 8008; btime(5iters)(ms): 57138; rolling_eta: 0:18:05; INFO 2022-02-02 10:45:53,609 tensorboard_hook.py: 255: Logging metrics. Iteration 12 INFO 2022-02-02 10:45:53,614 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 12; lr: 0.00484; loss: 4.15122; btime(ms): 27346; eta: 0:08:12; peak_mem(M): 8008; btime(5iters)(ms): 57126; rolling_eta: 0:17:08; INFO 2022-02-02 10:45:55,288 tensorboard_hook.py: 255: Logging metrics. Iteration 13 INFO 2022-02-02 10:45:55,292 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 13; lr: 0.03928; loss: 4.14357; btime(ms): 25373; eta: 0:07:11; peak_mem(M): 8008; btime(5iters)(ms): 57113; rolling_eta: 0:16:10; INFO 2022-02-02 10:45:57,033 tensorboard_hook.py: 255: Logging metrics. Iteration 14 INFO 2022-02-02 10:45:57,038 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 14; lr: 0.1411; loss: 4.14553; btime(ms): 23680; eta: 0:06:18; peak_mem(M): 8008; btime(5iters)(ms): 57089; rolling_eta: 0:15:13; INFO 2022-02-02 10:45:57,039 trainer_main.py: 214: Meters synced INFO 2022-02-02 10:45:57,039 log_hooks.py: 568: Average train batch time (ms) for 15 batches: 21670 INFO 2022-02-02 10:45:57,039 log_hooks.py: 577: Train step time breakdown (rank 0): Timer Host CudaEvent read_sample: 4.65 ms 4.69 ms forward: 477.04 ms 925.43 ms loss_compute: 1.63 ms 1.61 ms loss_all_reduce: 0.16 ms 0.17 ms backward: 437.75 ms 1408.87 ms optimizer_step: 1106.35 ms 136.86 ms train_step_total: 21315.80 ms 21316.05 ms /usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:263: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats. FutureWarning) /usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:289: FutureWarning: torch.cuda.reset_max_memory_cached now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats. FutureWarning) INFO 2022-02-02 10:46:04,016 io.py: 63: Saving data to file: /content/checkpoints/metrics.json INFO 2022-02-02 10:46:04,017 io.py: 89: Saved data to file: /content/checkpoints/metrics.json INFO 2022-02-02 10:46:04,017 log_hooks.py: 426: [phase: 0] Saving checkpoint to /content/checkpoints INFO 2022-02-02 10:46:04,612 checkpoint.py: 131: Saved checkpoint: /content/checkpoints/model_phase0.torch INFO 2022-02-02 10:46:04,612 checkpoint.py: 140: Creating symlink... INFO 2022-02-02 10:46:04,613 checkpoint.py: 144: Created symlink: /content/checkpoints/checkpoint.torch INFO 2022-02-02 10:46:04,613 init.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 1, 'num_samples': 500, 'total_size': 500, 'shuffle': True, 'seed': 0} fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath

fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath

INFO 2022-02-02 10:46:13,134 trainer_main.py: 333: Phase advanced. Rank: 0 INFO 2022-02-02 10:46:13,135 state_update_hooks.py: 115: Starting phase 1 [train] INFO 2022-02-02 10:46:15,175 tensorboard_hook.py: 255: Logging metrics. Iteration 15 INFO 2022-02-02 10:46:15,184 log_hooks.py: 277: Rank: 0; [ep: 1] iter: 15; lr: 0.29803; loss: 4.14155; btime(ms): 22218; eta: 0:05:33; peak_mem(M): 5676; btime(5iters)(ms): 1928; rolling_eta: 0:00:28; INFO 2022-02-02 10:46:16,944 tensorboard_hook.py: 255: Logging metrics. Iteration 16 INFO 2022-02-02 10:46:16,949 log_hooks.py: 277: Rank: 0; [ep: 1] iter: 16; lr: 0.49122; loss: 4.14258; btime(ms): 21963; eta: 0:05:07; peak_mem(M): 5676; btime(5iters)(ms): 4990; rolling_eta: 0:01:09; INFO 2022-02-02 10:46:18,640 tensorboard_hook.py: 255: Logging metrics. Iteration 17 INFO 2022-02-02 10:46:18,645 log_hooks.py: 277: Rank: 0; [ep: 1] iter: 17; lr: 0.69747; loss: 4.14387; btime(ms): 20775; eta: 0:04:30; peak_mem(M): 5676; btime(5iters)(ms): 5004; rolling_eta: 0:01:05; INFO 2022-02-02 10:46:20,348 tensorboard_hook.py: 255: Logging metrics. Iteration 18 INFO 2022-02-02 10:46:20,353 log_hooks.py: 277: Rank: 0; [ep: 1] iter: 18; lr: 0.89198; loss: 4.1382; btime(ms): 19715; eta: 0:03:56; peak_mem(M): 5676; btime(5iters)(ms): 5006; rolling_eta: 0:01:00; INFO 2022-02-02 10:46:22,029 tensorboard_hook.py: 255: Logging metrics. Iteration 19 --- Logging error --- Traceback (most recent call last): File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed hook_generator=hook_generator, File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker hook_generator=hook_generator, File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine hook_generator=hook_generator, File "/content/vissl/vissl/engines/train.py", line 46, in run_engine hook_generator=hook_generator, File "/content/vissl/vissl/engines/train.py", line 130, in train_main trainer.train() File "/content/vissl/vissl/trainer/trainer_main.py", line 211, in train raise e File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train task = train_step_fn(task) File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 228, in standard_train_step task.run_hooks(SSLClassyHookFunctions.on_update.name) File "/content/vissl/vissl/trainer/train_task.py", line 669, in run_hooks getattr(hook, hook_function_name, ClassyHook._noop)(self) File "/content/vissl/vissl/hooks/profiling_hook.py", line 113, in on_update self._runtime_tracking(iteration, task) File "/content/vissl/vissl/hooks/profiling_hook.py", line 124, in _runtime_tracking self.runtime_profiler.step() File "/content/vissl/vissl/utils/profiler.py", line 32, in step self.profiler.step() File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 282, in step self._stop_trace() File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 372, in _stop_trace self.profiler.exit(None, None, None) File "/usr/local/lib/python3.7/dist-packages/torch/autograd/profiler.py", line 500, in exit self.kineto_results = torch.autograd._disable_profiler() RuntimeError: Expected debug info of type 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.7/logging/init.py", line 1025, in emit msg = self.format(record) File "/usr/lib/python3.7/logging/init.py", line 869, in format return fmt.format(record) File "/usr/lib/python3.7/logging/init.py", line 608, in format record.message = record.getMessage() File "/usr/lib/python3.7/logging/init.py", line 369, in getMessage msg = msg % self.args TypeError: not all arguments converted during string formatting Call stack: File "tools/run_distributed_engines.py", line 57, in hydra_main(overrides=overrides) File "tools/run_distributed_engines.py", line 46, in hydra_main hook_generator=default_hook_generator, File "/content/vissl/vissl/utils/distributed_launcher.py", line 162, in launch_distributed logging.error("Wrapping up, caught exception: ", e) Message: 'Wrapping up, caught exception: ' Arguments: (RuntimeError('Expected debug info of type 2'),) --- Logging error --- Traceback (most recent call last): File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed hook_generator=hook_generator, File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker hook_generator=hook_generator, File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine hook_generator=hook_generator, File "/content/vissl/vissl/engines/train.py", line 46, in run_engine hook_generator=hook_generator, File "/content/vissl/vissl/engines/train.py", line 130, in train_main trainer.train() File "/content/vissl/vissl/trainer/trainer_main.py", line 211, in train raise e File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train task = train_step_fn(task) File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 228, in standard_train_step task.run_hooks(SSLClassyHookFunctions.on_update.name) File "/content/vissl/vissl/trainer/train_task.py", line 669, in run_hooks getattr(hook, hook_function_name, ClassyHook._noop)(self) File "/content/vissl/vissl/hooks/profiling_hook.py", line 113, in on_update self._runtime_tracking(iteration, task) File "/content/vissl/vissl/hooks/profiling_hook.py", line 124, in _runtime_tracking self.runtime_profiler.step() File "/content/vissl/vissl/utils/profiler.py", line 32, in step self.profiler.step() File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 282, in step self._stop_trace() File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 372, in _stop_trace self.profiler.exit(None, None, None) File "/usr/local/lib/python3.7/dist-packages/torch/autograd/profiler.py", line 500, in exit self.kineto_results = torch.autograd._disable_profiler() RuntimeError: Expected debug info of type 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.7/logging/init.py", line 1025, in emit msg = self.format(record) File "/usr/lib/python3.7/logging/init.py", line 869, in format return fmt.format(record) File "/usr/lib/python3.7/logging/init.py", line 608, in format record.message = record.getMessage() File "/usr/lib/python3.7/logging/init.py", line 369, in getMessage msg = msg % self.args TypeError: not all arguments converted during string formatting Call stack: File "tools/run_distributed_engines.py", line 57, in hydra_main(overrides=overrides) File "tools/run_distributed_engines.py", line 46, in hydra_main hook_generator=default_hook_generator, File "/content/vissl/vissl/utils/distributed_launcher.py", line 162, in launch_distributed logging.error("Wrapping up, caught exception: ", e) Message: 'Wrapping up, caught exception: ' Arguments: (RuntimeError('Expected debug info of type 2'),) Traceback (most recent call last): File "tools/run_distributed_engines.py", line 57, in hydra_main(overrides=overrides) File "tools/run_distributed_engines.py", line 46, in hydra_main hook_generator=default_hook_generator, File "/content/vissl/vissl/utils/distributed_launcher.py", line 164, in launch_distributed raise e File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed hook_generator=hook_generator, File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker hook_generator=hook_generator, File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine hook_generator=hook_generator, File "/content/vissl/vissl/engines/train.py", line 46, in run_engine hook_generator=hook_generator, File "/content/vissl/vissl/engines/train.py", line 130, in train_main trainer.train() File "/content/vissl/vissl/trainer/trainer_main.py", line 211, in train raise e File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train task = train_step_fn(task) File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 228, in standard_train_step task.run_hooks(SSLClassyHookFunctions.on_update.name) File "/content/vissl/vissl/trainer/train_task.py", line 669, in run_hooks getattr(hook, hook_function_name, ClassyHook._noop)(self) File "/content/vissl/vissl/hooks/profiling_hook.py", line 113, in on_update self._runtime_tracking(iteration, task) File "/content/vissl/vissl/hooks/profiling_hook.py", line 124, in _runtime_tracking self.runtime_profiler.step() File "/content/vissl/vissl/utils/profiler.py", line 32, in step self.profiler.step() File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 282, in step self._stop_trace() File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 372, in _stop_trace self.profiler.exit(None, None, None) File "/usr/local/lib/python3.7/dist-packages/torch/autograd/profiler.py", line 500, in exit self.kineto_results = torch.autograd._disable_profiler() RuntimeError: Expected debug info of type 2


4. please simplify the steps as much as possible so they do not require additional resources to
   run, such as a private dataset.

## Expected behavior:

If there are no obvious error in "what you observed" provided above,
please tell us the expected behavior.

## Environment:

The aforementioned official tutorial.

## When to expect Triage

VISSL devs and contributors aim to triage issues asap however, as a general guideline, we ask users to expect triaging in 1-2 weeks.

facebookresearch / vissl

Error when using profiling: "RuntimeError: Expected debug info of type 2" #514

Instructions To Reproduce the 🐛 Bug: