facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

Error when using profiling: "RuntimeError: Expected debug info of type 2" #514

Open · blazejdolicki opened this issue 2 years ago

blazejdolicki commented 2 years ago

Instructions To Reproduce the 🐛 Bug:

  1. What changes you made (git diff) or what code you wrote: I opened this tutorial and, in the cell that executes `run_distributed_engines.py`, I added `config.PROFILING.RUNTIME_PROFILING.USE_PROFILER=true` to enable the profiler (see the reconstructed command below).
  2. What exact command you run: After this change, I ran the notebook and got an error.
  3. What you observed (including full logs): You can open the colab notebook here and see all the outputs.
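
For reference, the overrides logged below imply an invocation along these lines (a reconstruction from the logged overrides, not the verbatim notebook cell):

```
python3 tools/run_distributed_engines.py \
    hydra.verbose=true \
    config=test/integration_test/quick_simclr.yaml \
    config.DATA.TRAIN.DATA_SOURCES=[synthetic] \
    config.CHECKPOINT.DIR=/content/checkpoints \
    config.HOOKS.TENSORBOARD_SETUP.USE_TENSORBOARD=true \
    config.PROFILING.RUNTIME_PROFILING.USE_PROFILER=true
```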
    
    ** fvcore version of PathManager will be deprecated soon. **
    ** Please migrate to the version in iopath repo. **
    https://github.com/facebookresearch/iopath 

####### overrides: ['hydra.verbose=true', 'config=test/integration_test/quick_simclr.yaml', 'config.DATA.TRAIN.DATA_SOURCES=[synthetic]', 'config.CHECKPOINT.DIR=/content/checkpoints', 'config.HOOKS.TENSORBOARD_SETUP.USE_TENSORBOARD=true', 'config.PROFILING.RUNTIME_PROFILING.USE_PROFILER=true'] INFO 2022-02-02 10:40:14,840 distributed_launcher.py: 184: Spawning process for node_id: 0, local_rank: 0, dist_rank: 0, dist_run_id: localhost:43839 INFO 2022-02-02 10:40:14,841 train.py: 94: Env set for rank: 0, dist_rank: 0 INFO 2022-02-02 10:40:14,841 env.py: 50: CLICOLOR: 1 INFO 2022-02-02 10:40:14,841 env.py: 50: CLOUDSDK_CONFIG: /content/.config INFO 2022-02-02 10:40:14,841 env.py: 50: CLOUDSDK_PYTHON: python3 INFO 2022-02-02 10:40:14,841 env.py: 50: COLAB_GPU: 1 INFO 2022-02-02 10:40:14,841 env.py: 50: CUDA_VERSION: 11.1.1 INFO 2022-02-02 10:40:14,842 env.py: 50: CUDNN_VERSION: 8.0.5.39 INFO 2022-02-02 10:40:14,842 env.py: 50: DATALAB_SETTINGS_OVERRIDES: {"kernelManagerProxyPort":6000,"kernelManagerProxyHost":"172.28.0.3","jupyterArgs":["--ip=\"172.28.0.2\""],"debugAdapterMultiplexerPath":"/usr/local/bin/dap_multiplexer","enableLsp":true} INFO 2022-02-02 10:40:14,842 env.py: 50: DEBIAN_FRONTEND: noninteractive INFO 2022-02-02 10:40:14,842 env.py: 50: ENABLE_DIRECTORYPREFETCHER: 1 INFO 2022-02-02 10:40:14,842 env.py: 50: ENV: /root/.bashrc INFO 2022-02-02 10:40:14,842 env.py: 50: GCE_METADATA_TIMEOUT: 0 INFO 2022-02-02 10:40:14,842 env.py: 50: GCS_READ_CACHE_BLOCK_SIZE_MB: 16 INFO 2022-02-02 10:40:14,842 env.py: 50: GIT_PAGER: cat INFO 2022-02-02 10:40:14,842 env.py: 50: GLIBCPP_FORCE_NEW: 1 INFO 2022-02-02 10:40:14,843 env.py: 50: GLIBCXX_FORCE_NEW: 1 INFO 2022-02-02 10:40:14,843 env.py: 50: HOME: /root INFO 2022-02-02 10:40:14,843 env.py: 50: HOSTNAME: 315a685d3ad9 INFO 2022-02-02 10:40:14,843 env.py: 50: JPY_PARENT_PID: 61 INFO 2022-02-02 10:40:14,843 env.py: 50: LANG: en_US.UTF-8 INFO 2022-02-02 10:40:14,843 env.py: 50: LAST_FORCED_REBUILD: 20220106 INFO 2022-02-02 10:40:14,843 env.py: 50: LD_LIBRARY_PATH: /usr/lib64-nvidia INFO 2022-02-02 10:40:14,843 env.py: 50: LD_PRELOAD: /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 INFO 2022-02-02 10:40:14,843 env.py: 50: LIBRARY_PATH: /usr/local/cuda/lib64/stubs INFO 2022-02-02 10:40:14,843 env.py: 50: LOCAL_RANK: 0 INFO 2022-02-02 10:40:14,844 env.py: 50: MPLBACKEND: module://ipykernel.pylab.backend_inline INFO 2022-02-02 10:40:14,844 env.py: 50: NCCL_VERSION: 2.7.8 INFO 2022-02-02 10:40:14,844 env.py: 50: NO_GCE_CHECK: True INFO 2022-02-02 10:40:14,844 env.py: 50: NVIDIA_DRIVER_CAPABILITIES: compute,utility INFO 2022-02-02 10:40:14,844 env.py: 50: NVIDIA_REQUIRE_CUDA: cuda>=11.1 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 brand=tesla,driver>=450,driver<451 INFO 2022-02-02 10:40:14,844 env.py: 50: NVIDIA_VISIBLE_DEVICES: all INFO 2022-02-02 10:40:14,844 env.py: 50: OLDPWD: / INFO 2022-02-02 10:40:14,844 env.py: 50: PAGER: cat INFO 2022-02-02 10:40:14,844 env.py: 50: PATH: /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/opt/bin INFO 2022-02-02 10:40:14,845 env.py: 50: PWD: /content/vissl INFO 2022-02-02 10:40:14,845 env.py: 50: PYDEVD_USE_FRAME_EVAL: NO INFO 2022-02-02 10:40:14,845 env.py: 50: PYTHONPATH: /env/python INFO 2022-02-02 10:40:14,845 env.py: 50: PYTHONWARNINGS: ignore:::pip._internal.cli.base_command INFO 2022-02-02 10:40:14,845 env.py: 50: RANK: 0 INFO 2022-02-02 10:40:14,845 env.py: 50: SHELL: /bin/bash INFO 
2022-02-02 10:40:14,845 env.py: 50: SHLVL: 1 INFO 2022-02-02 10:40:14,845 env.py: 50: TBE_CREDS_ADDR: 172.28.0.1:8008 INFO 2022-02-02 10:40:14,845 env.py: 50: TBE_EPHEM_CREDS_ADDR: 172.28.0.1:8009 INFO 2022-02-02 10:40:14,846 env.py: 50: TERM: xterm-color INFO 2022-02-02 10:40:14,846 env.py: 50: TF_FORCE_GPU_ALLOW_GROWTH: true INFO 2022-02-02 10:40:14,846 env.py: 50: USE_EPHEM: 1 INFO 2022-02-02 10:40:14,846 env.py: 50: WORLDSIZE: 1 INFO 2022-02-02 10:40:14,846 env.py: 50: : /usr/bin/python3 INFO 2022-02-02 10:40:14,846 env.py: 50: __EGL_VENDOR_LIBRARY_DIRS: /usr/lib64-nvidia:/usr/share/glvnd/egl_vendor.d/ INFO 2022-02-02 10:40:14,846 misc.py: 161: Set start method of multiprocessing to forkserver INFO 2022-02-02 10:40:14,846 train.py: 105: Setting seed.... INFO 2022-02-02 10:40:14,846 misc.py: 173: MACHINE SEED: 0 INFO 2022-02-02 10:40:14,852 hydra_config.py: 132: Training with config: INFO 2022-02-02 10:40:14,863 hydra_config.py: 141: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False, 'AUTO_RESUME': True, 'BACKEND': 'disk', 'CHECKPOINT_FREQUENCY': 1, 'CHECKPOINT_ITER_FREQUENCY': -1, 'DIR': '/content/checkpoints', 'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1, 'OVERWRITE_EXISTING': True, 'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False}, 'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss', 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'SEED': 0}, 'FEATURES': {'DATASET_NAME': '', 'DATA_PARTITION': 'TRAIN', 'DIMENSIONALITY_REDUCTION': 0, 'EXTRACT': False, 'LAYER_NAME': '', 'PATH': '.', 'TEST_PARTITION': 'TEST'}, 'NUM_CLUSTERS': 16000, 'NUM_ITER': 50, 'OUTPUT_DIR': '.'}, 'DATA': {'DDP_BUCKET_CAP_MB': 25, 'ENABLE_ASYNC_GPU_COPY': True, 'NUM_DATALOADER_WORKERS': 5, 'PIN_MEMORY': True, 'TEST': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 256, 'COLLATE_FUNCTION': 'default_collate', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['imagenet1k_folder'], 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': [], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}, 'TRAIN': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 32, 'COLLATE_FUNCTION': 'simclr_collator', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '/tmp/imagenet1k', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['imagenet1k_filelist'], 'DATA_LIMIT': 500, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': ['synthetic'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': True, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [{'name': 'ImgReplicatePil', 'num_times': 2}, {'name': 'RandomResizedCrop', 'size': 224}, {'name': 'RandomHorizontalFlip', 'p': 0.5}, {'name': 'ImgPilColorDistortion', 'strength': 1.0}, {'name': 'ImgPilGaussianBlur', 'p': 0.5, 'radius_max': 2.0, 'radius_min': 0.1}, {'name': 'ToTensor'}, {'mean': [0.485, 0.456, 0.406], 'name': 'Normalize', 'std': 
[0.229, 0.224, 0.225]}], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}}, 'DISTRIBUTED': {'BACKEND': 'nccl', 'BROADCAST_BUFFERS': True, 'INIT_METHOD': 'tcp', 'MANUAL_GRADIENT_REDUCTION': False, 'NCCL_DEBUG': False, 'NCCL_SOCKET_NTHREADS': '', 'NUM_NODES': 1, 'NUM_PROC_PER_NODE': 1, 'RUN_ID': 'auto'}, 'EXTRACT_FEATURES': {'CHUNK_THRESHOLD': 0, 'OUTPUT_DIR': ''}, 'HOOKS': {'CHECK_NAN': True, 'LOG_GPU_STATS': True, 'MEMORY_SUMMARY': {'DUMP_MEMORY_ON_EXCEPTION': False, 'LOG_ITERATION_NUM': 0, 'PRINT_MEMORY_SUMMARY': True}, 'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False, 'INPUT_SHAPE': [3, 224, 224]}, 'PERF_STATS': {'MONITOR_PERF_STATS': True, 'PERF_STAT_FREQUENCY': 10, 'ROLLING_BTIME_FREQ': 5}, 'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'tensorboard', 'FLUSH_EVERY_N_MIN': 5, 'LOG_DIR': '.', 'LOG_PARAMS': True, 'LOG_PARAMS_EVERY_N_ITERS': 310, 'LOG_PARAMS_GRADIENTS': True, 'USE_TENSORBOARD': True}}, 'IMG_RETRIEVAL': {'CROP_QUERY_ROI': False, 'DATASET_PATH': '', 'DEBUG_MODE': False, 'EVAL_BINARY_PATH': '', 'EVAL_DATASET_NAME': 'Paris', 'FEATS_PROCESSING_TYPE': '', 'GEM_POOL_POWER': 4.0, 'IMG_SCALINGS': [1], 'NORMALIZE_FEATURES': True, 'NUM_DATABASE_SAMPLES': -1, 'NUM_QUERY_SAMPLES': -1, 'NUM_TRAINING_SAMPLES': -1, 'N_PCA': 512, 'RESIZE_IMG': 1024, 'SAVE_FEATURES': False, 'SAVE_RETRIEVAL_RANKINGS_SCORES': True, 'SIMILARITY_MEASURE': 'cosine_similarity', 'SPATIAL_LEVELS': 3, 'TRAIN_DATASET_NAME': 'Oxford', 'TRAIN_PCA_WHITENING': True, 'USE_DISTRACTORS': False, 'WHITEN_IMG_LIST': ''}, 'LOG_FREQUENCY': 1, 'LOSS': {'CrossEntropyLoss': {'ignore_index': -1}, 'barlow_twins_loss': {'embeddingdim': 8192, 'lambda': 0.0051, 'scale_loss': 0.024}, 'bce_logits_multiple_output_single_target': {'normalize_output': False, 'reduction': 'none', 'world_size': 1}, 'cross_entropy_multiple_output_single_target': {'ignore_index': -1, 'normalize_output': False, 'reduction': 'mean', 'temperature': 1.0, 'weight': None}, 'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 256, 'DROP_LAST': True, 'kmeans_iters': 10, 'memory_params': {'crops_for_mb': [0], 'embedding_dim': 128}, 'num_clusters': [3000, 3000, 3000], 'num_crops': 2, 'num_train_samples': -1, 'temperature': 0.1}, 'dino_loss': {'crops_for_teacher': [0, 1], 'ema_center': 0.9, 'momentum': 0.996, 'normalize_last_layer': True, 'output_dim': 65536, 'student_temp': 0.1, 'teacher_temp_max': 0.07, 'teacher_temp_min': 0.04, 'teacher_temp_warmup_iters': 37500}, 'moco_loss': {'embedding_dim': 128, 'momentum': 0.999, 'queue_size': 65536, 'temperature': 0.2}, 'multicrop_simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'num_crops': 2, 'temperature': 0.1}, 'name': 'simclr_info_nce_loss', 'nce_loss_with_memory': {'loss_type': 'nce', 'loss_weights': [1.0], 'memory_params': {'embedding_dim': 128, 'memory_size': -1, 'momentum': 0.5, 'norm_init': True, 'update_mem_on_forward': True}, 'negative_sampling_params': {'num_negatives': 16000, 'type': 'random'}, 'norm_constant': -1, 'norm_embedding': True, 'num_train_samples': -1, 'temperature': 0.07, 'update_mem_with_emb_index': -100}, 'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 64, 'embedding_dim': 128, 'world_size': 1}, 'temperature': 0.1}, 'swav_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'output_dir': '.', 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temp_hard_assignment_iters': 0, 
'temperature': 0.1, 'use_double_precision': False}, 'swav_momentum_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'momentum': 0.99, 'momentum_eval_mode_iter_start': 0, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temperature': 0.1, 'use_double_precision': False}}, 'MACHINE': {'DEVICE': 'gpu'}, 'METERS': {'accuracy_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'enable_training_meter': True, 'mean_ap_list_meter': {'max_cpu_capacity': -1, 'meter_names': [], 'num_classes': 9605, 'num_meters': 1}, 'model_output_mask': False, 'name': '', 'names': [], 'precision_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'recall_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}}, 'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2, 'USE_ACTIVATION_CHECKPOINTING': False}, 'AMP_PARAMS': {'AMP_ARGS': {'keep_batchnorm_fp32': True, 'loss_scale': 'dynamic', 'master_weights': True, 'opt_level': 'O3'}, 'AMP_TYPE': 'apex', 'USE_AMP': False}, 'BASE_MODEL_NAME': 'multi_input_output_model', 'CUDA_CACHE': {'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100}, 'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False, 'EVAL_TRUNK_AND_HEAD': False, 'EXTRACT_TRUNK_FEATURES_ONLY': False, 'FREEZE_TRUNK_AND_HEAD': False, 'FREEZE_TRUNK_ONLY': False, 'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [], 'SHOULD_FLATTEN_FEATS': True}, 'FSDP_CONFIG': {'AUTO_WRAP_THRESHOLD': 0, 'bucket_cap_mb': 0, 'clear_autocast_cache': True, 'compute_dtype': torch.float32, 'flatten_parameters': True, 'fp32_reduce_scatter': False, 'mixed_precision': True, 'verbose': True}, 'GRAD_CLIP': {'MAX_NORM': 1, 'NORM_TYPE': 2, 'USE_GRAD_CLIP': False}, 'HEAD': {'BATCHNORM_EPS': 1e-05, 'BATCHNORM_MOMENTUM': 0.1, 'PARAMS': [['mlp', {'dims': [2048, 2048], 'skip_last_layer_relu_bn': False, 'use_relu': True}], ['mlp', {'dims': [2048, 128]}]], 'PARAMS_MULTIPLIER': 1.0}, 'INPUT_TYPE': 'rgb', 'MULTI_INPUT_HEAD_MAPPING': [], 'NON_TRAINABLE_PARAMS': [], 'SHARDED_DDP_SETUP': {'USE_SDP': False, 'reduce_buffer_size': -1}, 'SINGLE_PASS_EVERY_CROP': False, 'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': True, 'GROUP_SIZE': -1, 'SYNC_BN_TYPE': 'pytorch'}, 'TEMP_FROZEN_PARAMS_ITER_MAP': [], 'TRUNK': {'CONVIT': {'CLASS_TOKEN_IN_LOCAL_LAYERS': False, 'LOCALITY_DIM': 10, 'LOCALITY_STRENGTH': 1.0, 'N_GPSA_LAYERS': 10, 'USE_LOCAL_INIT': True}, 'EFFICIENT_NETS': {}, 'NAME': 'resnet', 'REGNET': {}, 'RESNETS': {'DEPTH': 50, 'GROUPNORM_GROUPS': 32, 'GROUPS': 1, 'LAYER4_STRIDE': 2, 'NORM': 'BatchNorm', 'STANDARDIZE_CONVOLUTIONS': False, 'WIDTH_MULTIPLIER': 1, 'WIDTH_PER_GROUP': 64, 'ZERO_INIT_RESIDUAL': False}, 'VISION_TRANSFORMERS': {'ATTENTION_DROPOUT_RATE': 0, 'CLASSIFIER': 'token', 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0, 'HIDDEN_DIM': 768, 'IMAGE_SIZE': 224, 'MLP_DIM': 3072, 'NUM_HEADS': 12, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': False, 'QK_SCALE': False, 'name': None}, 'XCIT': {'ATTENTION_DROPOUT_RATE': 0, 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0.05, 'ETA': 1, 'HIDDEN_DIM': 384, 'IMAGE_SIZE': 224, 'NUM_HEADS': 8, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': True, 'QK_SCALE': False, 'TOKENS_NORM': True, 'name': None}}, 'WEIGHTS_INIT': {'APPEND_PREFIX': '', 'PARAMS_FILE': '', 'REMOVE_PREFIX': '', 'SKIP_LAYERS': ['num_batches_tracked'], 'STATE_DICT_KEY_NAME': 'classy_state_dict'}, '_MODEL_INIT_SEED': 0}, 'MONITORING': {'MONITOR_ACTIVATION_STATISTICS': 0}, 
'MULTI_PROCESSING_METHOD': 'forkserver', 'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200}, 'OPTIMIZER': {'betas': [0.9, 0.999], 'construct_single_param_group_only': False, 'head_optimizer_params': {'use_different_lr': False, 'use_different_wd': False, 'weight_decay': 1e-06}, 'larc_config': {'clip': False, 'eps': 1e-08, 'trust_coefficient': 0.001}, 'momentum': 0.9, 'name': 'sgd', 'nesterov': False, 'non_regularized_parameters': [], 'num_epochs': 2, 'param_schedulers': {'lr': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': ['rescaled', 'rescaled'], 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'composite', 'schedulers': [{'end_value': 4.8, 'name': 'linear', 'start_value': 0.6}, {'end_value': 0.0048, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 4.8, 'wave_type': 'full'}], 'start_value': 0.1, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}, 'lr_head': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': ['rescaled', 'rescaled'], 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'composite', 'schedulers': [{'end_value': 4.8, 'name': 'linear', 'start_value': 0.6}, {'end_value': 0.0048, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 4.8, 'wave_type': 'full'}], 'start_value': 0.1, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}}, 'regularize_bias': True, 'regularize_bn': False, 'use_larc': True, 'use_zero': False, 'weight_decay': 1e-06}, 'PROFILING': {'MEMORY_PROFILING': {'TRACK_BY_LAYER_MEMORY': False}, 'NUM_ITERATIONS': 10, 'OUTPUT_FOLDER': '.', 'PROFILED_RANKS': [0, 1], 'RUNTIME_PROFILING': {'LEGACY_PROFILER': False, 'PROFILE_CPU': True, 'PROFILE_GPU': True, 'USE_PROFILER': True}, 'START_ITERATION': 0, 'STOP_TRAINING_AFTER_PROFILING': False, 'WARMUP_ITERATIONS': 0}, 'REPRODUCIBILITY': {'CUDDN_DETERMINISTIC': False}, 'SEED_VALUE': 0, 'SLURM': {'ADDITIONAL_PARAMETERS': {}, 'COMMENT': 'vissl job', 'CONSTRAINT': '', 'LOG_FOLDER': '.', 'MEM_GB': 250, 'NAME': 'vissl', 'NUM_CPU_PER_PROC': 8, 'PARTITION': '', 'PORT_ID': 40050, 'TIME_HOURS': 72, 'TIME_MINUTES': 0, 'USE_SLURM': False}, 'SVM': {'cls_list': [], 'costs': {'base': -1.0, 'costs_list': [0.1, 0.01], 'power_range': [4, 20]}, 'cross_val_folds': 3, 'dual': True, 'force_retrain': False, 'loss': 'squared_hinge', 'low_shot': {'dataset_name': 'voc', 'k_values': [1, 2, 4, 8, 16, 32, 64, 96], 'sample_inds': [1, 2, 3, 4, 5]}, 'max_iter': 2000, 'normalize': True, 'penalty': 'l2'}, 'TEST_EVERY_NUM_EPOCH': 1, 'TEST_MODEL': False, 'TEST_ONLY': False, 'TRAINER': {'TASK_NAME': 'self_supervision_task', 'TRAIN_STEP_NAME': 'standard_train_step'}, 'VERBOSE': False} INFO 2022-02-02 10:40:15,878 train.py: 117: System config:


sys.platform         linux
Python               3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0]
numpy                1.19.5
Pillow               7.1.2
vissl                0.1.6 @/content/vissl/vissl
GPU available        True
GPU 0                Tesla K80
CUDA_HOME            /usr/local/cuda
torchvision          0.9.0+cu101 @/usr/local/lib/python3.7/dist-packages/torchvision
hydra                1.0.7 @/usr/local/lib/python3.7/dist-packages/hydra
classy_vision        0.7.0.dev @/usr/local/lib/python3.7/dist-packages/classy_vision
tensorboard          2.7.0
apex                 0.1 @/usr/local/lib/python3.7/dist-packages/apex
cv2                  4.1.2
PyTorch              1.8.0+cu101 @/usr/local/lib/python3.7/dist-packages/torch
PyTorch debug build  False


PyTorch built with:

CPU info:


Architecture           x86_64
CPU op-mode(s)         32-bit, 64-bit
Byte Order             Little Endian
CPU(s)                 2
On-line CPU(s) list    0,1
Thread(s) per core     2
Core(s) per socket     1
Socket(s)              1
NUMA node(s)           1
Vendor ID              GenuineIntel
CPU family             6
Model                  79
Model name             Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping               0
CPU MHz                2200.000
BogoMIPS               4400.00
Hypervisor vendor      KVM
Virtualization type    full
L1d cache              32K
L1i cache              32K
L2 cache               256K
L3 cache               56320K
NUMA node0 CPU(s)      0,1


INFO 2022-02-02 10:40:15,878 tensorboard.py: 49: Tensorboard dir: /content/checkpoints/tb_logs INFO 2022-02-02 10:40:20,647 tensorboard_hook.py: 90: Setting up SSL Tensorboard Hook... INFO 2022-02-02 10:40:20,648 tensorboard_hook.py: 103: Tensorboard config: log_params: True, log_params_freq: 310, log_params_gradients: True, log_activation_statistics: 0 /usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py:44: UserWarning: Profiler won't be using warmup, this can skew profiler results warn("Profiler won't be using warmup, this can skew profiler results") INFO 2022-02-02 10:40:20,648 trainer_main.py: 113: Using Distributed init method: tcp://localhost:43839, world_size: 1, rank: 0 INFO 2022-02-02 10:40:20,650 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 0 INFO 2022-02-02 10:40:20,650 trainer_main.py: 134: | initialized host 315a685d3ad9 as rank 0 (0) INFO 2022-02-02 10:40:22,885 train_task.py: 181: Not using Automatic Mixed Precision INFO 2022-02-02 10:40:22,886 train_task.py: 455: Building model.... INFO 2022-02-02 10:40:22,886 resnext.py: 68: ResNeXT trunk, supports activation checkpointing. Deactivated INFO 2022-02-02 10:40:22,886 resnext.py: 88: Building model: ResNeXt50-1x64d-w1-BatchNorm2d INFO 2022-02-02 10:40:23,693 model_helpers.py: 177: Using SyncBN group size: 1 INFO 2022-02-02 10:40:23,693 model_helpers.py: 192: Converting BN layers to PyTorch SyncBN INFO 2022-02-02 10:40:23,694 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 0 INFO 2022-02-02 10:40:23,703 train_task.py: 657: Broadcast model BN buffers from primary on every forward pass INFO 2022-02-02 10:40:23,703 classification_task.py: 387: Synchronized Batch Normalization is disabled INFO 2022-02-02 10:40:23,750 optimizer_helper.py: 294: Trainable params: 163, Non-Trainable params: 0, Trunk Regularized Parameters: 53, Trunk Unregularized Parameters 106, Head Regularized Parameters: 4, Head Unregularized Parameters: 0 Remaining Regularized Parameters: 0 Remaining Unregularized Parameters: 0 INFO 2022-02-02 10:40:23,751 img_replicate_pil.py: 52: ImgReplicatePil | Using num_times: 2 INFO 2022-02-02 10:40:23,751 img_pil_color_distortion.py: 56: ImgPilColorDistortion | Using strength: 1.0 INFO 2022-02-02 10:40:23,752 ssl_dataset.py: 157: Rank: 0 split: TRAIN Data files: [''] INFO 2022-02-02 10:40:23,752 ssl_dataset.py: 160: Rank: 0 split: TRAIN Label files: [] INFO 2022-02-02 10:40:23,752 misc.py: 161: Set start method of multiprocessing to forkserver INFO 2022-02-02 10:40:23,752 init.py: 126: Created the Distributed Sampler.... INFO 2022-02-02 10:40:23,752 init.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 500, 'total_size': 500, 'shuffle': True, 'seed': 0} /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 5 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary. cpuset_checked)) INFO 2022-02-02 10:40:23,753 init.py: 215: Wrapping the dataloader to async device copies INFO 2022-02-02 10:40:23,753 train_task.py: 384: Building loss... 
INFO 2022-02-02 10:40:23,764 simclr_info_nce_loss.py: 91: Creating Info-NCE loss on Rank: 0 INFO 2022-02-02 10:40:23,764 trainer_main.py: 268: Training 2 epochs INFO 2022-02-02 10:40:23,764 trainer_main.py: 269: One epoch = 15 iterations. INFO 2022-02-02 10:40:23,765 trainer_main.py: 270: Total 500 samples in one epoch INFO 2022-02-02 10:40:23,765 trainer_main.py: 276: Total 30 iterations for training INFO 2022-02-02 10:40:23,863 logger.py: 84: Wed Feb 2 10:40:23 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P0    73W / 149W |    564MiB / 11441MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

INFO 2022-02-02 10:40:23,863 profiler.py: 58: Entering profiler... INFO 2022-02-02 10:40:24,203 trainer_main.py: 173: Model is: Classy <class 'vissl.models.base_ssl_model.BaseSSLMultiInputOutputModel'>: BaseSSLMultiInputOutputModel( (_heads): ModuleDict() (trunk): ResNeXt( (_feature_blocks): ModuleDict( (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv1_relu): ReLU(inplace=True) (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False) (layer1): Sequential( (0): Bottleneck( (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer2): Sequential( (0): Bottleneck( (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, 
track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (3): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer3): Sequential( (0): Bottleneck( (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (3): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) 
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (4): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (5): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer4): Sequential( (0): Bottleneck( (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(<SUPPORTED_L4_STRIDE.two: 2>, <SUPPORTED_L4_STRIDE.two: 2>), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(<SUPPORTED_L4_STRIDE.two: 2>, <SUPPORTED_L4_STRIDE.two: 2>), bias=False) (1): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (avgpool): AdaptiveAvgPool2d(output_size=(1, 1)) (flatten): Flatten() ) ) (heads): ModuleList( (0): MLP( (clf): Sequential( (0): Linear(in_features=2048, out_features=2048, bias=True) (1): ReLU(inplace=True) ) ) (1): MLP( (clf): Sequential( (0): 
Linear(in_features=2048, out_features=128, bias=True) ) ) ) ) INFO 2022-02-02 10:40:24,206 trainer_main.py: 174: Loss is: { 'info_average': { 'dist_rank': 0, 'name': 'SimclrInfoNCECriterion', 'num_negatives': 62, 'num_pos': 2, 'temperature': 0.1}, 'name': 'SimclrInfoNCELoss'} INFO 2022-02-02 10:40:24,206 trainer_main.py: 175: Starting training.... INFO 2022-02-02 10:40:24,206 init.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 500, 'total_size': 500, 'shuffle': True, 'seed': 0} /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 5 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary. cpuset_checked)) fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath

fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath

fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath

fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath

fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath

INFO 2022-02-02 10:40:31,986 trainer_main.py: 333: Phase advanced. Rank: 0 INFO 2022-02-02 10:40:31,988 log_hooks.py: 77: ========= Memory Summary at on_phase_start ======= =========================================================================== PyTorch CUDA memory summary, device ID 0
CUDA OOMs: 0 cudaMalloc retries: 0
===========================================================================
Metric Cur Usage Peak Usage Tot Alloc Tot Freed
---------------------------------------------------------------------------
Allocated memory 147151 KB 147151 KB 147152 KB 512 B
from large pool 128256 KB 128256 KB 128256 KB 0 B
from small pool 18895 KB 18895 KB 18896 KB 512 B
---------------------------------------------------------------------------
Active memory 147151 KB 147151 KB 147152 KB 512 B
from large pool 128256 KB 128256 KB 128256 KB 0 B
from small pool 18895 KB 18895 KB 18896 KB 512 B
---------------------------------------------------------------------------
GPU reserved memory 163840 KB 163840 KB 163840 KB 0 B
from large pool 141312 KB 141312 KB 141312 KB 0 B
from small pool 22528 KB 22528 KB 22528 KB 0 B
---------------------------------------------------------------------------
Non-releasable memory 16688 KB 28795 KB 92570 KB 75881 KB
from large pool 13056 KB 28160 KB 75776 KB 62720 KB
from small pool 3632 KB 3633 KB 16794 KB 13161 KB
---------------------------------------------------------------------------
Allocations 328 328 329 1
from large pool 19 19 19 0
from small pool 309 309 310 1
---------------------------------------------------------------------------
Active allocs 328 328 329 1
from large pool 19 19 19 0
from small pool 309 309 310 1
---------------------------------------------------------------------------
GPU reserved segments 17 17 17 0
from large pool 6 6 6 0
from small pool 11 11 11 0
---------------------------------------------------------------------------
Non-releasable allocs 9 9 18 9
from large pool 5 5 6 1
from small pool 4 5 12 8
===========================================================================
INFO 2022-02-02 10:40:37,293 state_update_hooks.py: 115: Starting phase 0 [train] INFO 2022-02-02 10:40:42,863 log_hooks.py: 77: ========= Memory Summary at on_forward ======= =========================================================================== PyTorch CUDA memory summary, device ID 0
CUDA OOMs: 0 cudaMalloc retries: 0
===========================================================================
Metric Cur Usage Peak Usage Tot Alloc Tot Freed
---------------------------------------------------------------------------
Allocated memory 5390 MB 7844 MB 25886 MB 20496 MB
from large pool 5370 MB 7825 MB 25866 MB 20496 MB
from small pool 19 MB 19 MB 19 MB 0 MB
---------------------------------------------------------------------------
Active memory 5390 MB 7844 MB 25886 MB 20496 MB
from large pool 5370 MB 7825 MB 25866 MB 20496 MB
from small pool 19 MB 19 MB 19 MB 0 MB
---------------------------------------------------------------------------
GPU reserved memory 8184 MB 9330 MB 15628 MB 7444 MB
from large pool 8162 MB 9308 MB 15606 MB 7444 MB
from small pool 22 MB 22 MB 22 MB 0 MB
---------------------------------------------------------------------------
Non-releasable memory 202300 KB 1563 MB 10613 MB 10415 MB
from large pool 199936 KB 1559 MB 10597 MB 10401 MB
from small pool 2364 KB 3 MB 16 MB 14 MB
---------------------------------------------------------------------------
Allocations 545 545 662 117
from large pool 124 124 164 40
from small pool 421 421 498 77
---------------------------------------------------------------------------
Active allocs 545 545 662 117
from large pool 124 124 164 40
from small pool 421 421 498 77
---------------------------------------------------------------------------
GPU reserved segments 32 33 37 5
from large pool 21 22 26 5
from small pool 11 11 11 0
---------------------------------------------------------------------------
Non-releasable allocs 18 20 96 78
from large pool 14 15 30 16
from small pool 4 5 66 62
===========================================================================
INFO 2022-02-02 10:40:52,003 log_hooks.py: 77: ========= Memory Summary at on_backward ======= =========================================================================== PyTorch CUDA memory summary, device ID 0
CUDA OOMs: 0 cudaMalloc retries: 0
===========================================================================
Metric Cur Usage Peak Usage Tot Alloc Tot Freed
---------------------------------------------------------------------------
Allocated memory 299186 KB 8008 MB 69912 MB 69620 MB
from large pool 261632 KB 7988 MB 69872 MB 69616 MB
from small pool 37554 KB 36 MB 40 MB 3 MB
---------------------------------------------------------------------------
Active memory 299186 KB 8008 MB 69912 MB 69620 MB
from large pool 261632 KB 7988 MB 69872 MB 69616 MB
from small pool 37554 KB 36 MB 40 MB 3 MB
---------------------------------------------------------------------------
GPU reserved memory 3226 MB 9330 MB 21396 MB 18170 MB
from large pool 3186 MB 9308 MB 21356 MB 18170 MB
from small pool 40 MB 40 MB 40 MB 0 MB
---------------------------------------------------------------------------
Non-releasable memory 2933 MB 3791 MB 40967 MB 38033 MB
from large pool 2930 MB 3787 MB 40936 MB 38006 MB
from small pool 3 MB 4 MB 30 MB 27 MB
---------------------------------------------------------------------------
Allocations 498 560 1146 648
from large pool 38 128 417 379
from small pool 460 462 729 269
---------------------------------------------------------------------------
Active allocs 498 560 1146 648
from large pool 38 128 417 379
from small pool 460 462 729 269
---------------------------------------------------------------------------
GPU reserved segments 31 38 51 20
from large pool 11 22 31 20
from small pool 20 20 20 0
---------------------------------------------------------------------------
Non-releasable allocs 26 29 344 318
from large pool 13 20 162 149
from small pool 13 13 182 169
===========================================================================
INFO 2022-02-02 10:40:52,216 log_hooks.py: 77: ========= Memory Summary at on_update ======= =========================================================================== PyTorch CUDA memory summary, device ID 0
CUDA OOMs: 0 cudaMalloc retries: 0
===========================================================================
Metric Cur Usage Peak Usage Tot Alloc Tot Freed
---------------------------------------------------------------------------
Allocated memory 410227 KB 8008 MB 70128 MB 69727 MB
from large pool 354048 KB 7988 MB 70051 MB 69705 MB
from small pool 56179 KB 54 MB 77 MB 22 MB
---------------------------------------------------------------------------
Active memory 410227 KB 8008 MB 70128 MB 69727 MB
from large pool 354048 KB 7988 MB 70051 MB 69705 MB
from small pool 56179 KB 54 MB 77 MB 22 MB
---------------------------------------------------------------------------
GPU reserved memory 3244 MB 9330 MB 21414 MB 18170 MB
from large pool 3186 MB 9308 MB 21356 MB 18170 MB
from small pool 58 MB 58 MB 58 MB 0 MB
---------------------------------------------------------------------------
Non-releasable memory 2843 MB 3791 MB 41085 MB 38241 MB
from large pool 2840 MB 3787 MB 41025 MB 38184 MB
from small pool 3 MB 4 MB 60 MB 57 MB
---------------------------------------------------------------------------
Allocations 661 664 3038 2377
from large pool 56 128 453 397
from small pool 605 608 2585 1980
---------------------------------------------------------------------------
Active allocs 661 664 3038 2377
from large pool 56 128 453 397
from small pool 605 608 2585 1980
---------------------------------------------------------------------------
GPU reserved segments 40 40 60 20
from large pool 11 22 31 20
from small pool 29 29 29 0
---------------------------------------------------------------------------
Non-releasable allocs 17 29 1794 1777
from large pool 11 20 162 151
from small pool 6 16 1632 1626
===========================================================================

INFO 2022-02-02 10:40:52,216 tensorboard_hook.py: 236: Logging Parameter gradients. Iteration 0 INFO 2022-02-02 10:40:55,701 tensorboard_hook.py: 255: Logging metrics. Iteration 0 INFO 2022-02-02 10:40:55,706 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 0; lr: 0.6; loss: 4.14865; btime(ms): 0; eta: 0:00:00; peak_mem(M): 8008; INFO 2022-02-02 10:40:57,472 tensorboard_hook.py: 255: Logging metrics. Iteration 1 INFO 2022-02-02 10:40:57,475 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 1; lr: 2.0; loss: 4.28503; btime(ms): 31942; eta: 0:15:26; peak_mem(M): 8008; max_iterations: 30; INFO 2022-02-02 10:40:59,249 tensorboard_hook.py: 255: Logging metrics. Iteration 2 INFO 2022-02-02 10:40:59,251 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 2; lr: 3.4; loss: 4.20582; btime(ms): 16855; eta: 0:07:51; peak_mem(M): 8008; INFO 2022-02-02 10:41:01,009 tensorboard_hook.py: 255: Logging metrics. Iteration 3 INFO 2022-02-02 10:41:01,011 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 3; lr: 3.59685; loss: 4.2537; btime(ms): 11829; eta: 0:05:19; peak_mem(M): 8008; INFO 2022-02-02 10:41:02,793 tensorboard_hook.py: 255: Logging metrics. Iteration 4 INFO 2022-02-02 10:41:02,795 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 4; lr: 3.48896; loss: 4.14138; btime(ms): 9311; eta: 0:04:02; peak_mem(M): 8008; INFO 2022-02-02 10:41:04,538 tensorboard_hook.py: 255: Logging metrics. Iteration 5 INFO 2022-02-02 10:41:04,540 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 5; lr: 3.17827; loss: 4.15546; btime(ms): 7806; eta: 0:03:15; peak_mem(M): 8008; btime(5iters)(ms): 7806; rolling_eta: 0:03:15; INFO 2022-02-02 10:41:06,290 tensorboard_hook.py: 255: Logging metrics. Iteration 6 INFO 2022-02-02 10:41:06,293 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 6; lr: 2.70209; loss: 4.15095; btime(ms): 6796; eta: 0:02:43; peak_mem(M): 8008; btime(5iters)(ms): 1766; rolling_eta: 0:00:42; INFO 2022-02-02 10:41:08,043 tensorboard_hook.py: 255: Logging metrics. Iteration 7 INFO 2022-02-02 10:41:08,047 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 7; lr: 2.11763; loss: 4.14471; btime(ms): 6075; eta: 0:02:19; peak_mem(M): 8008; btime(5iters)(ms): 1763; rolling_eta: 0:00:40; INFO 2022-02-02 10:41:09,842 tensorboard_hook.py: 255: Logging metrics. Iteration 8 INFO 2022-02-02 10:41:09,844 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 8; lr: 1.49511; loss: 4.14161; btime(ms): 5535; eta: 0:02:01; peak_mem(M): 8008; btime(5iters)(ms): 1759; rolling_eta: 0:00:38; INFO 2022-02-02 10:41:10,360 log_hooks.py: 568: Average train batch time (ms) for 10 batches: 3837 INFO 2022-02-02 10:41:10,361 log_hooks.py: 577: Train step time breakdown (rank 0): Timer Host CudaEvent read_sample: 0.87 ms 0.90 ms forward: 590.73 ms 1017.52 ms loss_compute: 1.87 ms 1.85 ms loss_all_reduce: 0.18 ms 0.19 ms backward: 709.24 ms 1630.28 ms optimizer_step: 1080.57 ms 160.52 ms train_step_total: 3616.21 ms 3616.25 ms INFO 2022-02-02 10:41:11,583 tensorboard_hook.py: 255: Logging metrics. Iteration 9 INFO 2022-02-02 10:45:47,389 profiler.py: 63: Exiting profiler... 
INFO 2022-02-02 10:45:47,390 profiler.py: 47: Exporting cpu_time to: /content/vissl/cpu_time_rank0.txt INFO 2022-02-02 10:45:47,391 profiler.py: 47: Exporting cuda_time to: /content/vissl/cuda_time_rank0.txt INFO 2022-02-02 10:45:47,391 profiler.py: 47: Exporting cuda_memory_usage to: /content/vissl/cuda_memory_usage_rank0.txt INFO 2022-02-02 10:45:47,392 profiler.py: 54: Exporting profiling chrome traces to: /content/vissl/profiler_chrome_trace_rank0.json INFO 2022-02-02 10:45:47,397 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 9; lr: 0.90932; loss: 4.14621; btime(ms): 5120; eta: 0:01:47; peak_mem(M): 8008; btime(5iters)(ms): 1766; rolling_eta: 0:00:37; INFO 2022-02-02 10:45:50,226 tensorboard_hook.py: 255: Logging metrics. Iteration 10 INFO 2022-02-02 10:45:50,232 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 10; lr: 0.43064; loss: 4.14273; btime(ms): 32363; eta: 0:10:47; peak_mem(M): 8008; btime(5iters)(ms): 56920; rolling_eta: 0:18:58; INFO 2022-02-02 10:45:51,921 tensorboard_hook.py: 255: Logging metrics. Iteration 11 INFO 2022-02-02 10:45:51,925 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 11; lr: 0.11656; loss: 4.13945; btime(ms): 29678; eta: 0:09:23; peak_mem(M): 8008; btime(5iters)(ms): 57138; rolling_eta: 0:18:05; INFO 2022-02-02 10:45:53,609 tensorboard_hook.py: 255: Logging metrics. Iteration 12 INFO 2022-02-02 10:45:53,614 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 12; lr: 0.00484; loss: 4.15122; btime(ms): 27346; eta: 0:08:12; peak_mem(M): 8008; btime(5iters)(ms): 57126; rolling_eta: 0:17:08; INFO 2022-02-02 10:45:55,288 tensorboard_hook.py: 255: Logging metrics. Iteration 13 INFO 2022-02-02 10:45:55,292 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 13; lr: 0.03928; loss: 4.14357; btime(ms): 25373; eta: 0:07:11; peak_mem(M): 8008; btime(5iters)(ms): 57113; rolling_eta: 0:16:10; INFO 2022-02-02 10:45:57,033 tensorboard_hook.py: 255: Logging metrics. Iteration 14 INFO 2022-02-02 10:45:57,038 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 14; lr: 0.1411; loss: 4.14553; btime(ms): 23680; eta: 0:06:18; peak_mem(M): 8008; btime(5iters)(ms): 57089; rolling_eta: 0:15:13; INFO 2022-02-02 10:45:57,039 trainer_main.py: 214: Meters synced INFO 2022-02-02 10:45:57,039 log_hooks.py: 568: Average train batch time (ms) for 15 batches: 21670 INFO 2022-02-02 10:45:57,039 log_hooks.py: 577: Train step time breakdown (rank 0): Timer Host CudaEvent read_sample: 4.65 ms 4.69 ms forward: 477.04 ms 925.43 ms loss_compute: 1.63 ms 1.61 ms loss_all_reduce: 0.16 ms 0.17 ms backward: 437.75 ms 1408.87 ms optimizer_step: 1106.35 ms 136.86 ms train_step_total: 21315.80 ms 21316.05 ms /usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:263: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats. FutureWarning) /usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:289: FutureWarning: torch.cuda.reset_max_memory_cached now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats. FutureWarning) INFO 2022-02-02 10:46:04,016 io.py: 63: Saving data to file: /content/checkpoints/metrics.json INFO 2022-02-02 10:46:04,017 io.py: 89: Saved data to file: /content/checkpoints/metrics.json INFO 2022-02-02 10:46:04,017 log_hooks.py: 426: [phase: 0] Saving checkpoint to /content/checkpoints INFO 2022-02-02 10:46:04,612 checkpoint.py: 131: Saved checkpoint: /content/checkpoints/model_phase0.torch INFO 2022-02-02 10:46:04,612 checkpoint.py: 140: Creating symlink... 
INFO 2022-02-02 10:46:04,613 checkpoint.py: 144: Created symlink: /content/checkpoints/checkpoint.torch INFO 2022-02-02 10:46:04,613 init.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 1, 'num_samples': 500, 'total_size': 500, 'shuffle': True, 'seed': 0} fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath

fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath

fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath

fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath

fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath

INFO 2022-02-02 10:46:13,134 trainer_main.py: 333: Phase advanced. Rank: 0 INFO 2022-02-02 10:46:13,135 state_update_hooks.py: 115: Starting phase 1 [train] INFO 2022-02-02 10:46:15,175 tensorboard_hook.py: 255: Logging metrics. Iteration 15 INFO 2022-02-02 10:46:15,184 log_hooks.py: 277: Rank: 0; [ep: 1] iter: 15; lr: 0.29803; loss: 4.14155; btime(ms): 22218; eta: 0:05:33; peak_mem(M): 5676; btime(5iters)(ms): 1928; rolling_eta: 0:00:28; INFO 2022-02-02 10:46:16,944 tensorboard_hook.py: 255: Logging metrics. Iteration 16 INFO 2022-02-02 10:46:16,949 log_hooks.py: 277: Rank: 0; [ep: 1] iter: 16; lr: 0.49122; loss: 4.14258; btime(ms): 21963; eta: 0:05:07; peak_mem(M): 5676; btime(5iters)(ms): 4990; rolling_eta: 0:01:09; INFO 2022-02-02 10:46:18,640 tensorboard_hook.py: 255: Logging metrics. Iteration 17 INFO 2022-02-02 10:46:18,645 log_hooks.py: 277: Rank: 0; [ep: 1] iter: 17; lr: 0.69747; loss: 4.14387; btime(ms): 20775; eta: 0:04:30; peak_mem(M): 5676; btime(5iters)(ms): 5004; rolling_eta: 0:01:05; INFO 2022-02-02 10:46:20,348 tensorboard_hook.py: 255: Logging metrics. Iteration 18 INFO 2022-02-02 10:46:20,353 log_hooks.py: 277: Rank: 0; [ep: 1] iter: 18; lr: 0.89198; loss: 4.1382; btime(ms): 19715; eta: 0:03:56; peak_mem(M): 5676; btime(5iters)(ms): 5006; rolling_eta: 0:01:00; INFO 2022-02-02 10:46:22,029 tensorboard_hook.py: 255: Logging metrics. Iteration 19
--- Logging error ---
Traceback (most recent call last):
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
    hook_generator=hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/content/vissl/vissl/trainer/trainer_main.py", line 211, in train
    raise e
  File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 228, in standard_train_step
    task.run_hooks(SSLClassyHookFunctions.on_update.name)
  File "/content/vissl/vissl/trainer/train_task.py", line 669, in run_hooks
    getattr(hook, hook_function_name, ClassyHook._noop)(self)
  File "/content/vissl/vissl/hooks/profiling_hook.py", line 113, in on_update
    self._runtime_tracking(iteration, task)
  File "/content/vissl/vissl/hooks/profiling_hook.py", line 124, in _runtime_tracking
    self.runtime_profiler.step()
  File "/content/vissl/vissl/utils/profiler.py", line 32, in step
    self.profiler.step()
  File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 282, in step
    self._stop_trace()
  File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 372, in _stop_trace
    self.profiler.__exit__(None, None, None)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/profiler.py", line 500, in __exit__
    self.kineto_results = torch.autograd._disable_profiler()
RuntimeError: Expected debug info of type 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1025, in emit
    msg = self.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 869, in format
    return fmt.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 608, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.7/logging/__init__.py", line 369, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "tools/run_distributed_engines.py", line 57, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 46, in hydra_main
    hook_generator=default_hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 162, in launch_distributed
    logging.error("Wrapping up, caught exception: ", e)
Message: 'Wrapping up, caught exception: '
Arguments: (RuntimeError('Expected debug info of type 2'),)
--- Logging error ---
Traceback (most recent call last):
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
    hook_generator=hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/content/vissl/vissl/trainer/trainer_main.py", line 211, in train
    raise e
  File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 228, in standard_train_step
    task.run_hooks(SSLClassyHookFunctions.on_update.name)
  File "/content/vissl/vissl/trainer/train_task.py", line 669, in run_hooks
    getattr(hook, hook_function_name, ClassyHook._noop)(self)
  File "/content/vissl/vissl/hooks/profiling_hook.py", line 113, in on_update
    self._runtime_tracking(iteration, task)
  File "/content/vissl/vissl/hooks/profiling_hook.py", line 124, in _runtime_tracking
    self.runtime_profiler.step()
  File "/content/vissl/vissl/utils/profiler.py", line 32, in step
    self.profiler.step()
  File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 282, in step
    self._stop_trace()
  File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 372, in _stop_trace
    self.profiler.__exit__(None, None, None)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/profiler.py", line 500, in __exit__
    self.kineto_results = torch.autograd._disable_profiler()
RuntimeError: Expected debug info of type 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1025, in emit
    msg = self.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 869, in format
    return fmt.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 608, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.7/logging/__init__.py", line 369, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "tools/run_distributed_engines.py", line 57, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 46, in hydra_main
    hook_generator=default_hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 162, in launch_distributed
    logging.error("Wrapping up, caught exception: ", e)
Message: 'Wrapping up, caught exception: '
Arguments: (RuntimeError('Expected debug info of type 2'),)
Traceback (most recent call last):
  File "tools/run_distributed_engines.py", line 57, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 46, in hydra_main
    hook_generator=default_hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 164, in launch_distributed
    raise e
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
    hook_generator=hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/content/vissl/vissl/trainer/trainer_main.py", line 211, in train
    raise e
  File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 228, in standard_train_step
    task.run_hooks(SSLClassyHookFunctions.on_update.name)
  File "/content/vissl/vissl/trainer/train_task.py", line 669, in run_hooks
    getattr(hook, hook_function_name, ClassyHook._noop)(self)
  File "/content/vissl/vissl/hooks/profiling_hook.py", line 113, in on_update
    self._runtime_tracking(iteration, task)
  File "/content/vissl/vissl/hooks/profiling_hook.py", line 124, in _runtime_tracking
    self.runtime_profiler.step()
  File "/content/vissl/vissl/utils/profiler.py", line 32, in step
    self.profiler.step()
  File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 282, in step
    self._stop_trace()
  File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 372, in _stop_trace
    self.profiler.__exit__(None, None, None)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/profiler.py", line 500, in __exit__
    self.kineto_results = torch.autograd._disable_profiler()
RuntimeError: Expected debug info of type 2


4. Please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.

## Expected behavior:

If there are no obvious errors in "what you observed" provided above, please tell us the expected behavior.

## Environment:

The aforementioned official tutorial.

## When to expect Triage

VISSL devs and contributors aim to triage issues ASAP; however, as a general guideline, we ask users to expect triaging within 1-2 weeks.

QuentinDuval commented 2 years ago

Hi @blazejdolicki,

Thank you so much for the bug report!

I could indeed reproduce the issue. Interestingly, it does not happen on all PyTorch versions. I will try to dig deeper to find the root cause of this behaviour on PyTorch 1.8, as well as provide you with a fix.

A possible workaround for the moment is to fall back to the legacy profiler by setting this flag: `config.PROFILING.RUNTIME_PROFILING.LEGACY_PROFILER=True`
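
For example (the base command here is an assumption reconstructed from the overrides logged above, not a verbatim cell from the tutorial):

```
python3 tools/run_distributed_engines.py \
    config=test/integration_test/quick_simclr.yaml \
    config.PROFILING.RUNTIME_PROFILING.USE_PROFILER=true \
    config.PROFILING.RUNTIME_PROFILING.LEGACY_PROFILER=True
```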

The main drawback is that the information output by the legacy profiler is less complete.

Thank you so much again!
Quentin

QuentinDuval commented 2 years ago

Hi @blazejdolicki,

I debugged the issue and found the cause: when training continues after the profiling is done, the error pops up because we try to use the profiler again. This only occurs with the new PyTorch profiler (introduced in PyTorch 1.8.0).
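
To illustrate the failure mode, here is a minimal sketch of the kind of guard that prevents it. All names here are hypothetical and this is not the actual VISSL patch; the point is only that once the profiler has been exited, no further step() calls should reach it:

```python
class GuardedProfilerHook:
    """Hypothetical sketch (not the actual VISSL fix): forward step()
    calls to the runtime profiler only while the profiling window is
    open, and never touch the profiler once it has been exited."""

    def __init__(self, profiler, num_iterations):
        self.profiler = profiler          # e.g. a torch.profiler.profile object
        self.remaining = num_iterations   # cf. config.PROFILING.NUM_ITERATIONS
        self.finished = False

    def on_update(self):
        if self.finished:
            # The profiler has already been exited. Stepping it again
            # from here is what raised "Expected debug info of type 2".
            return
        self.profiler.step()
        self.remaining -= 1
        if self.remaining == 0:
            # Tear the trace down exactly once, then go inert.
            self.profiler.__exit__(None, None, None)
            self.finished = True
```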

I am sending a fix right now to deal with this. In the meantime, the workaround is to terminate training after the profiling is done by adding the following flag: `config.PROFILING.STOP_TRAINING_AFTER_PROFILING=True`
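
Concretely (again assuming the reconstructed base command from above):

```
python3 tools/run_distributed_engines.py \
    config=test/integration_test/quick_simclr.yaml \
    config.PROFILING.RUNTIME_PROFILING.USE_PROFILER=true \
    config.PROFILING.STOP_TRAINING_AFTER_PROFILING=True
```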

I will notify you when the fix reaches the main branch as well :)

Thank you again for reporting the bug!
Quentin

QuentinDuval commented 2 years ago

Hi @blazejdolicki,

It should be fixed by commit 373bf6bf502539a18cc1a1fbbaacdfae77822c83. If you have the chance, please tell me whether you still get the issue after this change.

Thank you again, Quentin

blazejdolicki commented 2 years ago

Hey Quentin,

Thanks for your reply and for providing the fix. I will check in a couple of days and let you know if it works.

Best, Blazej