Open blazejdolicki opened 2 years ago
Hi @blazejdolicki,
Thank you so much for the bug report!
I could indeed reproduce the issue. Interestingly it does not happen on all PyTorch versions. I will try to dig deeper in finding the root cause of this behaviour for PyTorch 1.8 as well as providing you with a fix.
A possible workaround for the moment is to use the legacy profiler using this flag: config.PROFILING.RUNTIME_PROFILING.LEGACY_PROFILER=True
The main issue being that the information output by the legacy profiler is less complete.
Thank you so much again ! Quentin
Hi @blazejdolicki,
I debugged the issue and found out the cause: when the training continues after the profiling is done, the error pops up because we try to use the profiler again. This only occurs with the new pytorch profiler (after Pytorch 1.8.0).
I am sending a fix right now to deal with this. In the meantime, the workaround is to terminate the training after the profiling is done by adding the following flag: config.PROFILING.STOP_TRAINING_AFTER_PROFILING=True
I will notify you when the fix reaches the main branch as well :)
Thank you again for reporting the bug ! Quentin
Hi @blazejdolicki,
It should be fixed at 373bf6bf502539a18cc1a1fbbaacdfae77822c83. If you have the chance, please tell me if you still get the issue after this.
Thank you again, Quentin
Hey Quentin,
Thanks for your reply and providing the fix. I will check in a couple days and let you know if it works.
Best, Blazej
Instructions To Reproduce the 🐛 Bug:
git diff
) or what code you wrote I opened this tutorial and in the cell that executesrun_distributed_engines.py
I addedconfig.PROFILING.RUNTIME_PROFILING.USE_PROFILER=true
to enable the profiler.####### overrides: ['hydra.verbose=true', 'config=test/integration_test/quick_simclr.yaml', 'config.DATA.TRAIN.DATA_SOURCES=[synthetic]', 'config.CHECKPOINT.DIR=/content/checkpoints', 'config.HOOKS.TENSORBOARD_SETUP.USE_TENSORBOARD=true', 'config.PROFILING.RUNTIME_PROFILING.USE_PROFILER=true'] INFO 2022-02-02 10:40:14,840 distributed_launcher.py: 184: Spawning process for node_id: 0, local_rank: 0, dist_rank: 0, dist_run_id: localhost:43839 INFO 2022-02-02 10:40:14,841 train.py: 94: Env set for rank: 0, dist_rank: 0 INFO 2022-02-02 10:40:14,841 env.py: 50: CLICOLOR: 1 INFO 2022-02-02 10:40:14,841 env.py: 50: CLOUDSDK_CONFIG: /content/.config INFO 2022-02-02 10:40:14,841 env.py: 50: CLOUDSDK_PYTHON: python3 INFO 2022-02-02 10:40:14,841 env.py: 50: COLAB_GPU: 1 INFO 2022-02-02 10:40:14,841 env.py: 50: CUDA_VERSION: 11.1.1 INFO 2022-02-02 10:40:14,842 env.py: 50: CUDNN_VERSION: 8.0.5.39 INFO 2022-02-02 10:40:14,842 env.py: 50: DATALAB_SETTINGS_OVERRIDES: {"kernelManagerProxyPort":6000,"kernelManagerProxyHost":"172.28.0.3","jupyterArgs":["--ip=\"172.28.0.2\""],"debugAdapterMultiplexerPath":"/usr/local/bin/dap_multiplexer","enableLsp":true} INFO 2022-02-02 10:40:14,842 env.py: 50: DEBIAN_FRONTEND: noninteractive INFO 2022-02-02 10:40:14,842 env.py: 50: ENABLE_DIRECTORYPREFETCHER: 1 INFO 2022-02-02 10:40:14,842 env.py: 50: ENV: /root/.bashrc INFO 2022-02-02 10:40:14,842 env.py: 50: GCE_METADATA_TIMEOUT: 0 INFO 2022-02-02 10:40:14,842 env.py: 50: GCS_READ_CACHE_BLOCK_SIZE_MB: 16 INFO 2022-02-02 10:40:14,842 env.py: 50: GIT_PAGER: cat INFO 2022-02-02 10:40:14,842 env.py: 50: GLIBCPP_FORCE_NEW: 1 INFO 2022-02-02 10:40:14,843 env.py: 50: GLIBCXX_FORCE_NEW: 1 INFO 2022-02-02 10:40:14,843 env.py: 50: HOME: /root INFO 2022-02-02 10:40:14,843 env.py: 50: HOSTNAME: 315a685d3ad9 INFO 2022-02-02 10:40:14,843 env.py: 50: JPY_PARENT_PID: 61 INFO 2022-02-02 10:40:14,843 env.py: 50: LANG: en_US.UTF-8 INFO 2022-02-02 10:40:14,843 env.py: 50: LAST_FORCED_REBUILD: 20220106 INFO 2022-02-02 10:40:14,843 env.py: 50: LD_LIBRARY_PATH: /usr/lib64-nvidia INFO 2022-02-02 10:40:14,843 env.py: 50: LD_PRELOAD: /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 INFO 2022-02-02 10:40:14,843 env.py: 50: LIBRARY_PATH: /usr/local/cuda/lib64/stubs INFO 2022-02-02 10:40:14,843 env.py: 50: LOCAL_RANK: 0 INFO 2022-02-02 10:40:14,844 env.py: 50: MPLBACKEND: module://ipykernel.pylab.backend_inline INFO 2022-02-02 10:40:14,844 env.py: 50: NCCL_VERSION: 2.7.8 INFO 2022-02-02 10:40:14,844 env.py: 50: NO_GCE_CHECK: True INFO 2022-02-02 10:40:14,844 env.py: 50: NVIDIA_DRIVER_CAPABILITIES: compute,utility INFO 2022-02-02 10:40:14,844 env.py: 50: NVIDIA_REQUIRE_CUDA: cuda>=11.1 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 brand=tesla,driver>=450,driver<451 INFO 2022-02-02 10:40:14,844 env.py: 50: NVIDIA_VISIBLE_DEVICES: all INFO 2022-02-02 10:40:14,844 env.py: 50: OLDPWD: / INFO 2022-02-02 10:40:14,844 env.py: 50: PAGER: cat INFO 2022-02-02 10:40:14,844 env.py: 50: PATH: /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/opt/bin INFO 2022-02-02 10:40:14,845 env.py: 50: PWD: /content/vissl INFO 2022-02-02 10:40:14,845 env.py: 50: PYDEVD_USE_FRAME_EVAL: NO INFO 2022-02-02 10:40:14,845 env.py: 50: PYTHONPATH: /env/python INFO 2022-02-02 10:40:14,845 env.py: 50: PYTHONWARNINGS: ignore:::pip._internal.cli.base_command INFO 2022-02-02 10:40:14,845 env.py: 50: RANK: 0 INFO 2022-02-02 10:40:14,845 env.py: 50: SHELL: /bin/bash INFO 2022-02-02 10:40:14,845 env.py: 50: SHLVL: 1 INFO 2022-02-02 10:40:14,845 env.py: 50: TBE_CREDS_ADDR: 172.28.0.1:8008 INFO 2022-02-02 10:40:14,845 env.py: 50: TBE_EPHEM_CREDS_ADDR: 172.28.0.1:8009 INFO 2022-02-02 10:40:14,846 env.py: 50: TERM: xterm-color INFO 2022-02-02 10:40:14,846 env.py: 50: TF_FORCE_GPU_ALLOW_GROWTH: true INFO 2022-02-02 10:40:14,846 env.py: 50: USE_EPHEM: 1 INFO 2022-02-02 10:40:14,846 env.py: 50: WORLDSIZE: 1 INFO 2022-02-02 10:40:14,846 env.py: 50: : /usr/bin/python3 INFO 2022-02-02 10:40:14,846 env.py: 50: __EGL_VENDOR_LIBRARY_DIRS: /usr/lib64-nvidia:/usr/share/glvnd/egl_vendor.d/ INFO 2022-02-02 10:40:14,846 misc.py: 161: Set start method of multiprocessing to forkserver INFO 2022-02-02 10:40:14,846 train.py: 105: Setting seed.... INFO 2022-02-02 10:40:14,846 misc.py: 173: MACHINE SEED: 0 INFO 2022-02-02 10:40:14,852 hydra_config.py: 132: Training with config: INFO 2022-02-02 10:40:14,863 hydra_config.py: 141: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False, 'AUTO_RESUME': True, 'BACKEND': 'disk', 'CHECKPOINT_FREQUENCY': 1, 'CHECKPOINT_ITER_FREQUENCY': -1, 'DIR': '/content/checkpoints', 'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1, 'OVERWRITE_EXISTING': True, 'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False}, 'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss', 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'SEED': 0}, 'FEATURES': {'DATASET_NAME': '', 'DATA_PARTITION': 'TRAIN', 'DIMENSIONALITY_REDUCTION': 0, 'EXTRACT': False, 'LAYER_NAME': '', 'PATH': '.', 'TEST_PARTITION': 'TEST'}, 'NUM_CLUSTERS': 16000, 'NUM_ITER': 50, 'OUTPUT_DIR': '.'}, 'DATA': {'DDP_BUCKET_CAP_MB': 25, 'ENABLE_ASYNC_GPU_COPY': True, 'NUM_DATALOADER_WORKERS': 5, 'PIN_MEMORY': True, 'TEST': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 256, 'COLLATE_FUNCTION': 'default_collate', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['imagenet1k_folder'], 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': [], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}, 'TRAIN': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 32, 'COLLATE_FUNCTION': 'simclr_collator', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '/tmp/imagenet1k', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['imagenet1k_filelist'], 'DATA_LIMIT': 500, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': ['synthetic'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': True, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [{'name': 'ImgReplicatePil', 'num_times': 2}, {'name': 'RandomResizedCrop', 'size': 224}, {'name': 'RandomHorizontalFlip', 'p': 0.5}, {'name': 'ImgPilColorDistortion', 'strength': 1.0}, {'name': 'ImgPilGaussianBlur', 'p': 0.5, 'radius_max': 2.0, 'radius_min': 0.1}, {'name': 'ToTensor'}, {'mean': [0.485, 0.456, 0.406], 'name': 'Normalize', 'std': [0.229, 0.224, 0.225]}], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}}, 'DISTRIBUTED': {'BACKEND': 'nccl', 'BROADCAST_BUFFERS': True, 'INIT_METHOD': 'tcp', 'MANUAL_GRADIENT_REDUCTION': False, 'NCCL_DEBUG': False, 'NCCL_SOCKET_NTHREADS': '', 'NUM_NODES': 1, 'NUM_PROC_PER_NODE': 1, 'RUN_ID': 'auto'}, 'EXTRACT_FEATURES': {'CHUNK_THRESHOLD': 0, 'OUTPUT_DIR': ''}, 'HOOKS': {'CHECK_NAN': True, 'LOG_GPU_STATS': True, 'MEMORY_SUMMARY': {'DUMP_MEMORY_ON_EXCEPTION': False, 'LOG_ITERATION_NUM': 0, 'PRINT_MEMORY_SUMMARY': True}, 'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False, 'INPUT_SHAPE': [3, 224, 224]}, 'PERF_STATS': {'MONITOR_PERF_STATS': True, 'PERF_STAT_FREQUENCY': 10, 'ROLLING_BTIME_FREQ': 5}, 'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'tensorboard', 'FLUSH_EVERY_N_MIN': 5, 'LOG_DIR': '.', 'LOG_PARAMS': True, 'LOG_PARAMS_EVERY_N_ITERS': 310, 'LOG_PARAMS_GRADIENTS': True, 'USE_TENSORBOARD': True}}, 'IMG_RETRIEVAL': {'CROP_QUERY_ROI': False, 'DATASET_PATH': '', 'DEBUG_MODE': False, 'EVAL_BINARY_PATH': '', 'EVAL_DATASET_NAME': 'Paris', 'FEATS_PROCESSING_TYPE': '', 'GEM_POOL_POWER': 4.0, 'IMG_SCALINGS': [1], 'NORMALIZE_FEATURES': True, 'NUM_DATABASE_SAMPLES': -1, 'NUM_QUERY_SAMPLES': -1, 'NUM_TRAINING_SAMPLES': -1, 'N_PCA': 512, 'RESIZE_IMG': 1024, 'SAVE_FEATURES': False, 'SAVE_RETRIEVAL_RANKINGS_SCORES': True, 'SIMILARITY_MEASURE': 'cosine_similarity', 'SPATIAL_LEVELS': 3, 'TRAIN_DATASET_NAME': 'Oxford', 'TRAIN_PCA_WHITENING': True, 'USE_DISTRACTORS': False, 'WHITEN_IMG_LIST': ''}, 'LOG_FREQUENCY': 1, 'LOSS': {'CrossEntropyLoss': {'ignore_index': -1}, 'barlow_twins_loss': {'embeddingdim': 8192, 'lambda': 0.0051, 'scale_loss': 0.024}, 'bce_logits_multiple_output_single_target': {'normalize_output': False, 'reduction': 'none', 'world_size': 1}, 'cross_entropy_multiple_output_single_target': {'ignore_index': -1, 'normalize_output': False, 'reduction': 'mean', 'temperature': 1.0, 'weight': None}, 'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 256, 'DROP_LAST': True, 'kmeans_iters': 10, 'memory_params': {'crops_for_mb': [0], 'embedding_dim': 128}, 'num_clusters': [3000, 3000, 3000], 'num_crops': 2, 'num_train_samples': -1, 'temperature': 0.1}, 'dino_loss': {'crops_for_teacher': [0, 1], 'ema_center': 0.9, 'momentum': 0.996, 'normalize_last_layer': True, 'output_dim': 65536, 'student_temp': 0.1, 'teacher_temp_max': 0.07, 'teacher_temp_min': 0.04, 'teacher_temp_warmup_iters': 37500}, 'moco_loss': {'embedding_dim': 128, 'momentum': 0.999, 'queue_size': 65536, 'temperature': 0.2}, 'multicrop_simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'num_crops': 2, 'temperature': 0.1}, 'name': 'simclr_info_nce_loss', 'nce_loss_with_memory': {'loss_type': 'nce', 'loss_weights': [1.0], 'memory_params': {'embedding_dim': 128, 'memory_size': -1, 'momentum': 0.5, 'norm_init': True, 'update_mem_on_forward': True}, 'negative_sampling_params': {'num_negatives': 16000, 'type': 'random'}, 'norm_constant': -1, 'norm_embedding': True, 'num_train_samples': -1, 'temperature': 0.07, 'update_mem_with_emb_index': -100}, 'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 64, 'embedding_dim': 128, 'world_size': 1}, 'temperature': 0.1}, 'swav_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'output_dir': '.', 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temp_hard_assignment_iters': 0, 'temperature': 0.1, 'use_double_precision': False}, 'swav_momentum_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'momentum': 0.99, 'momentum_eval_mode_iter_start': 0, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temperature': 0.1, 'use_double_precision': False}}, 'MACHINE': {'DEVICE': 'gpu'}, 'METERS': {'accuracy_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'enable_training_meter': True, 'mean_ap_list_meter': {'max_cpu_capacity': -1, 'meter_names': [], 'num_classes': 9605, 'num_meters': 1}, 'model_output_mask': False, 'name': '', 'names': [], 'precision_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'recall_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}}, 'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2, 'USE_ACTIVATION_CHECKPOINTING': False}, 'AMP_PARAMS': {'AMP_ARGS': {'keep_batchnorm_fp32': True, 'loss_scale': 'dynamic', 'master_weights': True, 'opt_level': 'O3'}, 'AMP_TYPE': 'apex', 'USE_AMP': False}, 'BASE_MODEL_NAME': 'multi_input_output_model', 'CUDA_CACHE': {'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100}, 'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False, 'EVAL_TRUNK_AND_HEAD': False, 'EXTRACT_TRUNK_FEATURES_ONLY': False, 'FREEZE_TRUNK_AND_HEAD': False, 'FREEZE_TRUNK_ONLY': False, 'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [], 'SHOULD_FLATTEN_FEATS': True}, 'FSDP_CONFIG': {'AUTO_WRAP_THRESHOLD': 0, 'bucket_cap_mb': 0, 'clear_autocast_cache': True, 'compute_dtype': torch.float32, 'flatten_parameters': True, 'fp32_reduce_scatter': False, 'mixed_precision': True, 'verbose': True}, 'GRAD_CLIP': {'MAX_NORM': 1, 'NORM_TYPE': 2, 'USE_GRAD_CLIP': False}, 'HEAD': {'BATCHNORM_EPS': 1e-05, 'BATCHNORM_MOMENTUM': 0.1, 'PARAMS': [['mlp', {'dims': [2048, 2048], 'skip_last_layer_relu_bn': False, 'use_relu': True}], ['mlp', {'dims': [2048, 128]}]], 'PARAMS_MULTIPLIER': 1.0}, 'INPUT_TYPE': 'rgb', 'MULTI_INPUT_HEAD_MAPPING': [], 'NON_TRAINABLE_PARAMS': [], 'SHARDED_DDP_SETUP': {'USE_SDP': False, 'reduce_buffer_size': -1}, 'SINGLE_PASS_EVERY_CROP': False, 'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': True, 'GROUP_SIZE': -1, 'SYNC_BN_TYPE': 'pytorch'}, 'TEMP_FROZEN_PARAMS_ITER_MAP': [], 'TRUNK': {'CONVIT': {'CLASS_TOKEN_IN_LOCAL_LAYERS': False, 'LOCALITY_DIM': 10, 'LOCALITY_STRENGTH': 1.0, 'N_GPSA_LAYERS': 10, 'USE_LOCAL_INIT': True}, 'EFFICIENT_NETS': {}, 'NAME': 'resnet', 'REGNET': {}, 'RESNETS': {'DEPTH': 50, 'GROUPNORM_GROUPS': 32, 'GROUPS': 1, 'LAYER4_STRIDE': 2, 'NORM': 'BatchNorm', 'STANDARDIZE_CONVOLUTIONS': False, 'WIDTH_MULTIPLIER': 1, 'WIDTH_PER_GROUP': 64, 'ZERO_INIT_RESIDUAL': False}, 'VISION_TRANSFORMERS': {'ATTENTION_DROPOUT_RATE': 0, 'CLASSIFIER': 'token', 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0, 'HIDDEN_DIM': 768, 'IMAGE_SIZE': 224, 'MLP_DIM': 3072, 'NUM_HEADS': 12, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': False, 'QK_SCALE': False, 'name': None}, 'XCIT': {'ATTENTION_DROPOUT_RATE': 0, 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0.05, 'ETA': 1, 'HIDDEN_DIM': 384, 'IMAGE_SIZE': 224, 'NUM_HEADS': 8, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': True, 'QK_SCALE': False, 'TOKENS_NORM': True, 'name': None}}, 'WEIGHTS_INIT': {'APPEND_PREFIX': '', 'PARAMS_FILE': '', 'REMOVE_PREFIX': '', 'SKIP_LAYERS': ['num_batches_tracked'], 'STATE_DICT_KEY_NAME': 'classy_state_dict'}, '_MODEL_INIT_SEED': 0}, 'MONITORING': {'MONITOR_ACTIVATION_STATISTICS': 0}, 'MULTI_PROCESSING_METHOD': 'forkserver', 'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200}, 'OPTIMIZER': {'betas': [0.9, 0.999], 'construct_single_param_group_only': False, 'head_optimizer_params': {'use_different_lr': False, 'use_different_wd': False, 'weight_decay': 1e-06}, 'larc_config': {'clip': False, 'eps': 1e-08, 'trust_coefficient': 0.001}, 'momentum': 0.9, 'name': 'sgd', 'nesterov': False, 'non_regularized_parameters': [], 'num_epochs': 2, 'param_schedulers': {'lr': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': ['rescaled', 'rescaled'], 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'composite', 'schedulers': [{'end_value': 4.8, 'name': 'linear', 'start_value': 0.6}, {'end_value': 0.0048, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 4.8, 'wave_type': 'full'}], 'start_value': 0.1, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}, 'lr_head': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': ['rescaled', 'rescaled'], 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'composite', 'schedulers': [{'end_value': 4.8, 'name': 'linear', 'start_value': 0.6}, {'end_value': 0.0048, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 4.8, 'wave_type': 'full'}], 'start_value': 0.1, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}}, 'regularize_bias': True, 'regularize_bn': False, 'use_larc': True, 'use_zero': False, 'weight_decay': 1e-06}, 'PROFILING': {'MEMORY_PROFILING': {'TRACK_BY_LAYER_MEMORY': False}, 'NUM_ITERATIONS': 10, 'OUTPUT_FOLDER': '.', 'PROFILED_RANKS': [0, 1], 'RUNTIME_PROFILING': {'LEGACY_PROFILER': False, 'PROFILE_CPU': True, 'PROFILE_GPU': True, 'USE_PROFILER': True}, 'START_ITERATION': 0, 'STOP_TRAINING_AFTER_PROFILING': False, 'WARMUP_ITERATIONS': 0}, 'REPRODUCIBILITY': {'CUDDN_DETERMINISTIC': False}, 'SEED_VALUE': 0, 'SLURM': {'ADDITIONAL_PARAMETERS': {}, 'COMMENT': 'vissl job', 'CONSTRAINT': '', 'LOG_FOLDER': '.', 'MEM_GB': 250, 'NAME': 'vissl', 'NUM_CPU_PER_PROC': 8, 'PARTITION': '', 'PORT_ID': 40050, 'TIME_HOURS': 72, 'TIME_MINUTES': 0, 'USE_SLURM': False}, 'SVM': {'cls_list': [], 'costs': {'base': -1.0, 'costs_list': [0.1, 0.01], 'power_range': [4, 20]}, 'cross_val_folds': 3, 'dual': True, 'force_retrain': False, 'loss': 'squared_hinge', 'low_shot': {'dataset_name': 'voc', 'k_values': [1, 2, 4, 8, 16, 32, 64, 96], 'sample_inds': [1, 2, 3, 4, 5]}, 'max_iter': 2000, 'normalize': True, 'penalty': 'l2'}, 'TEST_EVERY_NUM_EPOCH': 1, 'TEST_MODEL': False, 'TEST_ONLY': False, 'TRAINER': {'TASK_NAME': 'self_supervision_task', 'TRAIN_STEP_NAME': 'standard_train_step'}, 'VERBOSE': False} INFO 2022-02-02 10:40:15,878 train.py: 117: System config:
sys.platform linux Python 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0] numpy 1.19.5 Pillow 7.1.2 vissl 0.1.6 @/content/vissl/vissl GPU available True GPU 0 Tesla K80 CUDA_HOME /usr/local/cuda torchvision 0.9.0+cu101 @/usr/local/lib/python3.7/dist-packages/torchvision hydra 1.0.7 @/usr/local/lib/python3.7/dist-packages/hydra classy_vision 0.7.0.dev @/usr/local/lib/python3.7/dist-packages/classy_vision tensorboard 2.7.0 apex 0.1 @/usr/local/lib/python3.7/dist-packages/apex cv2 4.1.2 PyTorch 1.8.0+cu101 @/usr/local/lib/python3.7/dist-packages/torch PyTorch debug build False
PyTorch built with:
CPU info:
Architecture x86_64 CPU op-mode(s) 32-bit, 64-bit Byte Order Little Endian CPU(s) 2 On-line CPU(s) list 0,1 Thread(s) per core 2 Core(s) per socket 1 Socket(s) 1 NUMA node(s) 1 Vendor ID GenuineIntel CPU family 6 Model 79 Model name Intel(R) Xeon(R) CPU @ 2.20GHz Stepping 0 CPU MHz 2200.000 BogoMIPS 4400.00 Hypervisor vendor KVM Virtualization type full L1d cache 32K L1i cache 32K L2 cache 256K L3 cache 56320K NUMA node0 CPU(s) 0,1
INFO 2022-02-02 10:40:15,878 tensorboard.py: 49: Tensorboard dir: /content/checkpoints/tb_logs INFO 2022-02-02 10:40:20,647 tensorboard_hook.py: 90: Setting up SSL Tensorboard Hook... INFO 2022-02-02 10:40:20,648 tensorboard_hook.py: 103: Tensorboard config: log_params: True, log_params_freq: 310, log_params_gradients: True, log_activation_statistics: 0 /usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py:44: UserWarning: Profiler won't be using warmup, this can skew profiler results warn("Profiler won't be using warmup, this can skew profiler results") INFO 2022-02-02 10:40:20,648 trainer_main.py: 113: Using Distributed init method: tcp://localhost:43839, world_size: 1, rank: 0 INFO 2022-02-02 10:40:20,650 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 0 INFO 2022-02-02 10:40:20,650 trainer_main.py: 134: | initialized host 315a685d3ad9 as rank 0 (0) INFO 2022-02-02 10:40:22,885 train_task.py: 181: Not using Automatic Mixed Precision INFO 2022-02-02 10:40:22,886 train_task.py: 455: Building model.... INFO 2022-02-02 10:40:22,886 resnext.py: 68: ResNeXT trunk, supports activation checkpointing. Deactivated INFO 2022-02-02 10:40:22,886 resnext.py: 88: Building model: ResNeXt50-1x64d-w1-BatchNorm2d INFO 2022-02-02 10:40:23,693 model_helpers.py: 177: Using SyncBN group size: 1 INFO 2022-02-02 10:40:23,693 model_helpers.py: 192: Converting BN layers to PyTorch SyncBN INFO 2022-02-02 10:40:23,694 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 0 INFO 2022-02-02 10:40:23,703 train_task.py: 657: Broadcast model BN buffers from primary on every forward pass INFO 2022-02-02 10:40:23,703 classification_task.py: 387: Synchronized Batch Normalization is disabled INFO 2022-02-02 10:40:23,750 optimizer_helper.py: 294: Trainable params: 163, Non-Trainable params: 0, Trunk Regularized Parameters: 53, Trunk Unregularized Parameters 106, Head Regularized Parameters: 4, Head Unregularized Parameters: 0 Remaining Regularized Parameters: 0 Remaining Unregularized Parameters: 0 INFO 2022-02-02 10:40:23,751 img_replicate_pil.py: 52: ImgReplicatePil | Using num_times: 2 INFO 2022-02-02 10:40:23,751 img_pil_color_distortion.py: 56: ImgPilColorDistortion | Using strength: 1.0 INFO 2022-02-02 10:40:23,752 ssl_dataset.py: 157: Rank: 0 split: TRAIN Data files: [''] INFO 2022-02-02 10:40:23,752 ssl_dataset.py: 160: Rank: 0 split: TRAIN Label files: [] INFO 2022-02-02 10:40:23,752 misc.py: 161: Set start method of multiprocessing to forkserver INFO 2022-02-02 10:40:23,752 init.py: 126: Created the Distributed Sampler.... INFO 2022-02-02 10:40:23,752 init.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 500, 'total_size': 500, 'shuffle': True, 'seed': 0} /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 5 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary. cpuset_checked)) INFO 2022-02-02 10:40:23,753 init.py: 215: Wrapping the dataloader to async device copies INFO 2022-02-02 10:40:23,753 train_task.py: 384: Building loss... INFO 2022-02-02 10:40:23,764 simclr_info_nce_loss.py: 91: Creating Info-NCE loss on Rank: 0 INFO 2022-02-02 10:40:23,764 trainer_main.py: 268: Training 2 epochs INFO 2022-02-02 10:40:23,764 trainer_main.py: 269: One epoch = 15 iterations. INFO 2022-02-02 10:40:23,765 trainer_main.py: 270: Total 500 samples in one epoch INFO 2022-02-02 10:40:23,765 trainer_main.py: 276: Total 30 iterations for training INFO 2022-02-02 10:40:23,863 logger.py: 84: Wed Feb 2 10:40:23 2022
+-----------------------------------------------------------------------------+ | NVIDIA-SMI 495.46 Driver Version: 460.32.03 CUDA Version: 11.2 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 | | N/A 57C P0 73W / 149W | 564MiB / 11441MiB | 9% Default | | | | N/A | +-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+
INFO 2022-02-02 10:40:23,863 profiler.py: 58: Entering profiler... INFO 2022-02-02 10:40:24,203 trainer_main.py: 173: Model is: Classy <class 'vissl.models.base_ssl_model.BaseSSLMultiInputOutputModel'>: BaseSSLMultiInputOutputModel( (_heads): ModuleDict() (trunk): ResNeXt( (_feature_blocks): ModuleDict( (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv1_relu): ReLU(inplace=True) (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False) (layer1): Sequential( (0): Bottleneck( (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer2): Sequential( (0): Bottleneck( (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (3): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer3): Sequential( (0): Bottleneck( (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (3): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (4): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (5): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer4): Sequential( (0): Bottleneck( (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(<SUPPORTED_L4_STRIDE.two: 2>, <SUPPORTED_L4_STRIDE.two: 2>), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(<SUPPORTED_L4_STRIDE.two: 2>, <SUPPORTED_L4_STRIDE.two: 2>), bias=False) (1): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (avgpool): AdaptiveAvgPool2d(output_size=(1, 1)) (flatten): Flatten() ) ) (heads): ModuleList( (0): MLP( (clf): Sequential( (0): Linear(in_features=2048, out_features=2048, bias=True) (1): ReLU(inplace=True) ) ) (1): MLP( (clf): Sequential( (0): Linear(in_features=2048, out_features=128, bias=True) ) ) ) ) INFO 2022-02-02 10:40:24,206 trainer_main.py: 174: Loss is: { 'info_average': { 'dist_rank': 0, 'name': 'SimclrInfoNCECriterion', 'num_negatives': 62, 'num_pos': 2, 'temperature': 0.1}, 'name': 'SimclrInfoNCELoss'} INFO 2022-02-02 10:40:24,206 trainer_main.py: 175: Starting training.... INFO 2022-02-02 10:40:24,206 init.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 500, 'total_size': 500, 'shuffle': True, 'seed': 0} /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 5 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary. cpuset_checked)) fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath
fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath
fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath
fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath
fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath
INFO 2022-02-02 10:40:52,216 tensorboard_hook.py: 236: Logging Parameter gradients. Iteration 0 INFO 2022-02-02 10:40:55,701 tensorboard_hook.py: 255: Logging metrics. Iteration 0 INFO 2022-02-02 10:40:55,706 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 0; lr: 0.6; loss: 4.14865; btime(ms): 0; eta: 0:00:00; peak_mem(M): 8008; INFO 2022-02-02 10:40:57,472 tensorboard_hook.py: 255: Logging metrics. Iteration 1 INFO 2022-02-02 10:40:57,475 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 1; lr: 2.0; loss: 4.28503; btime(ms): 31942; eta: 0:15:26; peak_mem(M): 8008; max_iterations: 30; INFO 2022-02-02 10:40:59,249 tensorboard_hook.py: 255: Logging metrics. Iteration 2 INFO 2022-02-02 10:40:59,251 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 2; lr: 3.4; loss: 4.20582; btime(ms): 16855; eta: 0:07:51; peak_mem(M): 8008; INFO 2022-02-02 10:41:01,009 tensorboard_hook.py: 255: Logging metrics. Iteration 3 INFO 2022-02-02 10:41:01,011 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 3; lr: 3.59685; loss: 4.2537; btime(ms): 11829; eta: 0:05:19; peak_mem(M): 8008; INFO 2022-02-02 10:41:02,793 tensorboard_hook.py: 255: Logging metrics. Iteration 4 INFO 2022-02-02 10:41:02,795 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 4; lr: 3.48896; loss: 4.14138; btime(ms): 9311; eta: 0:04:02; peak_mem(M): 8008; INFO 2022-02-02 10:41:04,538 tensorboard_hook.py: 255: Logging metrics. Iteration 5 INFO 2022-02-02 10:41:04,540 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 5; lr: 3.17827; loss: 4.15546; btime(ms): 7806; eta: 0:03:15; peak_mem(M): 8008; btime(5iters)(ms): 7806; rolling_eta: 0:03:15; INFO 2022-02-02 10:41:06,290 tensorboard_hook.py: 255: Logging metrics. Iteration 6 INFO 2022-02-02 10:41:06,293 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 6; lr: 2.70209; loss: 4.15095; btime(ms): 6796; eta: 0:02:43; peak_mem(M): 8008; btime(5iters)(ms): 1766; rolling_eta: 0:00:42; INFO 2022-02-02 10:41:08,043 tensorboard_hook.py: 255: Logging metrics. Iteration 7 INFO 2022-02-02 10:41:08,047 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 7; lr: 2.11763; loss: 4.14471; btime(ms): 6075; eta: 0:02:19; peak_mem(M): 8008; btime(5iters)(ms): 1763; rolling_eta: 0:00:40; INFO 2022-02-02 10:41:09,842 tensorboard_hook.py: 255: Logging metrics. Iteration 8 INFO 2022-02-02 10:41:09,844 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 8; lr: 1.49511; loss: 4.14161; btime(ms): 5535; eta: 0:02:01; peak_mem(M): 8008; btime(5iters)(ms): 1759; rolling_eta: 0:00:38; INFO 2022-02-02 10:41:10,360 log_hooks.py: 568: Average train batch time (ms) for 10 batches: 3837 INFO 2022-02-02 10:41:10,361 log_hooks.py: 577: Train step time breakdown (rank 0): Timer Host CudaEvent read_sample: 0.87 ms 0.90 ms forward: 590.73 ms 1017.52 ms loss_compute: 1.87 ms 1.85 ms loss_all_reduce: 0.18 ms 0.19 ms backward: 709.24 ms 1630.28 ms optimizer_step: 1080.57 ms 160.52 ms train_step_total: 3616.21 ms 3616.25 ms INFO 2022-02-02 10:41:11,583 tensorboard_hook.py: 255: Logging metrics. Iteration 9 INFO 2022-02-02 10:45:47,389 profiler.py: 63: Exiting profiler... INFO 2022-02-02 10:45:47,390 profiler.py: 47: Exporting cpu_time to: /content/vissl/cpu_time_rank0.txt INFO 2022-02-02 10:45:47,391 profiler.py: 47: Exporting cuda_time to: /content/vissl/cuda_time_rank0.txt INFO 2022-02-02 10:45:47,391 profiler.py: 47: Exporting cuda_memory_usage to: /content/vissl/cuda_memory_usage_rank0.txt INFO 2022-02-02 10:45:47,392 profiler.py: 54: Exporting profiling chrome traces to: /content/vissl/profiler_chrome_trace_rank0.json INFO 2022-02-02 10:45:47,397 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 9; lr: 0.90932; loss: 4.14621; btime(ms): 5120; eta: 0:01:47; peak_mem(M): 8008; btime(5iters)(ms): 1766; rolling_eta: 0:00:37; INFO 2022-02-02 10:45:50,226 tensorboard_hook.py: 255: Logging metrics. Iteration 10 INFO 2022-02-02 10:45:50,232 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 10; lr: 0.43064; loss: 4.14273; btime(ms): 32363; eta: 0:10:47; peak_mem(M): 8008; btime(5iters)(ms): 56920; rolling_eta: 0:18:58; INFO 2022-02-02 10:45:51,921 tensorboard_hook.py: 255: Logging metrics. Iteration 11 INFO 2022-02-02 10:45:51,925 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 11; lr: 0.11656; loss: 4.13945; btime(ms): 29678; eta: 0:09:23; peak_mem(M): 8008; btime(5iters)(ms): 57138; rolling_eta: 0:18:05; INFO 2022-02-02 10:45:53,609 tensorboard_hook.py: 255: Logging metrics. Iteration 12 INFO 2022-02-02 10:45:53,614 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 12; lr: 0.00484; loss: 4.15122; btime(ms): 27346; eta: 0:08:12; peak_mem(M): 8008; btime(5iters)(ms): 57126; rolling_eta: 0:17:08; INFO 2022-02-02 10:45:55,288 tensorboard_hook.py: 255: Logging metrics. Iteration 13 INFO 2022-02-02 10:45:55,292 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 13; lr: 0.03928; loss: 4.14357; btime(ms): 25373; eta: 0:07:11; peak_mem(M): 8008; btime(5iters)(ms): 57113; rolling_eta: 0:16:10; INFO 2022-02-02 10:45:57,033 tensorboard_hook.py: 255: Logging metrics. Iteration 14 INFO 2022-02-02 10:45:57,038 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 14; lr: 0.1411; loss: 4.14553; btime(ms): 23680; eta: 0:06:18; peak_mem(M): 8008; btime(5iters)(ms): 57089; rolling_eta: 0:15:13; INFO 2022-02-02 10:45:57,039 trainer_main.py: 214: Meters synced INFO 2022-02-02 10:45:57,039 log_hooks.py: 568: Average train batch time (ms) for 15 batches: 21670 INFO 2022-02-02 10:45:57,039 log_hooks.py: 577: Train step time breakdown (rank 0): Timer Host CudaEvent read_sample: 4.65 ms 4.69 ms forward: 477.04 ms 925.43 ms loss_compute: 1.63 ms 1.61 ms loss_all_reduce: 0.16 ms 0.17 ms backward: 437.75 ms 1408.87 ms optimizer_step: 1106.35 ms 136.86 ms train_step_total: 21315.80 ms 21316.05 ms /usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:263: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats. FutureWarning) /usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:289: FutureWarning: torch.cuda.reset_max_memory_cached now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats. FutureWarning) INFO 2022-02-02 10:46:04,016 io.py: 63: Saving data to file: /content/checkpoints/metrics.json INFO 2022-02-02 10:46:04,017 io.py: 89: Saved data to file: /content/checkpoints/metrics.json INFO 2022-02-02 10:46:04,017 log_hooks.py: 426: [phase: 0] Saving checkpoint to /content/checkpoints INFO 2022-02-02 10:46:04,612 checkpoint.py: 131: Saved checkpoint: /content/checkpoints/model_phase0.torch INFO 2022-02-02 10:46:04,612 checkpoint.py: 140: Creating symlink... INFO 2022-02-02 10:46:04,613 checkpoint.py: 144: Created symlink: /content/checkpoints/checkpoint.torch INFO 2022-02-02 10:46:04,613 init.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 1, 'num_samples': 500, 'total_size': 500, 'shuffle': True, 'seed': 0} fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath
fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath
fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath
fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath
fvcore version of PathManager will be deprecated soon. Please migrate to the version in iopath repo. https://github.com/facebookresearch/iopath
INFO 2022-02-02 10:46:13,134 trainer_main.py: 333: Phase advanced. Rank: 0 INFO 2022-02-02 10:46:13,135 state_update_hooks.py: 115: Starting phase 1 [train] INFO 2022-02-02 10:46:15,175 tensorboard_hook.py: 255: Logging metrics. Iteration 15 INFO 2022-02-02 10:46:15,184 log_hooks.py: 277: Rank: 0; [ep: 1] iter: 15; lr: 0.29803; loss: 4.14155; btime(ms): 22218; eta: 0:05:33; peak_mem(M): 5676; btime(5iters)(ms): 1928; rolling_eta: 0:00:28; INFO 2022-02-02 10:46:16,944 tensorboard_hook.py: 255: Logging metrics. Iteration 16 INFO 2022-02-02 10:46:16,949 log_hooks.py: 277: Rank: 0; [ep: 1] iter: 16; lr: 0.49122; loss: 4.14258; btime(ms): 21963; eta: 0:05:07; peak_mem(M): 5676; btime(5iters)(ms): 4990; rolling_eta: 0:01:09; INFO 2022-02-02 10:46:18,640 tensorboard_hook.py: 255: Logging metrics. Iteration 17 INFO 2022-02-02 10:46:18,645 log_hooks.py: 277: Rank: 0; [ep: 1] iter: 17; lr: 0.69747; loss: 4.14387; btime(ms): 20775; eta: 0:04:30; peak_mem(M): 5676; btime(5iters)(ms): 5004; rolling_eta: 0:01:05; INFO 2022-02-02 10:46:20,348 tensorboard_hook.py: 255: Logging metrics. Iteration 18 INFO 2022-02-02 10:46:20,353 log_hooks.py: 277: Rank: 0; [ep: 1] iter: 18; lr: 0.89198; loss: 4.1382; btime(ms): 19715; eta: 0:03:56; peak_mem(M): 5676; btime(5iters)(ms): 5006; rolling_eta: 0:01:00; INFO 2022-02-02 10:46:22,029 tensorboard_hook.py: 255: Logging metrics. Iteration 19 --- Logging error --- Traceback (most recent call last): File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed hook_generator=hook_generator, File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker hook_generator=hook_generator, File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine hook_generator=hook_generator, File "/content/vissl/vissl/engines/train.py", line 46, in run_engine hook_generator=hook_generator, File "/content/vissl/vissl/engines/train.py", line 130, in train_main trainer.train() File "/content/vissl/vissl/trainer/trainer_main.py", line 211, in train raise e File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train task = train_step_fn(task) File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 228, in standard_train_step task.run_hooks(SSLClassyHookFunctions.on_update.name) File "/content/vissl/vissl/trainer/train_task.py", line 669, in run_hooks getattr(hook, hook_function_name, ClassyHook._noop)(self) File "/content/vissl/vissl/hooks/profiling_hook.py", line 113, in on_update self._runtime_tracking(iteration, task) File "/content/vissl/vissl/hooks/profiling_hook.py", line 124, in _runtime_tracking self.runtime_profiler.step() File "/content/vissl/vissl/utils/profiler.py", line 32, in step self.profiler.step() File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 282, in step self._stop_trace() File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 372, in _stop_trace self.profiler.exit(None, None, None) File "/usr/local/lib/python3.7/dist-packages/torch/autograd/profiler.py", line 500, in exit self.kineto_results = torch.autograd._disable_profiler() RuntimeError: Expected debug info of type 2
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/usr/lib/python3.7/logging/init.py", line 1025, in emit msg = self.format(record) File "/usr/lib/python3.7/logging/init.py", line 869, in format return fmt.format(record) File "/usr/lib/python3.7/logging/init.py", line 608, in format record.message = record.getMessage() File "/usr/lib/python3.7/logging/init.py", line 369, in getMessage msg = msg % self.args TypeError: not all arguments converted during string formatting Call stack: File "tools/run_distributed_engines.py", line 57, in
hydra_main(overrides=overrides)
File "tools/run_distributed_engines.py", line 46, in hydra_main
hook_generator=default_hook_generator,
File "/content/vissl/vissl/utils/distributed_launcher.py", line 162, in launch_distributed
logging.error("Wrapping up, caught exception: ", e)
Message: 'Wrapping up, caught exception: '
Arguments: (RuntimeError('Expected debug info of type 2'),)
--- Logging error ---
Traceback (most recent call last):
File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
hook_generator=hook_generator,
File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
hook_generator=hook_generator,
File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
hook_generator=hook_generator,
File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
hook_generator=hook_generator,
File "/content/vissl/vissl/engines/train.py", line 130, in train_main
trainer.train()
File "/content/vissl/vissl/trainer/trainer_main.py", line 211, in train
raise e
File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
task = train_step_fn(task)
File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 228, in standard_train_step
task.run_hooks(SSLClassyHookFunctions.on_update.name)
File "/content/vissl/vissl/trainer/train_task.py", line 669, in run_hooks
getattr(hook, hook_function_name, ClassyHook._noop)(self)
File "/content/vissl/vissl/hooks/profiling_hook.py", line 113, in on_update
self._runtime_tracking(iteration, task)
File "/content/vissl/vissl/hooks/profiling_hook.py", line 124, in _runtime_tracking
self.runtime_profiler.step()
File "/content/vissl/vissl/utils/profiler.py", line 32, in step
self.profiler.step()
File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 282, in step
self._stop_trace()
File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 372, in _stop_trace
self.profiler.exit(None, None, None)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/profiler.py", line 500, in exit
self.kineto_results = torch.autograd._disable_profiler()
RuntimeError: Expected debug info of type 2
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/usr/lib/python3.7/logging/init.py", line 1025, in emit msg = self.format(record) File "/usr/lib/python3.7/logging/init.py", line 869, in format return fmt.format(record) File "/usr/lib/python3.7/logging/init.py", line 608, in format record.message = record.getMessage() File "/usr/lib/python3.7/logging/init.py", line 369, in getMessage msg = msg % self.args TypeError: not all arguments converted during string formatting Call stack: File "tools/run_distributed_engines.py", line 57, in
hydra_main(overrides=overrides)
File "tools/run_distributed_engines.py", line 46, in hydra_main
hook_generator=default_hook_generator,
File "/content/vissl/vissl/utils/distributed_launcher.py", line 162, in launch_distributed
logging.error("Wrapping up, caught exception: ", e)
Message: 'Wrapping up, caught exception: '
Arguments: (RuntimeError('Expected debug info of type 2'),)
Traceback (most recent call last):
File "tools/run_distributed_engines.py", line 57, in
hydra_main(overrides=overrides)
File "tools/run_distributed_engines.py", line 46, in hydra_main
hook_generator=default_hook_generator,
File "/content/vissl/vissl/utils/distributed_launcher.py", line 164, in launch_distributed
raise e
File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
hook_generator=hook_generator,
File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
hook_generator=hook_generator,
File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
hook_generator=hook_generator,
File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
hook_generator=hook_generator,
File "/content/vissl/vissl/engines/train.py", line 130, in train_main
trainer.train()
File "/content/vissl/vissl/trainer/trainer_main.py", line 211, in train
raise e
File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
task = train_step_fn(task)
File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 228, in standard_train_step
task.run_hooks(SSLClassyHookFunctions.on_update.name)
File "/content/vissl/vissl/trainer/train_task.py", line 669, in run_hooks
getattr(hook, hook_function_name, ClassyHook._noop)(self)
File "/content/vissl/vissl/hooks/profiling_hook.py", line 113, in on_update
self._runtime_tracking(iteration, task)
File "/content/vissl/vissl/hooks/profiling_hook.py", line 124, in _runtime_tracking
self.runtime_profiler.step()
File "/content/vissl/vissl/utils/profiler.py", line 32, in step
self.profiler.step()
File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 282, in step
self._stop_trace()
File "/usr/local/lib/python3.7/dist-packages/torch/profiler/profiler.py", line 372, in _stop_trace
self.profiler.exit(None, None, None)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/profiler.py", line 500, in exit
self.kineto_results = torch.autograd._disable_profiler()
RuntimeError: Expected debug info of type 2