Closed muzi-8 closed 3 years ago
Hi @muzi-8 -- what is the hydra version?
`pip list`
Can you make sure it is <=1.0.7?
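To check what is installed, `pip list | grep hydra` (or `python -c "import hydra; print(hydra.__version__)"`) shows the version. The comparison itself is just a numeric tuple compare; here is a minimal sketch (the helper names are made up for illustration, and the version string is hardcoded rather than read from the installed package):

```python
# Minimal sketch: compare an installed version string against the 1.0.7 pin.
# In practice you would read the string from `pip show hydra-core` or
# `hydra.__version__` instead of hardcoding it.

def parse_version(v: str) -> tuple:
    """Turn '1.0.7' into (1, 0, 7) so versions compare numerically, not lexically."""
    return tuple(int(part) for part in v.split("."))

def hydra_is_supported(installed: str, pin: str = "1.0.7") -> bool:
    """VISSL (at the time of this issue) requires hydra-core <= 1.0.7."""
    return parse_version(installed) <= parse_version(pin)

print(hydra_is_supported("1.0.7"))  # True
print(hydra_is_supported("1.1.1"))  # False -> downgrade before running the tutorials
```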
BTW -- we are in the process of fixing these tutorials.
Hey @iseessel, Thanks for the help. I was having the same issue on a different tutorial notebook and downgrading hydra solved the issue.
Hi @iseessel , Thanks very much! The problem was indeed caused by an incompatible version of the hydra library. It would help if the official documentation described the required environment more clearly. Have a nice day!
At least one issue with using hydra-core 1.1.1 (which is installed by default if you follow the installation guide - install.md) seems to be that the TRANSFORMS are not loaded and end up empty.
For example, from the tutorial:
...
'TRAIN': {'BATCHSIZE_PER_REPLICA': 2,
'COLLATE_FUNCTION': 'default_collate',
'COLLATE_FUNCTION_PARAMS': {},
'COPY_DESTINATION_DIR': '',
'COPY_TO_LOCAL_DISK': False,
'DATASET_NAMES': ['dummy_data_folder'],
'DATA_LIMIT': -1,
'DATA_PATHS': [],
'DATA_SOURCES': ['disk_folder'],
'DEFAULT_GRAY_IMG_SIZE': 224,
'DROP_LAST': False,
'ENABLE_QUEUE_DATASET': False,
'INPUT_KEY_NAMES': ['data'],
'LABEL_PATHS': [],
'LABEL_SOURCES': ['disk_folder'],
'LABEL_TYPE': 'standard',
'MMAP_MODE': True,
'TARGET_KEY_NAMES': ['label'],
'TRANSFORMS': [], <<<<<<<<<<<<<<<<<<
'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}},
Downgrading to 1.0.7 fixes this issue for me.
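For reference, when config composition works, that section is not empty. A populated TRANSFORMS list uses VISSL's name-keyed format and looks roughly like the fragment below (the exact transforms differ per tutorial; these entries are illustrative, not copied from the tutorial config):

```yaml
TRANSFORMS:
  - name: RandomResizedCrop
    size: 224
  - name: RandomHorizontalFlip
  - name: ToTensor
  - name: Normalize
    mean: [0.485, 0.456, 0.406]
    std: [0.229, 0.224, 0.225]
```

The important entry is the ToTensor-style conversion: without it the dataloader hands raw PIL images to the collate function.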
Hi yes, we don't support Hydra 1.1 currently -- we will have support for it in our upcoming release, including fixing the tutorials.
Please use 1.0.7 for now! Sorry for the inconvenience everyone :).
Hi @muzi-8 @matteopilotto @pdmct we have released a new version of VISSL, along with fully-functioning tutorials!
Please see: https://vissl.ai/tutorials/Understanding_VISSL_Training_and_YAML_Config_V0_1_6.
Thank you!
Defaults list is missing `_self_`. See https://hydra.cc/docs/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
INFO 2021-09-09 07:45:35,261 __init__.py: 32: Provided Config has latest version: 1
INFO 2021-09-09 07:45:35,262 run_distributed_engines.py: 163: Spawning process for node_id: 0, local_rank: 0, dist_rank: 0, dist_run_id: localhost:55674
INFO 2021-09-09 07:45:35,262 train.py: 66: Env set for rank: 0, dist_rank: 0
INFO 2021-09-09 07:45:35,263 env.py: 41: BASH_ENV: /etc/bash.bashrc
INFO 2021-09-09 07:45:35,263 env.py: 41: CLICOLOR: 1
INFO 2021-09-09 07:45:35,263 env.py: 41: COCOAPI_VERSION: 2.0+nv0.4.0
INFO 2021-09-09 07:45:35,263 env.py: 41: CUBLAS_VERSION: 11.2.1.74
INFO 2021-09-09 07:45:35,263 env.py: 41: CUDA_CACHE_DISABLE: 1
INFO 2021-09-09 07:45:35,263 env.py: 41: CUDA_DRIVER_VERSION: 455.23.05
INFO 2021-09-09 07:45:35,263 env.py: 41: CUDA_VERSION: 11.1.0.024
INFO 2021-09-09 07:45:35,263 env.py: 41: CUDNN_VERSION: 8.0.4.30
INFO 2021-09-09 07:45:35,263 env.py: 41: CUFFT_VERSION: 10.3.0.74
INFO 2021-09-09 07:45:35,263 env.py: 41: CURAND_VERSION: 10.2.2.74
INFO 2021-09-09 07:45:35,263 env.py: 41: CUSOLVER_VERSION: 11.0.0.74
INFO 2021-09-09 07:45:35,263 env.py: 41: CUSPARSE_VERSION: 11.2.0.275
INFO 2021-09-09 07:45:35,263 env.py: 41: DALI_BUILD: 1608709
INFO 2021-09-09 07:45:35,263 env.py: 41: DALI_VERSION: 0.26.0
INFO 2021-09-09 07:45:35,263 env.py: 41: DLPROF_VERSION: 20.10
INFO 2021-09-09 07:45:35,263 env.py: 41: ENV: /etc/shinit_v2
INFO 2021-09-09 07:45:35,263 env.py: 41: GIT_PAGER: cat
INFO 2021-09-09 07:45:35,264 env.py: 41: HOME: /root
INFO 2021-09-09 07:45:35,264 env.py: 41: HOSTNAME: 45c76449232e
INFO 2021-09-09 07:45:35,264 env.py: 41: JPY_PARENT_PID: 35131
INFO 2021-09-09 07:45:35,264 env.py: 41: JUPYTER_PORT: 8888
INFO 2021-09-09 07:45:35,264 env.py: 41: LC_ALL: C.UTF-8
INFO 2021-09-09 07:45:35,264 env.py: 41: LD_LIBRARY_PATH: /usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
INFO 2021-09-09 07:45:35,264
env.py: 41: LESSCLOSE: /usr/bin/lesspipe %s %s INFO 2021-09-09 07:45:35,264 env.py: 41: LESSOPEN:INFO 2021-09-09 07:45:35,266 env.py: 41: TRT_VERSION: 7.2.1.4 INFO 2021-09-09 07:45:35,266 env.py: 41: WORLDSIZE: 1 INFO 2021-09-09 07:45:35,266 env.py: 41: : /opt/conda/bin/python3 INFO 2021-09-09 07:45:35,266 env.py: 41: _CUDA_COMPAT_PATH: /usr/local/cuda/compat INFO 2021-09-09 07:45:35,266 misc.py: 86: Set start method of multiprocessing to forkserver INFO 2021-09-09 07:45:35,266 train.py: 77: Setting seed.... INFO 2021-09-09 07:45:35,266 misc.py: 99: MACHINE SEED: 0 INFO 2021-09-09 07:45:35,911 hydra_config.py: 140: Training with config: INFO 2021-09-09 07:45:35,924 hydra_config.py: 144: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False, 'AUTO_RESUME': True, 'BACKEND': 'disk', 'CHECKPOINT_FREQUENCY': 1, 'CHECKPOINT_ITER_FREQUENCY': -1, 'DIR': './checkpoints', 'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1, 'OVERWRITE_EXISTING': False, 'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False}, 'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss', 'FEATURES': {'DATASET_NAME': '', 'DATA_PARTITION': 'TRAIN', 'LAYER_NAME': ''}, 'NUM_CLUSTERS': 16000, 'N_ITER': 50}, 'DATA': {'DDP_BUCKET_CAP_MB': 25, 'ENABLE_ASYNC_GPU_COPY': True, 'NUM_DATALOADER_WORKERS': 4, 'PIN_MEMORY': True, 'TEST': {'BATCHSIZE_PER_REPLICA': 2, 'COLLATE_FUNCTION': 'default_collate', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['dummy_data_folder'], 'DATA_LIMIT': -1, 'DATA_PATHS': ['./dummy_data/val'], 'DATA_SOURCES': ['disk_folder'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': ['disk_folder'], 'LABEL_TYPE': 'standard', 'MMAP_MODE': True, 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [], 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}, 'TRAIN': {'BATCHSIZE_PER_REPLICA': 2, 'COLLATE_FUNCTION': 'default_collate', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '', 
'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['dummy_data_folder'], 'DATA_LIMIT': -1, 'DATA_PATHS': ['./dummy_data/train'], 'DATA_SOURCES': ['disk_folder'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': ['disk_folder'], 'LABEL_TYPE': 'standard', 'MMAP_MODE': True, 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [], 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}}, 'DISTRIBUTED': {'BACKEND': 'nccl', 'BROADCAST_BUFFERS': True, 'INIT_METHOD': 'tcp', 'MANUAL_GRADIENT_REDUCTION': False, 'NCCL_DEBUG': False, 'NCCL_SOCKET_NTHREADS': '', 'NUM_NODES': 1, 'NUM_PROC_PER_NODE': 1, 'RUN_ID': 'auto'}, 'IMG_RETRIEVAL': {'DATASET_PATH': '', 'EVAL_BINARY_PATH': '', 'EVAL_DATASET_NAME': 'Paris', 'FEATS_PROCESSING_TYPE': '', 'GEM_POOL_POWER': 4.0, 'N_PCA': 512, 'RESIZE_IMG': 1024, 'SHOULD_TRAIN_PCA_OR_WHITENING': True, 'SPATIAL_LEVELS': 3, 'TEMP_DIR': '/tmp/instance_retrieval/', 'TRAIN_DATASET_NAME': 'Oxford', 'WHITEN_IMG_LIST': ''}, 'LOG_FREQUENCY': 10, 'LOSS': {'CrossEntropyLoss': {'ignore_index': -1}, 'bce_logits_multiple_output_single_target': {'normalize_output': False, 'reduction': 'none', 'world_size': 1}, 'cross_entropy_multiple_output_single_target': {'ignore_index': -1, 'normalize_output': False, 'reduction': 'mean', 'temperature': 1.0, 'weight': None}, 'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 256, 'DROP_LAST': True, 'kmeans_iters': 10, 'memory_params': {'crops_for_mb': [0], 'embedding_dim': 128}, 'num_clusters': [3000, 3000, 3000], 'num_crops': 2, 'num_train_samples': -1, 'temperature': 0.1}, 'moco_loss': {'embedding_dim': 128, 'momentum': 0.999, 'queue_size': 65536, 'temperature': 0.2}, 'multicrop_simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'num_crops': 2, 'temperature': 0.1}, 'name': 'CrossEntropyLoss', 'nce_loss_with_memory': {'loss_type': 'nce', 'loss_weights': [1.0], 'memory_params': {'embedding_dim': 
128, 'memory_size': -1, 'momentum': 0.5, 'norm_init': True, 'update_mem_on_forward': True}, 'negative_sampling_params': {'num_negatives': 16000, 'type': 'random'}, 'norm_constant': -1, 'norm_embedding': True, 'num_train_samples': -1, 'temperature': 0.07, 'update_mem_with_emb_index': -100}, 'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'temperature': 0.1}, 'swav_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'output_dir': '', 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temp_hard_assignment_iters': 0, 'temperature': 0.1, 'use_double_precision': False}, 'swav_momentum_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'momentum': 0.99, 'momentum_eval_mode_iter_start': 0, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temperature': 0.1, 'use_double_precision': False}}, 'MACHINE': {'DEVICE': 'gpu'}, 'METERS': {'accuracy_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'enable_training_meter': True, 'mean_ap_list_meter': {'max_cpu_capacity': -1, 'meter_names': [], 'num_classes': 9605, 'num_meters': 1}, 'name': ''}, 'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2, 'USE_ACTIVATION_CHECKPOINTING': False}, 'AMP_PARAMS': {'AMP_ARGS': {'opt_level': 'O1'}, 'AMP_TYPE': 'apex', 'USE_AMP': False}, 'CUDA_CACHE': {'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100}, 'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False, 'EVAL_TRUNK_AND_HEAD': False, 'EXTRACT_TRUNK_FEATURES_ONLY': False, 'FREEZE_TRUNK_AND_HEAD': False, 'FREEZE_TRUNK_ONLY': False, 'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [], 'SHOULD_FLATTEN_FEATS': True}, 'HEAD': {'BATCHNORM_EPS': 1e-05, 'BATCHNORM_MOMENTUM': 0.1, 'PARAMS': [], 
'PARAMS_MULTIPLIER': 1.0}, 'INPUT_TYPE': 'rgb', 'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False, 'INPUT_SHAPE': [3, 224, 224]}, 'MULTI_INPUT_HEAD_MAPPING': [], 'NON_TRAINABLE_PARAMS': [], 'SINGLE_PASS_EVERY_CROP': False, 'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': False, 'GROUP_SIZE': -1, 'SYNC_BN_TYPE': 'pytorch'}, 'TEMP_FROZEN_PARAMS_ITER_MAP': [], 'TRUNK': {'NAME': 'resnet', 'TRUNK_PARAMS': {'EFFICIENT_NETS': {}, 'REGNET': {}, 'RESNETS': {'DEPTH': 50, 'GROUPS': 1, 'LAYER4_STRIDE': 2, 'NORM': 'BatchNorm', 'WIDTH_MULTIPLIER': 1, 'WIDTH_PER_GROUP': 64, 'ZERO_INIT_RESIDUAL': False}}}, 'WEIGHTS_INIT': {'APPEND_PREFIX': '', 'PARAMS_FILE': '', 'REMOVE_PREFIX': '', 'SKIP_LAYERS': ['num_batches_tracked'], 'STATE_DICT_KEY_NAME': 'classy_state_dict'}}, 'MONITOR_PERF_STATS': False, 'MULTI_PROCESSING_METHOD': 'forkserver', 'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200}, 'OPTIMIZER': {'head_optimizer_params': {'use_different_lr': False, 'use_different_wd': False, 'weight_decay': 0.0001}, 'larc_config': {'clip': False, 'eps': 1e-08, 'trust_coefficient': 0.001}, 'momentum': 0.9, 'name': 'sgd', 'nesterov': False, 'num_epochs': 2, 'param_schedulers': {'lr': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.1}, 'end_value': 0.0, 'interval_scaling': [], 'lengths': [], 'milestones': [1], 'name': 'multistep', 'schedulers': [], 'start_value': 0.1, 'update_interval': 'epoch', 'value': 0.1, 'values': [0.01, 0.001]}, 'lr_head': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.1}, 'end_value': 0.0, 'interval_scaling': [], 'lengths': [], 'milestones': [1], 'name': 'multistep', 'schedulers': [], 'start_value': 0.1, 'update_interval': 'epoch', 'value': 0.1, 'values': [0.01, 0.001]}}, 'regularize_bias': True, 'regularize_bn': False, 'use_larc': False, 'weight_decay': 0.0001}, 'PERF_STAT_FREQUENCY': -1, 'ROLLING_BTIME_FREQ': -1, 'SEED_VALUE': 0, 'SVM': {'cls_list': [], 'costs': {'base': -1.0, 
'costs_list': [0.1, 0.01], 'power_range': [4, 20]}, 'cross_val_folds': 3, 'dual': True, 'force_retrain': False, 'loss': 'squared_hinge', 'low_shot': {'dataset_name': 'voc', 'k_values': [1, 2, 4, 8, 16, 32, 64, 96], 'sample_inds': [1, 2, 3, 4, 5]}, 'max_iter': 2000, 'normalize': True, 'penalty': 'l2'}, 'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'tensorboard', 'FLUSH_EVERY_N_MIN': 5, 'LOG_DIR': '.', 'LOG_PARAMS': True, 'LOG_PARAMS_EVERY_N_ITERS': 310, 'LOG_PARAMS_GRADIENTS': True, 'USE_TENSORBOARD': True}, 'TEST_EVERY_NUM_EPOCH': 1, 'TEST_MODEL': True, 'TEST_ONLY': False, 'TRAINER': {'TASK_NAME': 'self_supervision_task', 'TRAIN_STEP_NAME': 'standard_train_step'}, 'VERBOSE': False} INFO 2021-09-09 07:45:36,905 train.py: 89: System config:
PyTorch built with:
CPU info:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              76
On-line CPU(s) list: 0-75
Thread(s) per core:  1
Core(s) per socket:  38
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Genuine Intel(R) CPU $0000%@
Stepping:            6
CPU MHz:             3400.000
CPU max MHz:         3400.0000
CPU min MHz:         800.0000
BogoMIPS:            4400.00
Virtualization:      VT-x
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            58368K
NUMA node0 CPU(s):   0-37
NUMA node1 CPU(s):   38-75
INFO 2021-09-09 07:45:36,906 tensorboard.py: 46: Tensorboard dir: ./checkpoints/tb_logs INFO 2021-09-09 07:45:36,907 tensorboard_hook.py: 61: Setting up SSL Tensorboard Hook... INFO 2021-09-09 07:45:36,908 tensorboard_hook.py: 67: Tensorboard config: log_params: True, log_params_freq: 310, log_params_gradients: True INFO 2021-09-09 07:45:36,908 train_task.py: 192: Not using Automatic Mixed Precision INFO 2021-09-09 07:45:36,908 trainer_main.py: 109: Using Distributed init method: tcp://localhost:55674, world_size: 1, rank: 0 INFO 2021-09-09 07:45:36,909 trainer_main.py: 130: | initialized host 45c76449232e as rank 0 (0) INFO 2021-09-09 07:45:36,909 ssl_dataset.py: 130: Rank: 0 split: TEST Data files: ['./dummy_data/val'] INFO 2021-09-09 07:45:36,909 ssl_dataset.py: 133: Rank: 0 split: TEST Label files: ['./dummy_data/val'] INFO 2021-09-09 07:45:36,910 disk_dataset.py: 81: Loaded 12 samples from folder ./dummy_data/val INFO 2021-09-09 07:45:36,910 ssl_dataset.py: 130: Rank: 0 split: TRAIN Data files: ['./dummy_data/train'] INFO 2021-09-09 07:45:36,910 ssl_dataset.py: 133: Rank: 0 split: TRAIN Label files: ['./dummy_data/train'] INFO 2021-09-09 07:45:36,910 disk_dataset.py: 81: Loaded 12 samples from folder ./dummy_data/train INFO 2021-09-09 07:45:36,910 misc.py: 86: Set start method of multiprocessing to forkserver INFO 2021-09-09 07:45:36,910 init.py: 91: Created the Distributed Sampler.... INFO 2021-09-09 07:45:36,911 init.py: 72: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 12, 'total_size': 12, 'shuffle': True, 'seed': 0} INFO 2021-09-09 07:45:36,911 init.py: 155: Wrapping the dataloader to async device copies INFO 2021-09-09 07:45:38,228 misc.py: 86: Set start method of multiprocessing to forkserver INFO 2021-09-09 07:45:38,228 init.py: 91: Created the Distributed Sampler.... 
INFO 2021-09-09 07:45:38,228 init.py: 72: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 12, 'total_size': 12, 'shuffle': True, 'seed': 0} INFO 2021-09-09 07:45:38,228 init.py: 155: Wrapping the dataloader to async device copies INFO 2021-09-09 07:45:38,228 train_task.py: 419: Building model.... INFO 2021-09-09 07:45:38,229 resnext.py: 63: ResNeXT trunk, supports activation checkpointing. Deactivated INFO 2021-09-09 07:45:38,229 resnext.py: 83: Building model: ResNeXt50-1x64d-w1-BatchNorm2d INFO 2021-09-09 07:45:38,783 train_task.py: 591: Broadcast model BN buffers from master on every forward pass INFO 2021-09-09 07:45:38,783 classification_task.py: 359: Synchronized Batch Normalization is disabled INFO 2021-09-09 07:45:38,783 train_task.py: 340: Building loss... INFO 2021-09-09 07:45:38,810 optimizer_helper.py: 157: Trainable params: 159, Non-Trainable params: 0, Trunk Regularized Parameters: 53, Trunk Unregularized Parameters 106, Head Regularized Parameters: 0, Head Unregularized Parameters: 0 Remaining Regularized Parameters: 0 INFO 2021-09-09 07:45:38,813 trainer_main.py: 241: Training 2 epochs. One epoch = 6 iterations INFO 2021-09-09 07:45:38,813 trainer_main.py: 243: Total 12 iterations for training INFO 2021-09-09 07:45:38,813 trainer_main.py: 244: Total 12 samples in one epoch INFO 2021-09-09 07:45:40,148 logger.py: 76: Thu Sep 9 07:45:38 2021
+-----------------------------------------------------------------------------+ | NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 A100-SXM4-40GB Off | 00000000:0E:00.0 Off | Off | | N/A 32C P0 61W / 400W | 1082MiB / 40536MiB | 2% Default | | | | Disabled | +-------------------------------+----------------------+----------------------+ | 1 A100-SXM4-40GB Off | 00000000:0F:00.0 Off | 0 | | N/A 31C P0 52W / 400W | 3MiB / 40536MiB | 0% Default | | | | Disabled | +-------------------------------+----------------------+----------------------+ | 2 A100-SXM4-40GB Off | 00000000:1F:00.0 Off | 0 | | N/A 32C P0 54W / 400W | 3MiB / 40536MiB | 0% Default | | | | Disabled | +-------------------------------+----------------------+----------------------+ | 3 A100-SXM4-40GB Off | 00000000:20:00.0 Off | 0 | | N/A 32C P0 52W / 400W | 3MiB / 40536MiB | 0% Default | | | | Disabled | +-------------------------------+----------------------+----------------------+ | 4 A100-SXM4-40GB Off | 00000000:B5:00.0 Off | 0 | | N/A 29C P0 51W / 400W | 3MiB / 40536MiB | 0% Default | | | | Disabled | +-------------------------------+----------------------+----------------------+ | 5 A100-SXM4-40GB Off | 00000000:B6:00.0 Off | 0 | | N/A 30C P0 56W / 400W | 3MiB / 40536MiB | 0% Default | | | | Disabled | +-------------------------------+----------------------+----------------------+ | 6 A100-SXM4-40GB Off | 00000000:CE:00.0 Off | 0 | | N/A 30C P0 51W / 400W | 3MiB / 40536MiB | 0% Default | | | | Disabled | +-------------------------------+----------------------+----------------------+ | 7 A100-SXM4-40GB Off | 00000000:CF:00.0 Off | 0 | | N/A 30C P0 50W / 400W | 3MiB / 40536MiB 
| 0% Default | | | | Disabled | +-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| +-----------------------------------------------------------------------------+
INFO 2021-09-09 07:45:40,154 trainer_main.py: 166: Model is: Classy <class 'vissl.models.base_ssl_model.BaseSSLMultiInputOutputModel'>: BaseSSLMultiInputOutputModel( (_heads): ModuleDict() (trunk): ResNeXt( (_feature_blocks): ModuleDict( (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv1_relu): ReLU(inplace=True) (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False) (layer1): Sequential( (0): Bottleneck( (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): 
Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer2): Sequential( (0): Bottleneck( (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, 
track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (3): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer3): Sequential( (0): Bottleneck( (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, 
affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (3): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (4): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (5): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): 
BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer4): Sequential( (0): Bottleneck( (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(<SUPPORTED_L4_STRIDE.two: 2>, <SUPPORTED_L4_STRIDE.two: 2>), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(<SUPPORTED_L4_STRIDE.two: 2>, <SUPPORTED_L4_STRIDE.two: 2>), bias=False) (1): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, 
momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (avgpool): AdaptiveAvgPool2d(output_size=(1, 1)) (flatten): Flatten() ) ) (heads): ModuleList() ) INFO 2021-09-09 07:45:40,156 trainer_main.py: 167: Loss is: CrossEntropyLoss() INFO 2021-09-09 07:45:40,156 trainer_main.py: 168: Starting training.... INFO 2021-09-09 07:45:40,156 init.py: 72: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 12, 'total_size': 12, 'shuffle': True, 'seed': 0} Traceback (most recent call last): File "run_distributed_engines.py", line 194, in
hydra_main(overrides=overrides)
File "run_distributed_engines.py", line 179, in hydra_main
hook_generator=default_hook_generator,
File "run_distributed_engines.py", line 123, in launch_distributed
hook_generator=hook_generator,
File "run_distributed_engines.py", line 166, in _distributed_worker
process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
File "run_distributed_engines.py", line 159, in process_main
hook_generator=hook_generator,
File "/opt/conda/lib/python3.6/site-packages/vissl/engines/train.py", line 102, in train_main
trainer.train()
File "/opt/conda/lib/python3.6/site-packages/vissl/trainer/trainer_main.py", line 171, in train
self._advance_phase(task) # advances task.phase_idx
File "/opt/conda/lib/python3.6/site-packages/vissl/trainer/trainer_main.py", line 286, in _advance_phase
phase_type, epoch=task.phase_idx, compute_start_iter=compute_start_iter
File "/opt/conda/lib/python3.6/site-packages/vissl/trainer/train_task.py", line 501, in recreate_data_iterator
self.data_iterator = iter(self.dataloaders[phase_type])
File "/opt/conda/lib/python3.6/site-packages/classy_vision/dataset/dataloader_async_gpu_wrapper.py", line 40, in __iter__
self.preload()
File "/opt/conda/lib/python3.6/site-packages/classy_vision/dataset/dataloader_async_gpu_wrapper.py", line 46, in preload
self.cache_next = next(self._iter)
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
return self._process_data(data)
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
data.reraise()
File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 195, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 73, in default_collate
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 73, in <dictcomp>
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 83, in default_collate
return [default_collate(samples) for samples in transposed]
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 83, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 85, in default_collate
raise TypeError(default_collate_err_msg_format.format(elem_type))
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'PIL.Image.Image'>
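This error is the symptom of the empty `TRANSFORMS` list: with no `ToTensor`-style transform applied, raw `PIL.Image` objects reach the collate function, which only knows how to stack tensors, numpy arrays, numbers, dicts, and lists. The following is a minimal sketch of that type check (hypothetical names, not PyTorch's actual implementation) to show why an un-transformed image batch fails exactly this way:

```python
# Hypothetical sketch of the collate type check; `sketch_collate` and
# `FakeImage` are illustration-only names, not PyTorch or VISSL code.

def sketch_collate(batch):
    elem = batch[0]
    if isinstance(elem, (int, float)):
        return list(batch)                     # numbers: keep as a list
    if isinstance(elem, dict):                 # dicts: collate per key
        return {k: sketch_collate([d[k] for d in batch]) for k in elem}
    if isinstance(elem, list):                 # lists: transpose, then collate
        return [sketch_collate(s) for s in zip(*batch)]
    raise TypeError(
        "default_collate: batch must contain tensors, numpy arrays, "
        f"numbers, dicts or lists; found {type(elem)}"
    )

class FakeImage:                               # stands in for PIL.Image.Image
    pass

try:
    # What the worker sees when TRANSFORMS is []: raw image objects.
    sketch_collate([{"data": [FakeImage()]}, {"data": [FakeImage()]}])
except TypeError as e:
    print(e)
```

With a proper transform pipeline the images would already be tensors by the time they hit the collate function, and no branch raises.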
_self_. See https://hydra.cc/docs/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
INFO 2021-09-09 07:21:39,264 __init__.py: 32: Provided Config has latest version: 1
INFO 2021-09-09 07:21:41,267 train.py: 66: Env set for rank: 5, dist_rank: 5
INFO 2021-09-09 07:21:41,267 misc.py: 86: Set start method of multiprocessing to forkserver
INFO 2021-09-09 07:21:41,267 train.py: 77: Setting seed....
INFO 2021-09-09 07:21:41,267 misc.py: 99: MACHINE SEED: 0
INFO 2021-09-09 07:21:41,270 train.py: 66: Env set for rank: 2, dist_rank: 2
INFO 2021-09-09 07:21:41,270 misc.py: 86: Set start method of multiprocessing to forkserver
INFO 2021-09-09 07:21:41,270 train.py: 77: Setting seed....
INFO 2021-09-09 07:21:41,270 misc.py: 99: MACHINE SEED: 0
INFO 2021-09-09 07:21:41,271 train.py: 66: Env set for rank: 1, dist_rank: 1
INFO 2021-09-09 07:21:41,271 misc.py: 86: Set start method of multiprocessing to forkserver
INFO 2021-09-09 07:21:41,271 train.py: 77: Setting seed....
INFO 2021-09-09 07:21:41,271 misc.py: 99: MACHINE SEED: 0
INFO 2021-09-09 07:21:41,279 train.py: 66: Env set for rank: 0, dist_rank: 0
INFO 2021-09-09 07:21:41,279 env.py: 41: BASH_ENV: /etc/bash.bashrc
INFO 2021-09-09 07:21:41,279 env.py: 41: CLICOLOR: 1
INFO 2021-09-09 07:21:41,279 env.py: 41: COCOAPI_VERSION: 2.0+nv0.4.0
INFO 2021-09-09 07:21:41,279 env.py: 41: CUBLAS_VERSION: 11.2.1.74
INFO 2021-09-09 07:21:41,279 env.py: 41: CUDA_CACHE_DISABLE: 1
INFO 2021-09-09 07:21:41,279 env.py: 41: CUDA_DRIVER_VERSION: 455.23.05
INFO 2021-09-09 07:21:41,279 env.py: 41: CUDA_VERSION: 11.1.0.024
INFO 2021-09-09 07:21:41,279 env.py: 41: CUDNN_VERSION: 8.0.4.30
INFO 2021-09-09 07:21:41,279 env.py: 41: CUFFT_VERSION: 10.3.0.74
INFO 2021-09-09 07:21:41,279 env.py: 41: CURAND_VERSION: 10.2.2.74
INFO 2021-09-09 07:21:41,279 env.py: 41: CUSOLVER_VERSION: 11.0.0.74
INFO 2021-09-09 07:21:41,279 env.py: 41: CUSPARSE_VERSION: 11.2.0.275
INFO 2021-09-09 07:21:41,280 env.py: 41: DALI_BUILD: 1608709
INFO 2021-09-09 07:21:41,280 env.py: 41: DALI_VERSION: 0.26.0
INFO 2021-09-09 07:21:41,280 env.py: 41: DLPROF_VERSION: 20.10
INFO 2021-09-09 07:21:41,280 env.py: 41: ENV: /etc/shinit_v2
INFO 2021-09-09 07:21:41,280 env.py: 41: GIT_PAGER: cat
INFO 2021-09-09 07:21:41,280 env.py: 41: HOME: /root
INFO 2021-09-09 07:21:41,280 env.py: 41: HOSTNAME: 45c76449232e
INFO 2021-09-09 07:21:41,280 env.py: 41: JPY_PARENT_PID: 35131
INFO 2021-09-09 07:21:41,280 env.py: 41: JUPYTER_PORT: 8888
INFO 2021-09-09 07:21:41,280 env.py: 41: LC_ALL: C.UTF-8
INFO 2021-09-09 07:21:41,280 env.py: 41: LD_LIBRARY_PATH: /usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
INFO 2021-09-09 07:21:41,280 env.py: 41: LESSCLOSE: /usr/bin/lesspipe %s %s
INFO 2021-09-09 07:21:41,280 env.py: 41: LESSOPEN:
INFO 2021-09-09 07:21:41,282 env.py: 41: TRT_VERSION: 7.2.1.4
INFO 2021-09-09 07:21:41,282 env.py: 41: WORLDSIZE: 8
INFO 2021-09-09 07:21:41,282 env.py: 41: : /opt/conda/bin/python3
INFO 2021-09-09 07:21:41,282 env.py: 41: _CUDA_COMPAT_PATH: /usr/local/cuda/compat
INFO 2021-09-09 07:21:41,282 misc.py: 86: Set start method of multiprocessing to forkserver
INFO 2021-09-09 07:21:41,282 train.py: 77: Setting seed....
INFO 2021-09-09 07:21:41,282 misc.py: 99: MACHINE SEED: 0
INFO 2021-09-09 07:21:41,307 train.py: 66: Env set for rank: 6, dist_rank: 6
INFO 2021-09-09 07:21:41,307 misc.py: 86: Set start method of multiprocessing to forkserver
INFO 2021-09-09 07:21:41,307 train.py: 77: Setting seed....
INFO 2021-09-09 07:21:41,307 misc.py: 99: MACHINE SEED: 0
INFO 2021-09-09 07:21:41,317 train.py: 66: Env set for rank: 7, dist_rank: 7
INFO 2021-09-09 07:21:41,317 misc.py: 86: Set start method of multiprocessing to forkserver
INFO 2021-09-09 07:21:41,317 train.py: 77: Setting seed....
INFO 2021-09-09 07:21:41,317 misc.py: 99: MACHINE SEED: 0
INFO 2021-09-09 07:21:41,318 train.py: 66: Env set for rank: 4, dist_rank: 4
INFO 2021-09-09 07:21:41,318 misc.py: 86: Set start method of multiprocessing to forkserver
INFO 2021-09-09 07:21:41,318 train.py: 77: Setting seed....
INFO 2021-09-09 07:21:41,318 misc.py: 99: MACHINE SEED: 0
INFO 2021-09-09 07:21:41,353 train.py: 66: Env set for rank: 3, dist_rank: 3
INFO 2021-09-09 07:21:41,353 misc.py: 86: Set start method of multiprocessing to forkserver
INFO 2021-09-09 07:21:41,353 train.py: 77: Setting seed....
INFO 2021-09-09 07:21:41,353 misc.py: 99: MACHINE SEED: 0
INFO 2021-09-09 07:21:42,589 tensorboard.py: 46: Tensorboard dir: ./checkpoints/tb_logs
INFO 2021-09-09 07:21:42,594 tensorboard_hook.py: 61: Setting up SSL Tensorboard Hook...
INFO 2021-09-09 07:21:42,594 tensorboard_hook.py: 67: Tensorboard config: log_params: True, log_params_freq: 310, log_params_gradients: True
INFO 2021-09-09 07:21:42,594 train_task.py: 192: Not using Automatic Mixed Precision
INFO 2021-09-09 07:21:42,595 trainer_main.py: 109: Using Distributed init method: tcp://localhost:41385, world_size: 8, rank: 1
INFO 2021-09-09 07:21:42,737 tensorboard.py: 46: Tensorboard dir: ./checkpoints/tb_logs
INFO 2021-09-09 07:21:42,740 tensorboard_hook.py: 61: Setting up SSL Tensorboard Hook...
INFO 2021-09-09 07:21:42,740 tensorboard_hook.py: 67: Tensorboard config: log_params: True, log_params_freq: 310, log_params_gradients: True
INFO 2021-09-09 07:21:42,741 train_task.py: 192: Not using Automatic Mixed Precision
INFO 2021-09-09 07:21:42,742 trainer_main.py: 109: Using Distributed init method: tcp://localhost:41385, world_size: 8, rank: 4
INFO 2021-09-09 07:21:42,797 tensorboard.py: 46: Tensorboard dir: ./checkpoints/tb_logs
INFO 2021-09-09 07:21:42,802 tensorboard_hook.py: 61: Setting up SSL Tensorboard Hook...
INFO 2021-09-09 07:21:42,802 tensorboard_hook.py: 67: Tensorboard config: log_params: True, log_params_freq: 310, log_params_gradients: True
INFO 2021-09-09 07:21:42,803 train_task.py: 192: Not using Automatic Mixed Precision
INFO 2021-09-09 07:21:42,804 trainer_main.py: 109: Using Distributed init method: tcp://localhost:41385, world_size: 8, rank: 3
INFO 2021-09-09 07:21:42,928 tensorboard.py: 46: Tensorboard dir: ./checkpoints/tb_logs
INFO 2021-09-09 07:21:42,930 tensorboard_hook.py: 61: Setting up SSL Tensorboard Hook...
INFO 2021-09-09 07:21:42,930 tensorboard_hook.py: 67: Tensorboard config: log_params: True, log_params_freq: 310, log_params_gradients: True
INFO 2021-09-09 07:21:42,930 train_task.py: 192: Not using Automatic Mixed Precision
INFO 2021-09-09 07:21:42,931 trainer_main.py: 109: Using Distributed init method: tcp://localhost:41385, world_size: 8, rank: 5
INFO 2021-09-09 07:21:42,948 hydra_config.py: 140: Training with config:
INFO 2021-09-09 07:21:42,954 tensorboard.py: 46: Tensorboard dir: ./checkpoints/tb_logs
INFO 2021-09-09 07:21:42,957 tensorboard_hook.py: 61: Setting up SSL Tensorboard Hook...
INFO 2021-09-09 07:21:42,957 tensorboard_hook.py: 67: Tensorboard config: log_params: True, log_params_freq: 310, log_params_gradients: True
INFO 2021-09-09 07:21:42,958 train_task.py: 192: Not using Automatic Mixed Precision
INFO 2021-09-09 07:21:42,958 trainer_main.py: 109: Using Distributed init method: tcp://localhost:41385, world_size: 8, rank: 2
INFO 2021-09-09 07:21:42,960 hydra_config.py: 144: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False, 'AUTO_RESUME': True, 'BACKEND': 'disk', 'CHECKPOINT_FREQUENCY': 1, 'CHECKPOINT_ITER_FREQUENCY': -1, 'DIR': './checkpoints', 'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1, 'OVERWRITE_EXISTING': False, 'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False}, 'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss', 'FEATURES': {'DATASET_NAME': '', 'DATA_PARTITION': 'TRAIN', 'LAYER_NAME': ''}, 'NUM_CLUSTERS': 16000, 'N_ITER': 50}, 'DATA': {'DDP_BUCKET_CAP_MB': 25, 'ENABLE_ASYNC_GPU_COPY': True, 'NUM_DATALOADER_WORKERS': 4, 'PIN_MEMORY': True, 'TEST': {'BATCHSIZE_PER_REPLICA': 256, 'COLLATE_FUNCTION': 'default_collate', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['imagenet1k_folder'], 'DATA_LIMIT': -1, 'DATA_PATHS': [], 'DATA_SOURCES': [], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE':
'sample_index', 'MMAP_MODE': True, 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [], 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}, 'TRAIN': {'BATCHSIZE_PER_REPLICA': 256, 'COLLATE_FUNCTION': 'default_collate', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['imagenet1k_folder'], 'DATA_LIMIT': -1, 'DATA_PATHS': [], 'DATA_SOURCES': ['synthetic'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [], 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}}, 'DISTRIBUTED': {'BACKEND': 'nccl', 'BROADCAST_BUFFERS': True, 'INIT_METHOD': 'tcp', 'MANUAL_GRADIENT_REDUCTION': False, 'NCCL_DEBUG': False, 'NCCL_SOCKET_NTHREADS': '', 'NUM_NODES': 1, 'NUM_PROC_PER_NODE': 8, 'RUN_ID': 'auto'}, 'IMG_RETRIEVAL': {'DATASET_PATH': '', 'EVAL_BINARY_PATH': '', 'EVAL_DATASET_NAME': 'Paris', 'FEATS_PROCESSING_TYPE': '', 'GEM_POOL_POWER': 4.0, 'N_PCA': 512, 'RESIZE_IMG': 1024, 'SHOULD_TRAIN_PCA_OR_WHITENING': True, 'SPATIAL_LEVELS': 3, 'TEMP_DIR': '/tmp/instance_retrieval/', 'TRAIN_DATASET_NAME': 'Oxford', 'WHITEN_IMG_LIST': ''}, 'LOG_FREQUENCY': 10, 'LOSS': {'CrossEntropyLoss': {'ignore_index': -1}, 'bce_logits_multiple_output_single_target': {'normalize_output': False, 'reduction': 'none', 'world_size': 1}, 'cross_entropy_multiple_output_single_target': {'ignore_index': -1, 'normalize_output': False, 'reduction': 'mean', 'temperature': 1.0, 'weight': None}, 'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 256, 'DROP_LAST': True, 'kmeans_iters': 10, 'memory_params': {'crops_for_mb': [0], 'embedding_dim': 128}, 'num_clusters': [3000, 3000, 3000], 'num_crops': 2, 'num_train_samples': -1, 'temperature': 0.1}, 'moco_loss': {'embedding_dim': 128, 'momentum': 0.999, 'queue_size': 65536, 'temperature': 0.2}, 'multicrop_simclr_info_nce_loss': {'buffer_params': 
{'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'num_crops': 2, 'temperature': 0.1}, 'name': 'CrossEntropyLoss', 'nce_loss_with_memory': {'loss_type': 'nce', 'loss_weights': [1.0], 'memory_params': {'embedding_dim': 128, 'memory_size': -1, 'momentum': 0.5, 'norm_init': True, 'update_mem_on_forward': True}, 'negative_sampling_params': {'num_negatives': 16000, 'type': 'random'}, 'norm_constant': -1, 'norm_embedding': True, 'num_train_samples': -1, 'temperature': 0.07, 'update_mem_with_emb_index': -100}, 'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'temperature': 0.1}, 'swav_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'output_dir': '', 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temp_hard_assignment_iters': 0, 'temperature': 0.1, 'use_double_precision': False}, 'swav_momentum_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'momentum': 0.99, 'momentum_eval_mode_iter_start': 0, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temperature': 0.1, 'use_double_precision': False}}, 'MACHINE': {'DEVICE': 'gpu'}, 'METERS': {'accuracy_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'enable_training_meter': True, 'mean_ap_list_meter': {'max_cpu_capacity': -1, 'meter_names': [], 'num_classes': 9605, 'num_meters': 1}, 'name': ''}, 'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2, 'USE_ACTIVATION_CHECKPOINTING': False}, 'AMP_PARAMS': {'AMP_ARGS': {'keep_batchnorm_fp32': True, 'loss_scale': 'dynamic', 'master_weights': True, 'opt_level': 'O1'}, 'AMP_TYPE': 'apex', 'USE_AMP': False}, 'CUDA_CACHE': {'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100}, 
'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False, 'EVAL_TRUNK_AND_HEAD': False, 'EXTRACT_TRUNK_FEATURES_ONLY': False, 'FREEZE_TRUNK_AND_HEAD': False, 'FREEZE_TRUNK_ONLY': False, 'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [], 'SHOULD_FLATTEN_FEATS': True}, 'HEAD': {'BATCHNORM_EPS': 1e-05, 'BATCHNORM_MOMENTUM': 0.1, 'PARAMS': [], 'PARAMS_MULTIPLIER': 1.0}, 'INPUT_TYPE': 'rgb', 'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False, 'INPUT_SHAPE': [3, 224, 224]}, 'MULTI_INPUT_HEAD_MAPPING': [], 'NON_TRAINABLE_PARAMS': [], 'SINGLE_PASS_EVERY_CROP': False, 'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': False, 'GROUP_SIZE': -1, 'SYNC_BN_TYPE': 'pytorch'}, 'TEMP_FROZEN_PARAMS_ITER_MAP': [], 'TRUNK': {'NAME': 'resnet', 'RESNETS': {'DEPTH': 50}, 'TRUNK_PARAMS': {'EFFICIENT_NETS': {}, 'REGNET': {}, 'RESNETS': {'DEPTH': 50, 'GROUPS': 1, 'LAYER4_STRIDE': 2, 'NORM': 'BatchNorm', 'WIDTH_MULTIPLIER': 1, 'WIDTH_PER_GROUP': 64, 'ZERO_INIT_RESIDUAL': False}}}, 'WEIGHTS_INIT': {'APPEND_PREFIX': '', 'PARAMS_FILE': '', 'REMOVE_PREFIX': '', 'SKIP_LAYERS': ['num_batches_tracked'], 'STATE_DICT_KEY_NAME': 'classy_state_dict'}}, 'MONITOR_PERF_STATS': False, 'MULTI_PROCESSING_METHOD': 'forkserver', 'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200}, 'OPTIMIZER': {'head_optimizer_params': {'use_different_lr': False, 'use_different_wd': False, 'weight_decay': 0.0001}, 'larc_config': {'clip': False, 'eps': 1e-08, 'trust_coefficient': 0.001}, 'momentum': 0.9, 'name': 'sgd', 'nesterov': False, 'num_epochs': 90, 'param_schedulers': {'lr': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.1}, 'end_value': 0.0, 'interval_scaling': [], 'lengths': [], 'milestones': [30, 60], 'name': 'multistep', 'schedulers': [], 'start_value': 0.1, 'update_interval': 'epoch', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}, 'lr_head': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.1}, 'end_value': 0.0, 'interval_scaling': [], 'lengths': [], 
'milestones': [30, 60], 'name': 'multistep', 'schedulers': [], 'start_value': 0.1, 'update_interval': 'epoch', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}}, 'regularize_bias': True, 'regularize_bn': False, 'use_larc': False, 'weight_decay': 0.0001}, 'PERF_STAT_FREQUENCY': -1, 'ROLLING_BTIME_FREQ': -1, 'SEED_VALUE': 0, 'SVM': {'cls_list': [], 'costs': {'base': -1.0, 'costs_list': [0.1, 0.01], 'power_range': [4, 20]}, 'cross_val_folds': 3, 'dual': True, 'force_retrain': False, 'loss': 'squared_hinge', 'low_shot': {'dataset_name': 'voc', 'k_values': [1, 2, 4, 8, 16, 32, 64, 96], 'sample_inds': [1, 2, 3, 4, 5]}, 'max_iter': 2000, 'normalize': True, 'penalty': 'l2'}, 'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'tensorboard', 'FLUSH_EVERY_N_MIN': 5, 'LOG_DIR': '.', 'LOG_PARAMS': True, 'LOG_PARAMS_EVERY_N_ITERS': 310, 'LOG_PARAMS_GRADIENTS': True, 'USE_TENSORBOARD': True}, 'TEST_EVERY_NUM_EPOCH': 1, 'TEST_MODEL': True, 'TEST_ONLY': False, 'TRAINER': {'TASK_NAME': 'self_supervision_task', 'TRAIN_STEP_NAME': 'standard_train_step'}, 'VERBOSE': False}
INFO 2021-09-09 07:21:42,970 tensorboard.py: 46: Tensorboard dir: ./checkpoints/tb_logs
INFO 2021-09-09 07:21:42,972 tensorboard.py: 46: Tensorboard dir: ./checkpoints/tb_logs
INFO 2021-09-09 07:21:42,974 tensorboard_hook.py: 61: Setting up SSL Tensorboard Hook...
INFO 2021-09-09 07:21:42,974 tensorboard_hook.py: 67: Tensorboard config: log_params: True, log_params_freq: 310, log_params_gradients: True
INFO 2021-09-09 07:21:42,975 train_task.py: 192: Not using Automatic Mixed Precision
INFO 2021-09-09 07:21:42,975 tensorboard_hook.py: 61: Setting up SSL Tensorboard Hook...
INFO 2021-09-09 07:21:42,975 tensorboard_hook.py: 67: Tensorboard config: log_params: True, log_params_freq: 310, log_params_gradients: True
INFO 2021-09-09 07:21:42,975 trainer_main.py: 109: Using Distributed init method: tcp://localhost:41385, world_size: 8, rank: 6
INFO 2021-09-09 07:21:42,976 train_task.py: 192: Not using Automatic Mixed Precision
INFO 2021-09-09 07:21:42,977 trainer_main.py: 109: Using Distributed init method: tcp://localhost:41385, world_size: 8, rank: 7
INFO 2021-09-09 07:21:43,838 train.py: 89: System config:
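Note that this run, like the others in the thread, composes its config without error but ends up with `'TRANSFORMS': []` under Hydra 1.1. Since the resolution is to pin hydra-core to 1.0.7 (`pip install hydra-core==1.0.7`), a fail-fast version guard would surface the problem earlier. The sketch below is hypothetical (not part of VISSL); `check_hydra` and `parse_version` are illustration-only names, and the version comparison only handles plain `x.y.z` strings:

```python
# Hypothetical guard that rejects an incompatible hydra-core version up
# front, instead of silently composing an empty TRANSFORMS list.

def parse_version(v):
    # "1.0.7" -> (1, 0, 7); good enough for plain x.y.z version strings
    return tuple(int(p) for p in v.split("."))

def check_hydra(installed, max_supported="1.0.7"):
    if parse_version(installed) > parse_version(max_supported):
        raise RuntimeError(
            f"hydra-core {installed} is unsupported; "
            f"pin it with: pip install hydra-core=={max_supported}"
        )

check_hydra("1.0.7")      # ok: the version the thread recommends
# check_hydra("1.1.1")    # would raise RuntimeError
```

In practice the installed version could be read with `importlib.metadata.version("hydra-core")` before training starts.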
PyTorch built with:
CPU info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 76
On-line CPU(s) list: 0-75
Thread(s) per core: 1
Core(s) per socket: 38
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Genuine Intel(R) CPU $0000%@
Stepping: 6
CPU MHz: 1200.000
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 48K
L1i cache: 32K
L2 cache: 1280K
L3 cache: 58368K
NUMA node0 CPU(s): 0-37
NUMA node1 CPU(s): 38-75
INFO 2021-09-09 07:21:43,838 tensorboard.py: 46: Tensorboard dir: ./checkpoints/tb_logs
INFO 2021-09-09 07:21:43,840 tensorboard_hook.py: 61: Setting up SSL Tensorboard Hook...
INFO 2021-09-09 07:21:43,840 tensorboard_hook.py: 67: Tensorboard config: log_params: True, log_params_freq: 310, log_params_gradients: True
INFO 2021-09-09 07:21:43,841 train_task.py: 192: Not using Automatic Mixed Precision
INFO 2021-09-09 07:21:43,841 trainer_main.py: 109: Using Distributed init method: tcp://localhost:41385, world_size: 8, rank: 0
INFO 2021-09-09 07:21:43,933 trainer_main.py: 130: | initialized host 45c76449232e as rank 5 (5)
INFO 2021-09-09 07:21:43,961 trainer_main.py: 130: | initialized host 45c76449232e as rank 2 (2)
INFO 2021-09-09 07:21:43,978 trainer_main.py: 130: | initialized host 45c76449232e as rank 6 (6)
INFO 2021-09-09 07:21:43,980 trainer_main.py: 130: | initialized host 45c76449232e as rank 7 (7)
Traceback (most recent call last):
File "run_distributed_engines.py", line 194, in <module>
hydra_main(overrides=overrides)
File "run_distributed_engines.py", line 179, in hydra_main
hook_generator=default_hook_generator,
File "run_distributed_engines.py", line 112, in launch_distributed
daemon=False,
File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 7 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/workspace/run_distributed_engines.py", line 166, in _distributed_worker
process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
File "/workspace/run_distributed_engines.py", line 159, in process_main
hook_generator=hook_generator,
File "/opt/conda/lib/python3.6/site-packages/vissl/engines/train.py", line 102, in train_main
trainer.train()
File "/opt/conda/lib/python3.6/site-packages/vissl/trainer/trainer_main.py", line 155, in train
self.task.prepare(pin_memory=self.cfg.DATA.PIN_MEMORY)
File "/opt/conda/lib/python3.6/site-packages/vissl/trainer/train_task.py", line 630, in prepare
self.dataloaders = self.build_dataloaders(pin_memory=pin_memory)
File "/opt/conda/lib/python3.6/site-packages/vissl/trainer/train_task.py", line 283, in build_dataloaders
self.datasets, self.data_and_label_keys = self.build_datasets()
File "/opt/conda/lib/python3.6/site-packages/vissl/trainer/train_task.py", line 273, in build_datasets
datasets[split] = build_dataset(self.config, split)
File "/opt/conda/lib/python3.6/site-packages/vissl/data/__init__.py", line 44, in build_dataset
dataset = GenericSSLDataset(cfg, split, DATASET_SOURCE_MAP)
File "/opt/conda/lib/python3.6/site-packages/vissl/data/ssl_dataset.py", line 78, in __init__
self._get_data_files(split)
File "/opt/conda/lib/python3.6/site-packages/vissl/data/ssl_dataset.py", line 126, in _get_data_files
split, dataset_config=self.cfg["DATA"]
File "/opt/conda/lib/python3.6/site-packages/vissl/data/dataset_catalog.py", line 258, in get_data_files
), "len(data_sources) != len(dataset_names)"
AssertionError: len(data_sources) != len(dataset_names)
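This second failure mode is the same root cause showing up earlier: the badly composed config has inconsistent list lengths across its data options. In the config dumped above, the TEST split has `'DATA_SOURCES': []` but `'DATASET_NAMES': ['imagenet1k_folder']`, which trips the length check in `vissl/data/dataset_catalog.py`. A minimal sketch of that check (hypothetical simplification, not the actual VISSL function body):

```python
# Sketch of the consistency check behind the AssertionError above. The real
# check lives in vissl/data/dataset_catalog.py (get_data_files); this
# simplified version only illustrates the list-length invariant.

def get_data_files(split_cfg):
    data_sources = split_cfg["DATA_SOURCES"]
    dataset_names = split_cfg["DATASET_NAMES"]
    # Each data source must pair with exactly one dataset name.
    assert len(data_sources) == len(dataset_names), \
        "len(data_sources) != len(dataset_names)"
    return list(zip(data_sources, dataset_names))

# Values taken from the config dump in this comment:
train_split = {"DATA_SOURCES": ["synthetic"],
               "DATASET_NAMES": ["imagenet1k_folder"]}
test_split = {"DATA_SOURCES": [],
              "DATASET_NAMES": ["imagenet1k_folder"]}

get_data_files(train_split)    # passes: one source per dataset name
# get_data_files(test_split)   # raises AssertionError, as in the log
```

Downgrading to hydra-core 1.0.7 restores the intended composition, so both lists come out with matching lengths.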