Model not training, training is stuck without errors

kulkarnivishal commented 2 years ago

Thank you for this repository

I'm trying to train Deep Cluster V2 on custom dataset that I successfully registered. However when I initiate training, the output logs are stuck at initialized host ... as rank 0. There's no error but no progress either.

Tried installing from source and installing through pip
I'm training the model in a Kubeflow notebook
Python 3.8 Pytorch 1.8.1 + cu 11.1
Tried training SimCLR on same data following the tutorial exactly and facing same issue
However training works successfully on CPU only Kubeflow notebook but unable to train Deep Cluster V2 and could only train the models for which cpu test configs are available

Please help. I've attached output logs below -

Instructions To Reproduce the Issue:

full code you wrote or full changes you made (git diff) No Changes

what exact command you run:

! python tools/run_distributed_engines.py config=test/integration_test/quick_deepcluster_v2.yaml \
config.DATA.TRAIN.DATA_SOURCES=[disk_folder] \
config.DATA.TRAIN.DATASET_NAMES=[sample_crowley_passport] \
config.DATA.TRAIN.DATA_PATHS=["/home/jovyan/Sampled/train"] \
config.DATA.TEST.DATA_SOURCES=[disk_folder] \
config.DATA.TEST.DATASET_NAMES=[sample_crowley_passport] \
config.DATA.TEST.DATA_PATHS=["/home/jovyan/Sampled/test"] \
config.CHECKPOINT.DIR="./checkpoints" \
config.OPTIMIZER.num_epochs=50 \
config.DISTRIBUTED.NUM_PROC_PER_NODE=1 \
config.VERBOSE=true \
config.LOG_FREQUENCY=3

sample_crowley_passport is the registered custom dataset

full logs you observed:


** fvcore version of PathManager will be deprecated soon. **
** Please migrate to the version in iopath repo. **
https://github.com/facebookresearch/iopath

####### overrides: ['config=test/integration_test/quick_deepcluster_v2.yaml', 'config.DATA.TRAIN.DATA_SOURCES=[disk_folder]', 'config.DATA.TRAIN.DATASET_NAMES=[sample_crowley_passport]', 'config.DATA.TRAIN.DATA_PATHS=[/home/jovyan/Sampled/train]', 'config.DATA.TEST.DATA_SOURCES=[disk_folder]', 'config.DATA.TEST.DATASET_NAMES=[sample_crowley_passport]', 'config.DATA.TEST.DATA_PATHS=[/home/jovyan/Sampled/test]', 'config.CHECKPOINT.DIR=./checkpoints', 'config.OPTIMIZER.num_epochs=50', 'config.DISTRIBUTED.NUM_PROC_PER_NODE=1', 'config.VERBOSE=true', 'config.LOG_FREQUENCY=3'] INFO 2022-05-16 18:52:36,186 distributed_launcher.py: 183: Spawning process for node_id: 0, local_rank: 0, dist_rank: 0, dist_run_id: localhost:46849

<Bunch of port logs>

INFO 2022-05-16 18:52:36,191 env.py: 50: _: /opt/conda/bin/python INFO 2022-05-16 18:52:36,191 misc.py: 161: Set start method of multiprocessing to forkserver INFO 2022-05-16 18:52:36,191 train.py: 105: Setting seed.... INFO 2022-05-16 18:52:36,191 misc.py: 173: MACHINE SEED: 0 INFO 2022-05-16 18:52:36,193 hydra_config.py: 132: Training with config: INFO 2022-05-16 18:52:36,201 hydra_config.py: 141: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False, 'AUTO_RESUME': True, 'BACKEND': 'disk', 'CHECKPOINT_FREQUENCY': 5, 'CHECKPOINT_ITER_FREQUENCY': -1, 'DIR': './checkpoints', 'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1, 'OVERWRITE_EXISTING': True, 'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False}, 'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss', 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'SEED': 0}, 'FEATURES': {'DATASET_NAME': '', 'DATA_PARTITION': 'TRAIN', 'DIMENSIONALITY_REDUCTION': 0, 'EXTRACT': False, 'LAYER_NAME': '', 'PATH': '.', 'TEST_PARTITION': 'TEST'}, 'NUM_CLUSTERS': 16000, 'NUM_ITER': 50, 'OUTPUT_DIR': '.'}, 'DATA': {'DDP_BUCKET_CAP_MB': 25, 'ENABLE_ASYNC_GPU_COPY': True, 'NUM_DATALOADER_WORKERS': 5, 'PIN_MEMORY': True, 'TEST': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 256, 'COLLATE_FUNCTION': 'default_collate', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['sample_crowley_passport'], 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': ['/home/jovyan/Sampled/test'], 'DATA_SOURCES': ['disk_folder'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}, 'TRAIN': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 16, 'COLLATE_FUNCTION': 'multicrop_collator', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '/tmp/imagenet1k/', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['sample_crowley_passport'], 'DATA_LIMIT': 250, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': ['/home/jovyan/Sampled/train'], 'DATA_SOURCES': ['disk_folder'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': True, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [{'crop_scales': [[0.14, 1]], 'name': 'ImgPilToMultiCrop', 'num_crops': [2], 'size_crops': [224], 'total_num_crops': 2}, {'name': 'RandomHorizontalFlip', 'p': 0.5}, {'name': 'ImgPilColorDistortion', 'strength': 1.0}, {'name': 'ImgPilGaussianBlur', 'p': 0.5, 'radius_max': 2.0, 'radius_min': 0.1}, {'name': 'ToTensor'}, {'mean': [0.485, 0.456, 0.406], 'name': 'Normalize', 'std': [0.229, 0.224, 0.225]}], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}}, 'DISTRIBUTED': {'BACKEND': 'nccl', 'BROADCAST_BUFFERS': True, 'DISTR_ON': True, 'INIT_METHOD': 'tcp', 'MANUAL_GRADIENT_REDUCTION': False, 'NCCL_DEBUG': False, 'NCCL_SOCKET_NTHREADS': '', 'NUM_NODES': 1, 'NUM_PROC_PER_NODE': 1, 'RUN_ID': 'auto'}, 'EXTRACT_FEATURES': {'CHUNK_THRESHOLD': 0, 'OUTPUT_DIR': ''}, 'HOOKS': {'CHECK_NAN': True, 'LOG_GPU_STATS': True, 'MEMORY_SUMMARY': {'DUMP_MEMORY_ON_EXCEPTION': False, 'LOG_ITERATION_NUM': 0, 'PRINT_MEMORY_SUMMARY': True}, 'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False, 'INPUT_SHAPE': [3, 224, 224]}, 'PERF_STATS': {'MONITOR_PERF_STATS': True, 'PERF_STAT_FREQUENCY': 40, 'ROLLING_BTIME_FREQ': 5}, 'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'tensorboard', 'FLUSH_EVERY_N_MIN': 5, 'LOG_DIR': '.', 'LOG_PARAMS': True, 'LOG_PARAMS_EVERY_N_ITERS': 310, 'LOG_PARAMS_GRADIENTS': True, 'USE_TENSORBOARD': False}}, 'IMG_RETRIEVAL': {'CROP_QUERY_ROI': False, 'DATASET_PATH': '', 'DEBUG_MODE': False, 'EVAL_BINARY_PATH': '', 'EVAL_DATASET_NAME': 'Paris', 'FEATS_PROCESSING_TYPE': '', 'GEM_POOL_POWER': 4.0, 'IMG_SCALINGS': [1], 'NORMALIZE_FEATURES': True, 'NUM_DATABASE_SAMPLES': -1, 'NUM_QUERY_SAMPLES': -1, 'NUM_TRAINING_SAMPLES': -1, 'N_PCA': 512, 'RESIZE_IMG': 1024, 'SAVE_FEATURES': False, 'SAVE_RETRIEVAL_RANKINGS_SCORES': True, 'SIMILARITY_MEASURE': 'cosine_similarity', 'SPATIAL_LEVELS': 3, 'TRAIN_DATASET_NAME': 'Oxford', 'TRAIN_PCA_WHITENING': True, 'USE_DISTRACTORS': False, 'WHITEN_IMG_LIST': ''}, 'LOG_FREQUENCY': 3, 'LOSS': {'CrossEntropyLoss': {'ignore_index': -1}, 'barlow_twins_loss': {'embeddingdim': 8192, 'lambda': 0.0051, 'scale_loss': 0.024}, 'bce_logits_multiple_output_single_target': {'normalize_output': False, 'reduction': 'none', 'world_size': 1}, 'cross_entropy_multiple_output_single_target': {'ignore_index': -1, 'normalize_output': False, 'reduction': 'mean', 'temperature': 1.0, 'weight': None}, 'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 16, 'DROP_LAST': True, 'kmeans_iters': 10, 'memory_params': {'crops_for_mb': [0, 1], 'embedding_dim': 128}, 'num_clusters': [10, 10, 10], 'num_crops': 2, 'num_train_samples': 500, 'temperature': 0.1}, 'dino_loss': {'crops_for_teacher': [0, 1], 'ema_center': 0.9, 'momentum': 0.996, 'normalize_last_layer': True, 'output_dim': 65536, 'student_temp': 0.1, 'teacher_temp_max': 0.07, 'teacher_temp_min': 0.04, 'teacher_temp_warmup_iters': 37500}, 'moco_loss': {'embedding_dim': 128, 'momentum': 0.999, 'queue_size': 65536, 'temperature': 0.2}, 'multicrop_simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'num_crops': 2, 'temperature': 0.1}, 'name': 'deepclusterv2_loss', 'nce_loss_with_memory': {'loss_type': 'nce', 'loss_weights': [1.0], 'memory_params': {'embedding_dim': 128, 'memory_size': -1, 'momentum': 0.5, 'norm_init': True, 'update_mem_on_forward': True}, 'negative_sampling_params': {'num_negatives': 16000, 'type': 'random'}, 'norm_constant': -1, 'norm_embedding': True, 'num_train_samples': -1, 'temperature': 0.07, 'update_mem_with_emb_index': -100}, 'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'temperature': 0.1}, 'swav_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'output_dir': '.', 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temp_hard_assignment_iters': 0, 'temperature': 0.1, 'use_double_precision': False}, 'swav_momentum_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'momentum': 0.99, 'momentum_eval_mode_iter_start': 0, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temperature': 0.1, 'use_double_precision': False}}, 'MACHINE': {'DEVICE': 'gpu'}, 'METERS': {'accuracy_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'enable_training_meter': True, 'mean_ap_list_meter': {'max_cpu_capacity': -1, 'meter_names': [], 'num_classes': 9605, 'num_meters': 1}, 'model_output_mask': False, 'name': '', 'names': [], 'precision_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'recall_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}}, 'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2, 'USE_ACTIVATION_CHECKPOINTING': False}, 'AMP_PARAMS': {'AMP_ARGS': {'opt_level': 'O1'}, 'AMP_TYPE': 'apex', 'USE_AMP': False}, 'BASE_MODEL_NAME': 'multi_input_output_model', 'CUDA_CACHE': {'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100}, 'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False, 'EVAL_TRUNK_AND_HEAD': False, 'EXTRACT_TRUNK_FEATURES_ONLY': False, 'FREEZE_TRUNK_AND_HEAD': False, 'FREEZE_TRUNK_ONLY': False, 'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [], 'SHOULD_FLATTEN_FEATS': True}, 'FSDP_CONFIG': {'AUTO_WRAP_THRESHOLD': 0, 'bucket_cap_mb': 0, 'clear_autocast_cache': True, 'compute_dtype': torch.float32, 'flatten_parameters': True, 'fp32_reduce_scatter': False, 'mixed_precision': True, 'verbose': True}, 'GRAD_CLIP': {'MAX_NORM': 1, 'NORM_TYPE': 2, 'USE_GRAD_CLIP': False}, 'HEAD': {'BATCHNORM_EPS': 1e-05, 'BATCHNORM_MOMENTUM': 0.1, 'PARAMS': [['mlp', {'dims': [2048, 2048], 'skip_last_layer_relu_bn': False, 'use_relu': True}], ['mlp', {'dims': [2048, 128]}]], 'PARAMS_MULTIPLIER': 1.0}, 'INPUT_TYPE': 'rgb', 'MULTI_INPUT_HEAD_MAPPING': [], 'NON_TRAINABLE_PARAMS': [], 'SHARDED_DDP_SETUP': {'USE_SDP': False, 'reduce_buffer_size': -1}, 'SINGLE_PASS_EVERY_CROP': False, 'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': True, 'GROUP_SIZE': -1, 'SYNC_BN_TYPE': 'pytorch'}, 'TEMP_FROZEN_PARAMS_ITER_MAP': [], 'TRUNK': {'CONVIT': {'CLASS_TOKEN_IN_LOCAL_LAYERS': False, 'LOCALITY_DIM': 10, 'LOCALITY_STRENGTH': 1.0, 'N_GPSA_LAYERS': 10, 'USE_LOCAL_INIT': True}, 'EFFICIENT_NETS': {}, 'NAME': 'resnet', 'REGNET': {}, 'RESNETS': {'DEPTH': 50, 'GROUPNORM_GROUPS': 32, 'GROUPS': 1, 'LAYER4_STRIDE': 2, 'NORM': 'BatchNorm', 'STANDARDIZE_CONVOLUTIONS': False, 'WIDTH_MULTIPLIER': 1, 'WIDTH_PER_GROUP': 64, 'ZERO_INIT_RESIDUAL': False}, 'VISION_TRANSFORMERS': {'ATTENTION_DROPOUT_RATE': 0, 'CLASSIFIER': 'token', 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0, 'HIDDEN_DIM': 768, 'IMAGE_SIZE': 224, 'MLP_DIM': 3072, 'NUM_HEADS': 12, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': False, 'QK_SCALE': False, 'name': None}, 'XCIT': {'ATTENTION_DROPOUT_RATE': 0, 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0.05, 'ETA': 1, 'HIDDEN_DIM': 384, 'IMAGE_SIZE': 224, 'NUM_HEADS': 8, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': True, 'QK_SCALE': False, 'TOKENS_NORM': True, 'name': None}}, 'WEIGHTS_INIT': {'APPEND_PREFIX': '', 'PARAMS_FILE': '', 'REMOVE_PREFIX': '', 'SKIP_LAYERS': ['num_batches_tracked'], 'STATE_DICT_KEY_NAME': 'classy_state_dict'}, '_MODEL_INIT_SEED': 0}, 'MONITORING': {'MONITOR_ACTIVATION_STATISTICS': 0}, 'MULTI_PROCESSING_METHOD': 'forkserver', 'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200}, 'OPTIMIZER': {'betas': [0.9, 0.999], 'construct_single_param_group_only': False, 'head_optimizer_params': {'use_different_lr': False, 'use_different_wd': False, 'weight_decay': 1e-06}, 'larc_config': {'clip': False, 'eps': 1e-08, 'trust_coefficient': 0.001}, 'momentum': 0.9, 'name': 'sgd', 'nesterov': False, 'non_regularized_parameters': [], 'num_epochs': 50, 'param_schedulers': {'lr': {'auto_lr_scaling': {'auto_scale': True, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': [], 'lengths': [], 'milestones': [30, 60], 'name': 'cosine', 'schedulers': [], 'start_value': 0.01875, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}, 'lr_head': {'auto_lr_scaling': {'auto_scale': True, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': [], 'lengths': [], 'milestones': [30, 60], 'name': 'cosine', 'schedulers': [], 'start_value': 0.01875, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}}, 'regularize_bias': True, 'regularize_bn': True, 'use_larc': True, 'use_zero': False, 'weight_decay': 1e-06}, 'PROFILING': {'MEMORY_PROFILING': {'TRACK_BY_LAYER_MEMORY': False}, 'NUM_ITERATIONS': 10, 'OUTPUT_FOLDER': '.', 'PROFILED_RANKS': [0, 1], 'RUNTIME_PROFILING': {'LEGACY_PROFILER': False, 'PROFILE_CPU': True, 'PROFILE_GPU': True, 'USE_PROFILER': False}, 'START_ITERATION': 0, 'STOP_TRAINING_AFTER_PROFILING': False, 'WARMUP_ITERATIONS': 0}, 'REPRODUCIBILITY': {'CUDDN_DETERMINISTIC': False}, 'SEED_VALUE': 0, 'SLURM': {'ADDITIONAL_PARAMETERS': {}, 'COMMENT': 'vissl job', 'CONSTRAINT': '', 'LOG_FOLDER': '.', 'MEM_GB': 250, 'NAME': 'vissl', 'NUM_CPU_PER_PROC': 8, 'PARTITION': '', 'PORT_ID': 40050, 'TIME_HOURS': 72, 'TIME_MINUTES': 0, 'USE_SLURM': False}, 'SVM': {'cls_list': [], 'costs': {'base': -1.0, 'costs_list': [0.1, 0.01], 'power_range': [4, 20]}, 'cross_val_folds': 3, 'dual': True, 'force_retrain': False, 'loss': 'squared_hinge', 'low_shot': {'dataset_name': 'voc', 'k_values': [1, 2, 4, 8, 16, 32, 64, 96], 'sample_inds': [1, 2, 3, 4, 5]}, 'max_iter': 2000, 'normalize': True, 'penalty': 'l2'}, 'TEST_EVERY_NUM_EPOCH': 1, 'TEST_MODEL': False, 'TEST_ONLY': False, 'TRAINER': {'TASK_NAME': 'self_supervision_task', 'TRAIN_STEP_NAME': 'standard_train_step'}, 'VERBOSE': True} INFO 2022-05-16 18:52:36,801 train.py: 117: System config:

sys.platform linux Python 3.8.10	packaged by conda-forge	(default, May 11 2021, 07:01:05) [GCC 9.3.0] numpy 1.19.5 Pillow 9.0.1 vissl 0.1.6 @/home/jovyan/vissl/vissl GPU available True GPU 0 Tesla K80 CUDA_HOME /usr/local/cuda torchvision 0.9.1+cu101 @/opt/conda/lib/python3.8/site-packages/torchvision hydra 1.0.7 @/opt/conda/lib/python3.8/site-packages/hydra classy_vision 0.7.0.dev @/opt/conda/lib/python3.8/site-packages/classy_vision tensorboard 2.8.0 apex 0.1 @/opt/conda/lib/python3.8/site-packages/apex cv2 4.5.5 PyTorch 1.8.1+cu111 @/opt/conda/lib/python3.8/site-packages/torch PyTorch debug build False

PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
CuDNN 8.0.5
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

CPU info:

Architecture x86_64 CPU op-mode(s) 32-bit, 64-bit Byte Order Little Endian Address sizes 46 bits physical, 48 bits virtual CPU(s) 8 On-line CPU(s) list 0-7 Thread(s) per core 2 Core(s) per socket 4 Socket(s) 1 NUMA node(s) 1 Vendor ID GenuineIntel CPU family 6 Model 79 Model name Intel(R) Xeon(R) CPU @ 2.20GHz Stepping 0 CPU MHz 2199.998 BogoMIPS 4399.99 Hypervisor vendor KVM Virtualization type full L1d cache 128 KiB L1i cache 128 KiB L2 cache 1 MiB L3 cache 55 MiB NUMA node0 CPU(s) 0-7 Vulnerability Itlb multihit Not affected Vulnerability L1tf Mitigation; PTE Inversion Vulnerability Mds Mitigation; Clear CPU buffers; SMT Host state unknown Vulnerability Meltdown Mitigation; PTI Vulnerability Spec store bypass Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1 Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2 Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling Vulnerability Srbds Not affected Vulnerability Tsx async abort Mitigation; Clear CPU buffers; SMT Host state unknown

INFO 2022-05-16 18:52:36,802 trainer_main.py: 112: Using Distributed init method: tcp://localhost:46849, world_size: 1, rank: 0 INFO 2022-05-16 18:52:36,803 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 0 INFO 2022-05-16 18:52:36,804 trainer_main.py: 130: | initialized host vishal-pyt-cu111-0 as rank 0 (0)

**Nothing after this**
## Expected behavior:

I'd expect the model to train or code to throw an error, if any. 

If there are no obvious error in "what you observed" provided above,
please tell us the expected behavior.

If you expect the model to converge / work better, note that we do not give suggestions
on how to train a new model.
Only in one of the two conditions, we will help with it:
(1) You're unable to reproduce the results in vissl model zoo.
(2) It indicates a vissl bug.

## Environment:

Provide your environment information using the following command:

sys.platform linux Python 3.8.10	packaged by conda-forge	(default, May 11 2021, 07:01:05) [GCC 9.3.0] numpy 1.19.5 Pillow 9.0.1 vissl 0.1.6 @/home/jovyan/vissl/vissl GPU available True GPU 0 Tesla K80 CUDA_HOME /usr/local/cuda torchvision 0.9.1+cu101 @/opt/conda/lib/python3.8/site-packages/torchvision hydra 1.0.7 @/opt/conda/lib/python3.8/site-packages/hydra classy_vision 0.7.0.dev @/opt/conda/lib/python3.8/site-packages/classy_vision tensorboard 2.8.0 apex 0.1 @/opt/conda/lib/python3.8/site-packages/apex cv2 4.5.5 PyTorch 1.8.1+cu111 @/opt/conda/lib/python3.8/site-packages/torch PyTorch debug build False

PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
CuDNN 8.0.5
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

CPU info:

Architecture x86_64 CPU op-mode(s) 32-bit, 64-bit Byte Order Little Endian Address sizes 46 bits physical, 48 bits virtual CPU(s) 8 On-line CPU(s) list 0-7 Thread(s) per core 2 Core(s) per socket 4 Socket(s) 1 NUMA node(s) 1 Vendor ID GenuineIntel CPU family 6 Model 79 Model name Intel(R) Xeon(R) CPU @ 2.20GHz Stepping 0 CPU MHz 2199.998 BogoMIPS 4399.99 Hypervisor vendor KVM Virtualization type full L1d cache 128 KiB L1i cache 128 KiB L2 cache 1 MiB L3 cache 55 MiB NUMA node0 CPU(s) 0-7 Vulnerability Itlb multihit Not affected Vulnerability L1tf Mitigation; PTE Inversion Vulnerability Mds Mitigation; Clear CPU buffers; SMT Host state unknown Vulnerability Meltdown Mitigation; PTI Vulnerability Spec store bypass Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1 Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2 Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling Vulnerability Srbds Not affected Vulnerability Tsx async abort Mitigation; Clear CPU buffers; SMT Host state unknown



Tried installing ViSSL with both direct installation and from source and face the same issue: training not progressing 

If your issue looks like an installation issue / environment issue,
please first try to solve it with the instructions in
https://github.com/facebookresearch/vissl/tree/main/docs

QuentinDuval commented 2 years ago

Hi @kulkarnivishal,

Thank you for using VISSL :) Sorry for the late answer (I got first COVID then went into 1 month PTO).

So this really looks like an environment issue with distributed training. The initialisation of the distributed group seems to have gone fine, but maybe the test of the distributed training has failed:

In the code of trainer_main.py, there is a call to dist.all_reduce(torch.zeros(1).cuda()) right after the initialisation of the distributed training that we saw in your logs. It might be what is failing but we need to make sure of it to decide on the next steps.

If you installed from source, could you add some logs around the dist.all_reduce(torch.zeros(1).cuda()) in the setup_distributed function of the trainer_main.py? Could you also add some logs in the following places:

before and after self.task = build_task(self.cfg) in trainer_main.py
before and after self.task.init_distributed_data_parallel_model() in trainer_main.py

And then re-run your exact command to check what we get.

Thank you, Quentin

tungts1101 commented 9 months ago

Today I have tried vissl and stumbled upon the same error. I have followed your suggestion but nothing printed out to the screen.

facebookresearch / vissl

Model not training, training is stuck without errors #548

Instructions To Reproduce the Issue: