RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

LvShuaiChao commented 3 years ago

I tried to train this segmentation network on a single GPU and made the following changes to the config:

(before)
GPUS: (0,1,2,3)
WORKERS: 4

(after)
GPUS: (0,)
WORKERS: 1

`len(gpus): 1
Namespace(cfg='../experiments/lip/seg_hrnet_w48_473x473_sgd_lr7e-3_wd5e-4_bs_40_epoch150.yaml', local_rank=0, opts=[])
AUTO_RESUME: False
CUDNN:
  BENCHMARK: True
  DETERMINISTIC: False
  ENABLED: True
DATASET:
  DATASET: lip
  EXTRA_TRAIN_SET:
  NUM_CLASSES: 20
  ROOT: ../data/
  TEST_SET: list/lip/valList.txt
  TRAIN_SET: list/lip/trainList.txt
DEBUG:
  DEBUG: False
  SAVE_BATCH_IMAGES_GT: False
  SAVE_BATCH_IMAGES_PRED: False
  SAVE_HEATMAPS_GT: False
  SAVE_HEATMAPS_PRED: False
GPUS: ('0',)
LOG_DIR: log
LOSS:
  CLASS_BALANCE: True
  OHEMKEEP: 131072
  OHEMTHRES: 0.9
  USE_OHEM: False
MODEL:
  EXTRA:
    FINAL_CONV_KERNEL: 1
    STAGE1:
      BLOCK: BOTTLENECK
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4]
      NUM_CHANNELS: [64]
      NUM_MODULES: 1
      NUM_RANCHES: 1
    STAGE2:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4]
      NUM_BRANCHES: 2
      NUM_CHANNELS: [48, 96]
      NUM_MODULES: 1
    STAGE3:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4, 4]
      NUM_BRANCHES: 3
      NUM_CHANNELS: [48, 96, 192]
      NUM_MODULES: 4
    STAGE4:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4, 4, 4]
      NUM_BRANCHES: 4
      NUM_CHANNELS: [48, 96, 192, 384]
      NUM_MODULES: 3
  NAME: seg_hrnet
  PRETRAINED: pretrained_models/hrnetv2_w48_imagenet_pretrained.pth
OUTPUT_DIR: output
PIN_MEMORY: True
PRINT_FREQ: 10
RANK: 0
TEST:
  BASE_SIZE: 473
  BATCH_SIZE_PER_GPU: 16
  CENTER_CROP_TEST: False
  FLIP_TEST: False
  IMAGE_SIZE: [473, 473]
  MODEL_FILE:
  MULTI_SCALE: False
  NUM_SAMPLES: 2000
  SCALE_LIST: [1]
TRAIN:
  BASE_SIZE: 473
  BATCH_SIZE_PER_GPU: 10
  BEGIN_EPOCH: 0
  DOWNSAMPLERATE: 1
  END_EPOCH: 150
  EXTRA_EPOCH: 0
  EXTRA_LR: 0.001
  FLIP: True
  IGNORE_LABEL: 255
  IMAGE_SIZE: [473, 473]
  LR: 0.007
  LR_FACTOR: 0.1
  LR_STEP: [90, 110]
  MOMENTUM: 0.9
  MULTI_SCALE: True
  NESTEROV: False
  NUM_SAMPLES: 0
  OPTIMIZER: sgd
  RESUME: True
  SCALE_FACTOR: 11
  SHUFFLE: True
  WD: 0.0005
WORKERS: 1
=> init weights from normal distribution
D:\Program\Anaconda3\envs\torch_1.8\lib\site-packages\torch\nn\functional.py:3458: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  "See the documentation of nn.Upsample for details.".format(mode)
D:\Program\Anaconda3\envs\torch_1.8\lib\site-packages\torch\nn\functional.py:3328: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")

Total Parameters: 65,860,100

Total Multiply Adds (For Convolution and Linear Layers only): 75.98179006576538 GFLOPs

Number of Layers
Conv2d : 307 layers
BatchNorm2d : 306 layers
ReLU : 269 layers
Bottleneck : 4 layers
BasicBlock : 104 layers
HighResolutionModule : 8 layers

Traceback (most recent call last):
  File "E:/Documents/Desktop/HRNet/segmentation/HRNet-Semantic-Segmentation-pytorch-v1.1/tools/train.py", line 297, in <module>
    main()
  File "E:/Documents/Desktop/HRNet/segmentation/HRNet-Semantic-Segmentation-pytorch-v1.1/tools/train.py", line 211, in main
    model, device_ids=[args.local_rank], output_device=args.local_rank)
  File "D:\Program\Anaconda3\envs\torch_1.8\lib\site-packages\torch\nn\parallel\distributed.py", line 401, in __init__
    self.process_group = _get_default_group()
  File "D:\Program\Anaconda3\envs\torch_1.8\lib\site-packages\torch\distributed\distributed_c10d.py", line 347, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.`

cvchanghao commented 3 years ago

Hi, I ran into the same error. Did you solve it?

rawalkhirodkar commented 3 years ago

Please turn off distributed training in train.py and do not use syncBN.
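
A minimal sketch of what turning off distributed training could look like, assuming train.py builds the `DistributedDataParallel` wrapper unconditionally, as the traceback above suggests (line 211). The helper names `distributed_requested` and `wrap_model` are hypothetical and not part of the HRNet code base. `WORLD_SIZE` is the environment variable exported by the PyTorch launchers, and the `nccl` backend is an assumption. The idea is to create the process group and the DDP wrapper only when a launcher actually started more than one process:

```python
import os


def distributed_requested(env=None):
    """Return True only when a launcher exported WORLD_SIZE > 1."""
    env = os.environ if env is None else env
    return int(env.get("WORLD_SIZE", "1")) > 1


def wrap_model(model, local_rank, env=None):
    """Hypothetical helper: use DDP for multi-process runs, a plain model otherwise."""
    import torch  # imported lazily so the helper above works without torch

    if distributed_requested(env):
        # DDP needs an initialized default process group; this is the call
        # the error message asks for.
        if not torch.distributed.is_initialized():
            # "nccl" is an assumption (usual on Linux GPUs); platforms
            # without NCCL, such as Windows, may need "gloo" instead.
            torch.distributed.init_process_group(backend="nccl", init_method="env://")
        model = model.to(torch.device("cuda", local_rank))
        return torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[local_rank], output_device=local_rank
        )

    # Single-process run: no process group and no DDP wrapper.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return model.to(device)
```

With a guard like this, starting `tools/train.py` directly (without a distributed launcher) takes the single-process branch, so `DistributedDataParallel` is never constructed and the process group is never looked up.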

bolero2 commented 3 years ago

If you use a single GPU (not distributed mode), you should change the BatchNorm definition in lib/models/bn_helper.py:

[line 10]

# BatchNorm2d_class = BatchNorm2d = torch.nn.SyncBatchNorm
BatchNorm2d_class = BatchNorm2d = torch.nn.BatchNorm2d
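
Instead of editing the line by hand for each setup, the choice could be made at import time. The following is only a sketch under assumptions: `norm_layer_name` and `resolve_norm_layer` are hypothetical names, not part of bn_helper.py, and it relies on the PyTorch launchers exporting `WORLD_SIZE` before the module is imported. It falls back to the plain `BatchNorm2d` whenever the script runs as a single process:

```python
import os


def norm_layer_name(env=None):
    """Pick the torch.nn class name based on the launcher's WORLD_SIZE."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    # SyncBatchNorm only pays off when several processes share statistics.
    return "SyncBatchNorm" if world_size > 1 else "BatchNorm2d"


def resolve_norm_layer(env=None):
    """Return the actual torch.nn class (hypothetical helper)."""
    import torch  # imported lazily so norm_layer_name works without torch

    return getattr(torch.nn, norm_layer_name(env))


# Possible use inside bn_helper.py (an assumption, not the upstream code):
# BatchNorm2d_class = BatchNorm2d = resolve_norm_layer()
```

The environment variable is used rather than `torch.distributed.is_initialized()` because a module-level assignment runs at import time, which may be before the training script initializes the process group.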

Robinxin123 commented 2 years ago

> Please turn off distributed training in train.py and do not use syncBN.

Do you know how to turn off distributed training in train.py? Many thanks.

HRNet / HRNet-Semantic-Segmentation

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. #222
