Open LvShuaiChao opened 3 years ago
Hi, I ran into the same error. Did you solve it?
Please turn off distributed training in train.py and do not use syncBN.
If you train on a single GPU (not in distributed mode), change the BatchNorm assignment in lib/models/bn_helper.py (line 10):
# BatchNorm2d_class = BatchNorm2d = torch.nn.SyncBatchNorm
BatchNorm2d_class = BatchNorm2d = torch.nn.BatchNorm2d
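If you switch between single-GPU and distributed runs often, the same choice could be made from a flag instead of by commenting lines in and out. This is only a minimal sketch: the helper name `select_batchnorm` and the `distributed` flag are hypothetical and not part of the repository.

```python
def select_batchnorm(nn_module, distributed):
    """Return the normalization layer class to use.

    nn_module is expected to be torch.nn; it is passed in as an argument
    so this hypothetical helper has no hard import dependency.
    """
    # SyncBatchNorm synchronizes statistics across processes, so it only
    # makes sense when training is actually distributed.
    if distributed:
        return nn_module.SyncBatchNorm
    return nn_module.BatchNorm2d


# Intended use in lib/models/bn_helper.py (assumes PyTorch is installed):
#   import torch.nn as nn
#   BatchNorm2d_class = BatchNorm2d = select_batchnorm(nn, distributed=False)
```

With `distributed=False` this gives the same result as the edited line 10 above.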
Do you know how to turn off distributed training in train.py? Many thanks.
I trained this segmentation network on a single GPU and made the following changes to the config:
(before)
GPUS: (0,1,2,3)
WORKERS: 4
(after)
GPUS: (0,)
WORKERS: 1
`len(gpus): 1
Namespace(cfg='../experiments/lip/seg_hrnet_w48_473x473_sgd_lr7e-3_wd5e-4_bs_40_epoch150.yaml', local_rank=0, opts=[])
AUTO_RESUME: False
CUDNN:
  BENCHMARK: True
  DETERMINISTIC: False
  ENABLED: True
DATASET:
  DATASET: lip
  EXTRA_TRAIN_SET:
  NUM_CLASSES: 20
  ROOT: ../data/
  TEST_SET: list/lip/valList.txt
  TRAIN_SET: list/lip/trainList.txt
DEBUG:
  DEBUG: False
  SAVE_BATCH_IMAGES_GT: False
  SAVE_BATCH_IMAGES_PRED: False
  SAVE_HEATMAPS_GT: False
  SAVE_HEATMAPS_PRED: False
GPUS: ('0',)
LOG_DIR: log
LOSS:
  CLASS_BALANCE: True
  OHEMKEEP: 131072
  OHEMTHRES: 0.9
  USE_OHEM: False
MODEL:
  EXTRA:
    FINAL_CONV_KERNEL: 1
    STAGE1:
      BLOCK: BOTTLENECK
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4]
      NUM_CHANNELS: [64]
      NUM_MODULES: 1
      NUM_RANCHES: 1
    STAGE2:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4]
      NUM_BRANCHES: 2
      NUM_CHANNELS: [48, 96]
      NUM_MODULES: 1
    STAGE3:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4, 4]
      NUM_BRANCHES: 3
      NUM_CHANNELS: [48, 96, 192]
      NUM_MODULES: 4
    STAGE4:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4, 4, 4]
      NUM_BRANCHES: 4
      NUM_CHANNELS: [48, 96, 192, 384]
      NUM_MODULES: 3
  NAME: seg_hrnet
  PRETRAINED: pretrained_models/hrnetv2_w48_imagenet_pretrained.pth
OUTPUT_DIR: output
PIN_MEMORY: True
PRINT_FREQ: 10
RANK: 0
TEST:
  BASE_SIZE: 473
  BATCH_SIZE_PER_GPU: 16
  CENTER_CROP_TEST: False
  FLIP_TEST: False
  IMAGE_SIZE: [473, 473]
  MODEL_FILE:
  MULTI_SCALE: False
  NUM_SAMPLES: 2000
  SCALE_LIST: [1]
TRAIN:
  BASE_SIZE: 473
  BATCH_SIZE_PER_GPU: 10
  BEGIN_EPOCH: 0
  DOWNSAMPLERATE: 1
  END_EPOCH: 150
  EXTRA_EPOCH: 0
  EXTRA_LR: 0.001
  FLIP: True
  IGNORE_LABEL: 255
  IMAGE_SIZE: [473, 473]
  LR: 0.007
  LR_FACTOR: 0.1
  LR_STEP: [90, 110]
  MOMENTUM: 0.9
  MULTI_SCALE: True
  NESTEROV: False
  NUM_SAMPLES: 0
  OPTIMIZER: sgd
  RESUME: True
  SCALE_FACTOR: 11
  SHUFFLE: True
  WD: 0.0005
WORKERS: 1
=> init weights from normal distribution
D:\Program\Anaconda3\envs\torch_1.8\lib\site-packages\torch\nn\functional.py:3458: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  "See the documentation of nn.Upsample for details.".format(mode)
D:\Program\Anaconda3\envs\torch_1.8\lib\site-packages\torch\nn\functional.py:3328: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
Total Parameters: 65,860,100
Total Multiply Adds (For Convolution and Linear Layers only): 75.98179006576538 GFLOPs
Number of Layers
Conv2d : 307 layers
BatchNorm2d : 306 layers
ReLU : 269 layers
Bottleneck : 4 layers
BasicBlock : 104 layers
HighResolutionModule : 8 layers
Traceback (most recent call last):
  File "E:/Documents/Desktop/HRNet/segmentation/HRNet-Semantic-Segmentation-pytorch-v1.1/tools/train.py", line 297, in
    main()
  File "E:/Documents/Desktop/HRNet/segmentation/HRNet-Semantic-Segmentation-pytorch-v1.1/tools/train.py", line 211, in main
    model, device_ids=[args.local_rank], output_device=args.local_rank)
  File "D:\Program\Anaconda3\envs\torch_1.8\lib\site-packages\torch\nn\parallel\distributed.py", line 401, in __init__
    self.process_group = _get_default_group()
  File "D:\Program\Anaconda3\envs\torch_1.8\lib\site-packages\torch\distributed\distributed_c10d.py", line 347, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
`
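The RuntimeError in this log comes from constructing DistributedDataParallel when no process group has been initialized, so changing GPUS and WORKERS alone is not enough. One possible way to turn off distributed training is to skip that wrapper on single-GPU runs. This is a rough sketch, not the repository's code: `launched_distributed` and `prepare_model` are hypothetical helpers, and it assumes the launcher exports `WORLD_SIZE` as `torch.distributed.launch` does.

```python
import os


def launched_distributed(env=None):
    """Return True when a launcher exported WORLD_SIZE greater than 1."""
    env = os.environ if env is None else env
    return int(env.get("WORLD_SIZE", "1")) > 1


def prepare_model(model, local_rank, distributed):
    """Move the model to its GPU and wrap it in DDP only when distributed."""
    # Imported here so launched_distributed() can be used without PyTorch.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    device = torch.device("cuda:{}".format(local_rank))
    model = model.to(device)
    if not distributed:
        # Single GPU: a plain module, no process group required.
        return model
    if not dist.is_initialized():
        # backend="nccl" is an assumption; NCCL is not available on
        # Windows, where "gloo" is the usual choice.
        dist.init_process_group(backend="nccl", init_method="env://")
    return DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank
    )
```

In train.py this would stand in for the DistributedDataParallel call at line 211, for example `model = prepare_model(model, args.local_rank, launched_distributed())`. Any other code in the script that calls `torch.distributed`, such as a distributed sampler or a cross-process loss reduction, would probably need the same guard.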