Open leonmakise opened 4 years ago
@leonmakise Hello, I met the same error when I tried distributed training. I saw your change, but I do not understand what it means.
You said it should be self.model.cuda(). The original code is:
self.model = nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)
and your change is:
self.model.cuda() = nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)
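Do you mean calling .cuda() on the model first and only then wrapping it, roughly like this (just my guess, I have not tested it)?
self.model.cuda()  # move all parameters onto the current GPU first
self.model = nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)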
Thank you! Looking forward to your reply.
Training worked on a single GPU, but distributed training failed no matter how many GPUs I assigned.
https://github.com/LikeLy-Journey/SegmenTron/blob/4bc605eedde7d680314f63d329277b73f83b1c5f/tools/train.py#L109
It should be
self.model.cuda()
It works once I change this line.
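In other words, the model has to be on the GPU before it is wrapped by DistributedDataParallel; otherwise some parameters are still on the CPU and the assertion below fires. Roughly, my edit looks like this (variable names as in tools/train.py; the surrounding lines may differ in your copy):
self.model.cuda()  # move every parameter to the current GPU first
self.model = nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)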
The following is the error message I got with the original code:
(faceparsing) mjq@amax:~/SegmenTron$ CUDA_VISIBLE_DEVICES=0,7 ./tools/dist_train.sh ${CONFIG_FILE} configs/pascal_voc_deeplabv3_plus.yaml ${GPU_NUM} 2
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
2020-06-06 02:21:55,815 Segmentron INFO: Using 2 GPUs
2020-06-06 02:21:55,816 Segmentron INFO: Namespace(config_file='configs/pascal_voc_deeplabv3_plus.yaml', device='cuda', distributed=True, input_img='tools/demo_vis.png', local_rank=0, log_iter=10, no_cuda=False, num_gpus=2, opts=[], resume=None, skip_val=False, val_epoch=1)
2020-06-06 02:21:55,816 Segmentron INFO: { "SEED": 1024, "TIME_STAMP": "2020-06-06-02-21", "ROOT_PATH": "/data1/mjq/SegmenTron", "PHASE": "train", "DATASET": { "NAME": "pascal_voc", "MEAN": [ 0.5, 0.5, 0.5 ], "STD": [ 0.5, 0.5, 0.5 ], "IGNORE_INDEX": -1, "WORKERS": 4, "MODE": "val" }, "AUG": { "MIRROR": true, "BLUR_PROB": 0.0, "BLUR_RADIUS": 0.0, "COLOR_JITTER": null }, "TRAIN": { "EPOCHS": 50, "BATCH_SIZE": 4, "CROP_SIZE": 480, "BASE_SIZE": 520, "MODEL_SAVE_DIR": "runs/checkpoints/", "LOG_SAVE_DIR": "runs/logs/", "PRETRAINED_MODEL_PATH": "", "BACKBONE_PRETRAINED": true, "BACKBONE_PRETRAINED_PATH": "", "RESUME_MODEL_PATH": "", "SYNC_BATCH_NORM": true, "SNAPSHOT_EPOCH": 10 }, "SOLVER": { "LR": 0.0001, "OPTIMIZER": "sgd", "EPSILON": 1e-08, "MOMENTUM": 0.9, "WEIGHT_DECAY": 0.0001, "DECODER_LR_FACTOR": 10.0, "LR_SCHEDULER": "poly", "POLY": { "POWER": 0.9 }, "STEP": { "GAMMA": 0.1, "DECAY_EPOCH": [ 10, 20 ] }, "WARMUP": { "EPOCHS": 0.0, "FACTOR": 0.3333333333333333, "METHOD": "linear" }, "OHEM": false, "AUX": false, "AUX_WEIGHT": 0.4, "LOSS_NAME": "" }, "TEST": { "TEST_MODEL_PATH": "", "BATCH_SIZE": 8, "CROP_SIZE": null, "SCALES": [ 1.0 ], "FLIP": false }, "VISUAL": { "OUTPUT_DIR": "../runs/visual/" }, "MODEL": { "MODEL_NAME": "DeepLabV3_Plus", "BACKBONE": "xception65", "BACKBONE_SCALE": 1.0, "MULTI_LOSS_WEIGHT": [ 1.0 ], "DEFAULT_GROUP_NUMBER": 32, "DEFAULT_EPSILON": 1e-05, "BN_TYPE": "BN", "BN_EPS_FOR_ENCODER": 0.001, "BN_EPS_FOR_DECODER": null, "OUTPUT_STRIDE": 16, "BN_MOMENTUM": null, "DEEPLABV3_PLUS": { "USE_ASPP": true, "ENABLE_DECODER": true, "ASPP_WITH_SEP_CONV": true, "DECODER_USE_SEP_CONV": true }, "CCNET": { "RECURRENCE": 2 } } }
Found 1464 images in the folder datasets/voc/VOC2012
Found 1464 images in the folder datasets/voc/VOC2012
Found 1449 images in the folder datasets/voc/VOC2012
Found 1449 images in the folder datasets/voc/VOC2012
2020-06-06 02:21:56,181 Segmentron INFO: load backbone pretrained model from url..
2020-06-06 02:21:56,480 Segmentron INFO:
Traceback (most recent call last):
File "./tools/train.py", line 223, in
trainer = Trainer(args)
File "./tools/train.py", line 112, in init
find_unused_parameters=True)
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 232, in init
).format(device_ids, output_device, {p.device for p in module.parameters()})
AssertionError: DistributedDataParallel device_ids and output_device arguments only work with single-device CUDA modules, but got device_ids [1], output_device 1, and module parameters {device(type='cuda', index=1), device(type='cpu')}.
2020-06-06 02:21:57,748 Segmentron INFO: DeepLabV3Plus flops: 413.257G input shape is [3, 1024, 2048], params: 41.055M
2020-06-06 02:21:57,776 Segmentron INFO: SyncBatchNorm is effective!
2020-06-06 02:21:57,776 Segmentron INFO: Set bn custom eps for bn in encoder: 0.001
Traceback (most recent call last):
File "./tools/train.py", line 223, in
trainer = Trainer(args)
File "./tools/train.py", line 112, in init
find_unused_parameters=True)
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 232, in init
).format(device_ids, output_device, {p.device for p in module.parameters()})
AssertionError: DistributedDataParallel device_ids and output_device arguments only work with single-device CUDA modules, but got device_ids [0], output_device 0, and module parameters {device(type='cuda', index=0), device(type='cpu')}.
Traceback (most recent call last):
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in
main()
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/data1/mjq/anaconda3/envs/faceparsing/bin/python', '-u', './tools/train.py', '--local_rank=1', '--config-file', 'configs/pascal_voc_deeplabv3_plus.yaml']' returned non-zero exit status 1.
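From the assertion, DistributedDataParallel checks the devices of all module parameters and refuses to wrap a module that is spread across cuda and cpu. A quick check like this (rough sketch, run right before the DDP wrap) shows the same device set that the assertion reports:
devices = {p.device for p in self.model.parameters()}
print(devices)  # with the old code: {device(type='cuda', index=0), device(type='cpu')}, matching the assertion above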
Thanks for your attention! @LikeLy-Journey