LikeLy-Journey / SegmenTron

Supports PointRend, Fast_SCNN, HRNet, Deeplabv3_plus (xception, resnet, mobilenet), ContextNet, FPENet, DABNet, EdaNet, ENet, Espnetv2, RefineNet, UNet, DANet, DFANet, HardNet, LedNet, OCNet, EncNet, DuNet, CGNet, CCNet, BiSeNet, PSPNet, ICNet, FCN, deeplab
Apache License 2.0

Bug in tools/train.py #59

Open · leonmakise opened 4 years ago

leonmakise commented 4 years ago

Single-GPU training works, but distributed training fails no matter how many GPUs I assign.

https://github.com/LikeLy-Journey/SegmenTron/blob/4bc605eedde7d680314f63d329277b73f83b1c5f/tools/train.py#L109

This line should be self.model.cuda().

It works when I change this line.
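
To make the change concrete, here is a minimal standalone sketch of the pattern (not the actual tools/train.py code; the dummy model, script name, and launch command are placeholders I made up): the model has to be on this process's GPU before it is wrapped in DistributedDataParallel.

```python
# Minimal standalone sketch (not the repo code).
# Run with: python -m torch.distributed.launch --nproc_per_node=2 ddp_sketch.py
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

dist.init_process_group(backend='nccl', init_method='env://')
torch.cuda.set_device(args.local_rank)

# Dummy model standing in for the segmentation network.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# If the model (or part of it) is still on the CPU at this point, DDP raises the
# "only work with single-device CUDA modules" AssertionError shown in the log below.
model.cuda()

model = nn.parallel.DistributedDataParallel(
    model,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
    find_unused_parameters=True)
```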

Here is the error message I got with the original code:

(faceparsing) mjq@amax:~/SegmenTron$ CUDA_VISIBLE_DEVICES=0,7 ./tools/dist_train.sh ${CONFIG_FILE} configs/pascal_voc_deeplabv3_plus.yaml ${GPU_NUM} 2


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


2020-06-06 02:21:55,815 Segmentron INFO: Using 2 GPUs
2020-06-06 02:21:55,816 Segmentron INFO: Namespace(config_file='configs/pascal_voc_deeplabv3_plus.yaml', device='cuda', distributed=True, input_img='tools/demo_vis.png', local_rank=0, log_iter=10, no_cuda=False, num_gpus=2, opts=[], resume=None, skip_val=False, val_epoch=1)
2020-06-06 02:21:55,816 Segmentron INFO: {
  "SEED": 1024,
  "TIME_STAMP": "2020-06-06-02-21",
  "ROOT_PATH": "/data1/mjq/SegmenTron",
  "PHASE": "train",
  "DATASET": { "NAME": "pascal_voc", "MEAN": [ 0.5, 0.5, 0.5 ], "STD": [ 0.5, 0.5, 0.5 ], "IGNORE_INDEX": -1, "WORKERS": 4, "MODE": "val" },
  "AUG": { "MIRROR": true, "BLUR_PROB": 0.0, "BLUR_RADIUS": 0.0, "COLOR_JITTER": null },
  "TRAIN": { "EPOCHS": 50, "BATCH_SIZE": 4, "CROP_SIZE": 480, "BASE_SIZE": 520, "MODEL_SAVE_DIR": "runs/checkpoints/", "LOG_SAVE_DIR": "runs/logs/", "PRETRAINED_MODEL_PATH": "", "BACKBONE_PRETRAINED": true, "BACKBONE_PRETRAINED_PATH": "", "RESUME_MODEL_PATH": "", "SYNC_BATCH_NORM": true, "SNAPSHOT_EPOCH": 10 },
  "SOLVER": { "LR": 0.0001, "OPTIMIZER": "sgd", "EPSILON": 1e-08, "MOMENTUM": 0.9, "WEIGHT_DECAY": 0.0001, "DECODER_LR_FACTOR": 10.0, "LR_SCHEDULER": "poly", "POLY": { "POWER": 0.9 }, "STEP": { "GAMMA": 0.1, "DECAY_EPOCH": [ 10, 20 ] }, "WARMUP": { "EPOCHS": 0.0, "FACTOR": 0.3333333333333333, "METHOD": "linear" }, "OHEM": false, "AUX": false, "AUX_WEIGHT": 0.4, "LOSS_NAME": "" },
  "TEST": { "TEST_MODEL_PATH": "", "BATCH_SIZE": 8, "CROP_SIZE": null, "SCALES": [ 1.0 ], "FLIP": false },
  "VISUAL": { "OUTPUT_DIR": "../runs/visual/" },
  "MODEL": { "MODEL_NAME": "DeepLabV3_Plus", "BACKBONE": "xception65", "BACKBONE_SCALE": 1.0, "MULTI_LOSS_WEIGHT": [ 1.0 ], "DEFAULT_GROUP_NUMBER": 32, "DEFAULT_EPSILON": 1e-05, "BN_TYPE": "BN", "BN_EPS_FOR_ENCODER": 0.001, "BN_EPS_FOR_DECODER": null, "OUTPUT_STRIDE": 16, "BN_MOMENTUM": null, "DEEPLABV3_PLUS": { "USE_ASPP": true, "ENABLE_DECODER": true, "ASPP_WITH_SEP_CONV": true, "DECODER_USE_SEP_CONV": true }, "CCNET": { "RECURRENCE": 2 } }
}
Found 1464 images in the folder datasets/voc/VOC2012
Found 1464 images in the folder datasets/voc/VOC2012
Found 1449 images in the folder datasets/voc/VOC2012
Found 1449 images in the folder datasets/voc/VOC2012
2020-06-06 02:21:56,181 Segmentron INFO: load backbone pretrained model from url..
2020-06-06 02:21:56,480 Segmentron INFO:
Traceback (most recent call last):
  File "./tools/train.py", line 223, in <module>
    trainer = Trainer(args)
  File "./tools/train.py", line 112, in __init__
    find_unused_parameters=True)
  File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 232, in __init__
    ).format(device_ids, output_device, {p.device for p in module.parameters()})
AssertionError: DistributedDataParallel device_ids and output_device arguments only work with single-device CUDA modules, but got device_ids [1], output_device 1, and module parameters {device(type='cuda', index=1), device(type='cpu')}.
2020-06-06 02:21:57,748 Segmentron INFO: DeepLabV3Plus flops: 413.257G input shape is [3, 1024, 2048], params: 41.055M
2020-06-06 02:21:57,776 Segmentron INFO: SyncBatchNorm is effective!
2020-06-06 02:21:57,776 Segmentron INFO: Set bn custom eps for bn in encoder: 0.001
Traceback (most recent call last):
  File "./tools/train.py", line 223, in <module>
    trainer = Trainer(args)
  File "./tools/train.py", line 112, in __init__
    find_unused_parameters=True)
  File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 232, in __init__
    ).format(device_ids, output_device, {p.device for p in module.parameters()})
AssertionError: DistributedDataParallel device_ids and output_device arguments only work with single-device CUDA modules, but got device_ids [0], output_device 0, and module parameters {device(type='cuda', index=0), device(type='cpu')}.
Traceback (most recent call last):
  File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/data1/mjq/anaconda3/envs/faceparsing/bin/python', '-u', './tools/train.py', '--local_rank=1', '--config-file', 'configs/pascal_voc_deeplabv3_plus.yaml']' returned non-zero exit status 1.

Thanks for your attention! @LikeLy-Journey

jiawenhao2015 commented 4 years ago

@leonmakise Hello, I met the same error when I tried distributed training. I saw your change, but I cannot understand what it means.

You wrote: "It shall be self.model.cuda()". The original code is:

self.model = nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)

Is your change this: self.model.cuda() = nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)? I am not sure how to read that, so I put my guess in the sketch below. Thank you~~ looking forward to your reply.
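
My guess (not verified against the repo, and abbreviated, so self and args come from the surrounding Trainer.__init__) is that .cuda() is called on the model before the wrapper, rather than being assigned to:

```python
# My guess at the intended fix (not tested; self and args come from Trainer.__init__).
self.model.cuda()  # first move every parameter/buffer onto this process's GPU
self.model = nn.parallel.DistributedDataParallel(
    self.model,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
    find_unused_parameters=True)
```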