VainF / DeepLabV3Plus-Pytorch

Pretrained DeepLabv3 and DeepLabv3+ for Pascal VOC & Cityscapes
MIT License

Nice Repo! #5

Open · MaureenZOU opened this issue 4 years ago

MaureenZOU commented 4 years ago

This repo is really nice; the performance on Pascal VOC could be reproduced using 2 GPUs with a batch size of 16.

shipra25jain commented 4 years ago

Did this repo work when you gave 2 GPU ids in the argument? Did you have to make any changes in the code?

VainF commented 4 years ago

Hello @shipra25jain, it works with any number of GPUs.

MaureenZOU commented 4 years ago

Actually, when I trained on a 2-GPU machine and a 4-GPU machine, the performance did vary, with about a 2 percent drop on the 4-GPU machine. From my point of view, since it doesn't use global (synchronized) BN, the per-GPU batch size matters a lot.
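(For context, a minimal sketch of why the per-GPU batch shrinks; the toy model below is illustrative and assumes the usual torch.nn.DataParallel wrapping, not this repo's exact training code.)

```python
import torch
import torch.nn as nn

# DataParallel splits every input batch across the visible GPUs, so each
# replica's BatchNorm layers compute statistics over batch_size / num_gpus samples.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8))
x = torch.randn(16, 3, 64, 64)  # global batch size 16

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()
    x = x.cuda()

# 2 GPUs: each BN update sees 16 / 2 = 8 samples.
# 4 GPUs: each BN update sees 16 / 4 = 4 samples, so the batch statistics
# are noisier, which can explain the ~2 percent mIoU drop described above.
y = model(x)
```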

MaureenZOU commented 4 years ago

@VainF, if my experiment has any problem, please point it out!

shipra25jain commented 4 years ago

> Hello @shipra25jain, it works with any number of GPUs.

Thanks @VainF for the reply. It seems to be working now after adding 'device_ids' to DataParallel(), since my default gpu_ids are not 0 and 1 but 5 and 7. However, there seems to be a bug in the PolyLR scheduler: shouldn't it be (1 - last_epoch/max_epochs)**power? I mean, instead of max_iters in the formula, shouldn't it be max_epochs?
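(A minimal sketch of the device_ids workaround mentioned above; the stand-in model and variable names are illustrative, not the repo's code.)

```python
import torch
import torch.nn as nn

gpu_ids = [5, 7]                       # use these instead of the default GPUs 0 and 1
model = nn.Conv2d(3, 8, 3, padding=1)  # stand-in for the DeepLab model

if torch.cuda.is_available() and max(gpu_ids) < torch.cuda.device_count():
    # The parameters must live on the first device listed in device_ids.
    model = nn.DataParallel(model.cuda(gpu_ids[0]), device_ids=gpu_ids)

out = model(torch.randn(4, 3, 64, 64).to(next(model.parameters()).device))
```

An alternative is to launch with CUDA_VISIBLE_DEVICES=5,7, so those GPUs show up inside the process as cuda:0 and cuda:1 and DataParallel's defaults work unchanged.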

VainF commented 4 years ago

> Actually, when I trained on a 2-GPU machine and a 4-GPU machine, the performance did vary, with about a 2 percent drop on the 4-GPU machine. From my point of view, since it doesn't use global (synchronized) BN, the per-GPU batch size matters a lot.

@MaureenZOU, yes, batch size is an important hyperparameter for BN. It is recommended to use a large batch size (e.g. >8). As far as I know, there is no SyncBN in PyTorch. Please try third-party implementations if SyncBN is required.
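(Note: newer PyTorch releases do include torch.nn.SyncBatchNorm, but it only synchronizes statistics across processes under DistributedDataParallel, not under DataParallel. A minimal sketch of the conversion, assuming a DDP setup rather than this repo's DataParallel training:)

```python
import torch.nn as nn

# Swap every BatchNorm*d layer for SyncBatchNorm so BN statistics are
# computed over the global batch once the model runs under DDP.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8))
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Inside an initialized torch.distributed process group one would then wrap:
#   model = nn.parallel.DistributedDataParallel(model.cuda(rank), device_ids=[rank])
```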

> > Hello @shipra25jain, it works with any number of GPUs.
>
> Thanks @VainF for the reply. It seems to be working now after adding 'device_ids' to DataParallel(), since my default gpu_ids are not 0 and 1 but 5 and 7. However, there seems to be a bug in the PolyLR scheduler: shouldn't it be (1 - last_epoch/max_epochs)**power? I mean, instead of max_iters in the formula, shouldn't it be max_epochs?

@shipra25jain, thank you for pointing out this issue. In this repo, the learning rate is scheduled at each iteration, so last_epoch actually means last_iter. I will rename it to make the code more straightforward.
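(A minimal sketch of a per-iteration poly schedule consistent with that explanation; the class and argument names are illustrative and may differ from this repo's scheduler.)

```python
from torch.optim.lr_scheduler import _LRScheduler

class PolyLR(_LRScheduler):
    """Decay each base LR by (1 - iter / max_iters) ** power.

    step() is called once per training iteration, so the inherited
    `last_epoch` counter really counts iterations ("last_iter").
    """
    def __init__(self, optimizer, max_iters, power=0.9, last_epoch=-1):
        self.max_iters = max_iters
        self.power = power
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        factor = (1 - self.last_epoch / self.max_iters) ** self.power
        return [base_lr * factor for base_lr in self.base_lrs]
```

Calling scheduler.step() after every optimizer update walks last_epoch from 0 to max_iters, which is why dividing by max_iters rather than max_epochs is the intended behavior here.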