MaureenZOU opened this issue 4 years ago (status: Open)
Did this repo work when you gave 2 GPU ids in the argument? Did you have to make any changes in the code?
Hello @shipra25jain, it works with any number of GPUs.
Actually, when I trained on a 2-GPU machine and a 4-GPU machine, the performance did vary, with about a 2 percent drop on the 4-GPU machine. From my point of view, since the code doesn't use global BN, the per-GPU batch size matters a lot.
@VainF, if there is any problem with my experiment, please point it out!
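As an illustrative aside (not part of the original thread): with nn.DataParallel the global batch is split evenly across the visible GPUs, so the effective batch each BatchNorm layer sees shrinks as GPUs are added. A minimal sketch, assuming a global batch size of 16:

```python
# Minimal sketch: per-GPU batch size under nn.DataParallel (assumed global batch of 16).
global_batch_size = 16
for num_gpus in (2, 4):
    per_gpu = global_batch_size // num_gpus
    print(f"{num_gpus} GPUs -> {per_gpu} samples per GPU for BatchNorm statistics")
# 2 GPUs -> 8 samples per GPU; 4 GPUs -> 4 samples per GPU, so BN statistics
# become noisier on the 4-GPU machine, which could explain the accuracy drop.
```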
Thanks @VainF for the reply. It seems to be working now after adding device_ids in DataParallel(), since my default gpu_ids are not 0 and 1 but 5 and 7. However, there seems to be a bug in the PolyLR scheduler. Shouldn't it be (1 - last_epoch/max_epochs)**power? I mean, instead of max_iters in the formula, shouldn't it be max_epochs?
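For reference, a minimal sketch (not the repo's exact code) of restricting nn.DataParallel to specific GPU ids such as 5 and 7; the model's parameters must live on the first listed device:

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the actual segmentation model.
model = nn.Conv2d(3, 16, kernel_size=3)

gpu_ids = [5, 7]                                    # assumed ids from the comment above
model = nn.DataParallel(model, device_ids=gpu_ids)  # outputs are gathered on gpu_ids[0]
model.to(torch.device(f"cuda:{gpu_ids[0]}"))        # parameters go on the first listed GPU
```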
@MaureenZOU, yes, batch size is an important hyperparameter for BN. It is recommended to use a large batch size (e.g. >8). As far as I know, there is no SyncBN in PyTorch; please try a third-party implementation if SyncBN is required.
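As a hedged side note (this may postdate the comment above): recent PyTorch releases do include torch.nn.SyncBatchNorm, but it only synchronizes statistics under DistributedDataParallel, not DataParallel. A minimal conversion sketch, assuming a DDP process group has already been initialized and local_rank is this process's GPU index:

```python
import torch
import torch.nn as nn

# Placeholder model containing BatchNorm layers.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16))

# Replace every BatchNorm layer with SyncBatchNorm (only effective under DDP).
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Assumed setup: torch.distributed.init_process_group(...) was called earlier.
# model = nn.parallel.DistributedDataParallel(model.cuda(local_rank), device_ids=[local_rank])
```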
@shipra25jain, thank you for pointing out this issue. In this repo, the learning rate is scheduled at each iteration, so last_epoch actually means last_iter. I will rename it to make the code more straightforward.
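To illustrate the point, here is a minimal sketch of polynomial decay stepped once per iteration; the names max_iters and power are assumptions, and this is not the repo's exact PolyLR implementation:

```python
import torch

def poly_factor(cur_iter, max_iters, power=0.9):
    # Multiplicative LR factor: (1 - cur_iter / max_iters) ** power
    return (1 - cur_iter / max_iters) ** power

params = [torch.nn.Parameter(torch.zeros(1))]  # dummy parameter for the sketch
optimizer = torch.optim.SGD(params, lr=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: poly_factor(it, max_iters=30_000))

for it in range(30_000):
    # ... forward / backward / optimizer.step() ...
    scheduler.step()  # called every iteration, so the internal last_epoch counts iterations
```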
This repo is really nice; the performance on Pascal VOC could be reproduced using 2 GPUs with batch size = 16.