The `batch_size` does matter. If you want to train the baseline with batch size 16, just set `batch_size=16`; the program will divide it by the number of GPUs (2 in your setting). This is done by https://github.com/ZJULearning/RMI/blob/e3ada00f90104d1bb58726c4c28d6d50fd3bdf28/train.py#L133

Because we use `SynchronizedBatchNorm2d`, the results should be the same as `batch_size=16` on a single TITAN RTX GPU.

In short, `batch_size` should be the global batch size; `epochs` and `lr` will be adjusted for different `batch_size` automatically.
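For concreteness, a minimal sketch of that convention (the variable names here are illustrative, not the ones actually used in `train.py`):

```python
import torch

# The user passes the *global* batch size; the script derives the
# per-GPU batch size from the number of visible devices.
global_batch_size = 16
num_gpus = torch.cuda.device_count()                # 2 in the setting above
batch_size_per_gpu = global_batch_size // num_gpus  # 8 samples per GPU
```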
Hi @mzhaoshuai, many thanks. When using `torch.nn.DataParallel` for multi-GPU training, the loss is computed on the main GPU (gpu-0), which results in unbalanced memory usage across the GPUs. Under these circumstances, I couldn't train the baseline with global `batch_size=16` (`CUDA out of memory`) on two 2080Ti GPUs. I wonder if using `torch.nn.parallel.DistributedDataParallel` and `model = nn.SyncBatchNorm.convert_sync_batchnorm(model)` with `batch_size=8` per GPU for distributed training on two GPUs can also achieve results similar to those you report (trained with `torch.nn.DataParallel` and `model.sync_bn.syncbn.BatchNorm2d`)?
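For reference, a minimal sketch of the DDP + `SyncBatchNorm` setup being asked about, assuming a launch such as `torchrun --nproc_per_node=2 train.py` (`build_model()` is a placeholder for the actual model constructor):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)  # hypothetical constructor
# Convert every BatchNorm2d to SyncBatchNorm so BN statistics are computed
# over the global batch (2 GPUs x 8 samples = 16), mimicking the repo's
# SynchronizedBatchNorm2d.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[local_rank])
```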
In fact, the loss is calculated on different GPUs in our setting: https://github.com/ZJULearning/RMI/blob/e3ada00f90104d1bb58726c4c28d6d50fd3bdf28/full_model.py#L69 https://github.com/ZJULearning/RMI/blob/e3ada00f90104d1bb58726c4c28d6d50fd3bdf28/train.py#L206
However, as you said, the memory usage across the GPUs is still unbalanced.
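A rough sketch of the pattern behind those two links (the actual class in `full_model.py` differs in its details): the network and the criterion are wrapped in one module, so `nn.DataParallel` computes the loss on each replica and only per-replica scalar losses are gathered on gpu-0.

```python
import torch.nn as nn

class FullModel(nn.Module):
    """Wrap network + criterion so the loss is computed per replica."""
    def __init__(self, net, criterion):
        super().__init__()
        self.net = net
        self.criterion = criterion

    def forward(self, images, labels):
        logits = self.net(images)
        # Runs on the replica's own device; only the scalar loss is
        # copied back to gpu-0, where it is averaged over replicas.
        return self.criterion(logits, labels)

# usage sketch:
# model = nn.DataParallel(FullModel(net, criterion)).cuda()
# loss = model(images, labels).mean()
```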
Theoretically, distributed training on two GPUs can definitely achieve similar results (sadly, the code in this repo does not support distributed training; you may check the `pytorch/examples` repo for help).
In the end, I tried to train the baseline on 2 GPUs with 11GB of memory each (GTX 1080 Ti). There was no OOM, which is strange.
You can also try 'accumulating gradients'. It is slow but saves a lot of GPU memory (note: it will influence the statistics of the BN layers). https://discuss.pytorch.org/t/accumulating-gradients/30020
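A minimal sketch of that idea, assuming `model`, `optimizer`, and `loader` come from the existing training script: two micro-batches of 8 are accumulated before each optimizer step, giving an effective batch size of 16.

```python
accum_steps = 2  # 2 micro-batches of 8 -> effective batch size 16
optimizer.zero_grad()
for step, (images, labels) in enumerate(loader):
    loss = model(images.cuda(), labels.cuda())
    (loss / accum_steps).backward()  # scale so accumulated grads average
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Caveat from above: BN layers still only see 8 samples per forward pass,
# so their statistics differ from those of a true batch of 16.
```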
Hi, I only have two 2080Ti GPUs with 11GB of memory per GPU. I'd like to train the baseline DeepLabv3 with ResNet-101 as the backbone and `batch_size=8` per GPU (for 2 GPUs, global `batch_size=16`). I wonder if this is equivalent to training a DeepLabv3 model with `output_stride=16`, `crop_size=513`, and `batch_size=16` on a single TITAN RTX GPU? Will it achieve similar convergence in 23 epochs? Does the `batch_size` matter? If so, how can I adjust the other hyperparameters with `batch_size=8`, such as `epochs`, `lr`, and the `lr_scheduler`?