The `batch_size` does matter. If you want to train the baseline with batch size 16, just set `batch_size=16`; the program will divide it by the number of GPUs (2 in your setting). This is done by https://github.com/ZJULearning/RMI/blob/e3ada00f90104d1bb58726c4c28d6d50fd3bdf28/train.py#L133

Because we use `SynchronizedBatchNorm2d`, the results should be the same as `batch_size=16` on a single TITAN RTX GPU.

In short, `batch_size` should be the global batch size; `epochs` and `lr` will be adjusted for different `batch_size` automatically.
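For concreteness, a minimal sketch of that convention (the variable names here are illustrative, not the ones actually used in `train.py`):

```python
import torch

# The user passes the *global* batch size; the script derives the
# per-GPU batch size from the number of visible devices.
global_batch_size = 16
num_gpus = torch.cuda.device_count()                # 2 in the setting above
batch_size_per_gpu = global_batch_size // num_gpus  # 8 samples per GPU
```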
Hi @mzhaoshuai, many thanks. When using `torch.nn.DataParallel` for multi-GPU training, the loss is computed on the main GPU (gpu-0), which results in unbalanced memory usage across the GPUs. Under these circumstances, I couldn't train the baseline with global `batch_size=16` (`CUDA out of memory`) on two 2080Ti GPUs. I wonder if using `torch.nn.parallel.DistributedDataParallel` and `model = nn.SyncBatchNorm.convert_sync_batchnorm(model)` with `batch_size=8` per GPU for distributed training on two GPUs can also achieve results similar to those you report (trained with `torch.nn.DataParallel` and `model.sync_bn.syncbn.BatchNorm2d`)?
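For reference, a minimal sketch of the DDP + `SyncBatchNorm` setup being asked about, assuming a launch such as `torchrun --nproc_per_node=2 train.py` (`build_model()` is a placeholder for the actual model constructor):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)  # hypothetical constructor
# Convert every BatchNorm2d to SyncBatchNorm so BN statistics are computed
# over the global batch (2 GPUs x 8 samples = 16), mimicking the repo's
# SynchronizedBatchNorm2d.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[local_rank])
```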
In fact, the loss is calculated on different GPUs in our setting: https://github.com/ZJULearning/RMI/blob/e3ada00f90104d1bb58726c4c28d6d50fd3bdf28/full_model.py#L69 https://github.com/ZJULearning/RMI/blob/e3ada00f90104d1bb58726c4c28d6d50fd3bdf28/train.py#L206
However, as you said, the memory usage across the GPUs is still unbalanced.
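A rough sketch of the pattern behind those two links (the actual class in `full_model.py` differs in its details): the network and the criterion are wrapped in one module, so `nn.DataParallel` computes the loss on each replica and only per-replica scalar losses are gathered on gpu-0.

```python
import torch.nn as nn

class FullModel(nn.Module):
    """Wrap network + criterion so the loss is computed per replica."""
    def __init__(self, net, criterion):
        super().__init__()
        self.net = net
        self.criterion = criterion

    def forward(self, images, labels):
        logits = self.net(images)
        # Runs on the replica's own device; only the scalar loss is
        # copied back to gpu-0, where it is averaged over replicas.
        return self.criterion(logits, labels)

# usage sketch:
# model = nn.DataParallel(FullModel(net, criterion)).cuda()
# loss = model(images, labels).mean()
```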
Theoretically, distributed training on two GPUs can definitely achieve similar results (sadly, the code in this repo does not support distributed training; you may check the `pytorch/examples` repo for help).
In the end, I tried to train the baseline on 2 GPUs with 11GB of memory each (GTX 1080 Ti). There was no OOM, which is strange.
You can also try 'accumulating gradients'. It is slow but saves a lot of GPU memory (note: it will influence the statistics of the BN layers). https://discuss.pytorch.org/t/accumulating-gradients/30020
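A minimal sketch of that idea, assuming `model`, `optimizer`, and `loader` come from the existing training script: two micro-batches of 8 are accumulated before each optimizer step, giving an effective batch size of 16.

```python
accum_steps = 2  # 2 micro-batches of 8 -> effective batch size 16
optimizer.zero_grad()
for step, (images, labels) in enumerate(loader):
    loss = model(images.cuda(), labels.cuda())
    (loss / accum_steps).backward()  # scale so accumulated grads average
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Caveat from above: BN layers still only see 8 samples per forward pass,
# so their statistics differ from those of a true batch of 16.
```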
Hi, I only have two 2080Ti GPUs with 11GB of memory per GPU. I'd like to train the baseline DeepLabv3 with ResNet-101 as the backbone and `batch_size=8` per GPU (for 2 GPUs, global `batch_size=16`). I wonder if this is equivalent to training a DeepLabv3 model with `output_stride=16`, `crop_size=513`, and `batch_size=16` on a single TITAN RTX GPU? Will it achieve similar convergence in 23 epochs? Does the `batch_size` matter? If so, how can I adjust the other hyperparameters with `batch_size=8`, such as `epochs`, `lr`, and the `lr_scheduler`?