NVIDIA / semantic-segmentation

Nvidia Semantic Segmentation monorepo
BSD 3-Clause "New" or "Revised" License

CUDA out of memory after first epoch #171

Closed · StevenChumak closed this issue 2 years ago

StevenChumak commented 2 years ago

Hi,

I am trying to train an ocrnet.HRNet_Mscale model on a custom dataset, but I run out of memory after the first epoch and cannot figure out where the problem lies.

I am using 2x RTX 2080 Ti with 11 GB each. CUDA version: 10.2, Python: 3.6.9, PyTorch version: 1.10.0+cu102.

The command I use to launch training:

python train.py \
--arch ocrnet.HRNet_Mscale \
--lr 5e-3 \
--n_scales "0.5,1.0,2.0" \
--dataset xxx \
--crop_size "480, 960" \
--bs_trn 1 \
--trunk "resnet50"

The dataloader (modeled on the Cityscapes dataloader):

# Imports added for completeness; assumed to mirror the repo's datasets/cityscapes.py
import os

from config import cfg
from runx.logx import logx
from datasets.base_loader import BaseLoader
from datasets import uniform


class Loader(BaseLoader):
    num_classes = 2
    ignore_label = 255
    trainid_to_name = {}
    color_mapping = []

    def __init__(self, mode, quality='semantic', joint_transform_list=None,
                 img_transform=None, label_transform=None, eval_folder=None):

        super(Loader, self).__init__(quality=quality,
                                     mode=mode,
                                     joint_transform_list=joint_transform_list,
                                     img_transform=img_transform,
                                     label_transform=label_transform)

        self.root = cfg.DATASET.TRAINRAILS_DIR

        splits = {'train': 'trn',
                  'val': 'val'}
        split_name = splits[mode]
        img_ext = 'png'
        mask_ext = 'png'
        img_root = os.path.join(self.root, 'img', split_name)
        mask_root = os.path.join(self.root, 'msk', split_name)

        self.all_imgs = self.find_images(img_root, mask_root, img_ext,
                                         mask_ext)

        logx.msg(f'cn num_classes {self.num_classes}')
        self.centroids = uniform.build_centroids(self.all_imgs,
                                                 self.num_classes,
                                                 self.train,
                                                 cv=cfg.DATASET.CV)

        self.build_epoch()

And the error:

> --arch ocrnet.HRNet_Mscale \
> --lr 5e-3 \
> --n_scales "0.5,1.0,2.0" \
> --dataset xxx \
> --crop_size "480, 960" \
> --bs_trn 1
None
Using regular batch norm
dataset = xxx
ignore_label = 255
cn num_classes 2
cn num_classes 2
Loading centroid file /home/<Me>/Desktop/uniform_centroids/xxx_cv0_tile1024.json
Found 2 centroids
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 25
cls 0 len 150
cls 1 len 92
Using Cross Entropy Loss
Using Cross Entropy Loss
Trunk: hrnetv2
Model params = 72.1M
/home/<Me>/Desktop/semantic-segmentation/env/lib/python3.6/site-packages/torch/nn/functional.py:3680: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details. 
  "The default behavior for interpolate/upsample with float scale_factor changed "
/home/<Me>/Desktop/semantic-segmentation/env/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
[epoch 0], [iter 1 / 12], [train main loss 0.935954], [lr 0.005000] [batchtime 0]
[epoch 0], [iter 2 / 12], [train main loss 0.971551], [lr 0.005000] [batchtime 0]
[epoch 0], [iter 3 / 12], [train main loss 0.924270], [lr 0.005000] [batchtime 0]
[epoch 0], [iter 4 / 12], [train main loss 0.910658], [lr 0.005000] [batchtime 0]
[epoch 0], [iter 5 / 12], [train main loss 0.838493], [lr 0.005000] [batchtime 0]
[epoch 0], [iter 6 / 12], [train main loss 0.839687], [lr 0.005000] [batchtime 0]
[epoch 0], [iter 7 / 12], [train main loss 0.796560], [lr 0.005000] [batchtime 0]
[epoch 0], [iter 8 / 12], [train main loss 0.821920], [lr 0.005000] [batchtime 0]
[epoch 0], [iter 9 / 12], [train main loss 0.880481], [lr 0.005000] [batchtime 0]
[epoch 0], [iter 10 / 12], [train main loss 0.881139], [lr 0.005000] [batchtime 0]
[epoch 0], [iter 11 / 12], [train main loss 0.885493], [lr 0.005000] [batchtime 0.547]
[epoch 0], [iter 12 / 12], [train main loss 0.876890], [lr 0.005000] [batchtime 0.519]
Traceback (most recent call last):
  File "train.py", line 650, in <module>
    main()
  File "train.py", line 504, in main
    validate(val_loader, net, criterion_val, optim, epoch)
  File "train.py", line 622, in validate
    args, val_idx)
  File "/home/<Me>/Desktop/semantic-segmentation/utils/trnval_utils.py", line 141, in eval_minibatch
    output_dict = net(inputs)
  File "/home/<Me>/Desktop/semantic-segmentation/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/<Me>/Desktop/semantic-segmentation/env/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/<Me>/Desktop/semantic-segmentation/env/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/<Me>/Desktop/semantic-segmentation/env/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/<Me>/Desktop/semantic-segmentation/env/lib/python3.6/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/<Me>/Desktop/semantic-segmentation/env/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/<Me>/Desktop/semantic-segmentation/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/<Me>/Desktop/semantic-segmentation/network/ocrnet.py", line 332, in forward
    return self.nscale_forward(inputs, cfg.MODEL.N_SCALES)
  File "/home/<Me>/Desktop/semantic-segmentation/network/ocrnet.py", line 224, in nscale_forward
    outs = self._fwd(x)
  File "/home/<Me>/Desktop/semantic-segmentation/network/ocrnet.py", line 173, in _fwd
    _, _, high_level_features = self.backbone(x)
  File "/home/<Me>/Desktop/semantic-segmentation/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/<Me>/Desktop/semantic-segmentation/network/hrnetv2.py", line 447, in forward
    feats = torch.cat([x[0], x1, x2, x3], 1)
RuntimeError: CUDA out of memory. Tried to allocate 4.28 GiB (GPU 0; 10.76 GiB total capacity; 6.78 GiB already allocated; 211.19 MiB free; 8.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Could somebody point me in the right direction on where to look for possible solutions?

Thanks

ajtao commented 2 years ago

It's running out of memory in the validation stage. Are you using fp16? That will save memory. You could also lower the multi-scale inference a little, to --n_scales "0.5,1.0" or --n_scales "0.5,1.0,1.5", which should help as well.
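For context, inference at scale 2.0 runs the network on roughly four times as many pixels as the base resolution, so dropping that scale gives the largest single saving. A possible adjusted launch command combining both suggestions might look like the sketch below (only --fp16 and --n_scales differ from the command posted above; the flag is assumed to be spelled --fp16 as referenced in this thread):

python train.py \
--arch ocrnet.HRNet_Mscale \
--lr 5e-3 \
--n_scales "0.5,1.0" \
--dataset xxx \
--crop_size "480, 960" \
--bs_trn 1 \
--trunk "resnet50" \
--fp16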

StevenChumak commented 2 years ago

Thank you! Using both the fp16 flag and dropping the inference scales down to 0.5 and 1.0 finally allowed me to train the model.