Duankaiwen / CenterNet

Code for our paper "CenterNet: Keypoint Triplets for Object Detection".

CUDA OOM on batch size 1 (batch norm) #116

Closed · avinashkaur93 closed this issue 4 years ago

avinashkaur93 commented 4 years ago

Hi @Duankaiwen, I replicated your code and ran several experiments successfully on the COCO dataset with the following environment: PyTorch 1.0.0, CUDA 10.1.168, gcc 5.4.0.

In the same environment, training on my own dataset hits a CUDA OOM (single GPU, batch size = 1). My input image size is the same, [511, 511]. Training runs for roughly 400 steps before it suddenly fails with OOM. There is no steady increase in GPU memory, so it does not look like a memory leak either.
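For anyone who wants to reproduce the memory check, something along these lines reports per-step allocation (a minimal sketch only; `model`, `optimizer`, and `train_loader` are placeholders for my setup, not names from this repo):

```python
# Minimal sketch -- `model`, `optimizer`, and `train_loader` are placeholders,
# not names from this repo. Prints the current and peak GPU memory after each
# step; a leak would show up as a steadily growing "allocated" number.
import torch

def train_with_memory_log(model, optimizer, train_loader):
    for step, (images, targets) in enumerate(train_loader):
        optimizer.zero_grad()
        loss = model(images, targets)   # hypothetical forward/loss call
        loss.backward()
        optimizer.step()

        allocated = torch.cuda.memory_allocated() / 1024 ** 3      # GiB currently in use
        peak = torch.cuda.max_memory_allocated() / 1024 ** 3       # GiB peak since start
        print("step %d: allocated %.2f GiB, peak %.2f GiB" % (step, allocated, peak))
```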

Here's the complete log trace and config: log.txt

Last few lines of the log:

    File "/mnt/dfs/avinashk/CenterNet/CenterNet-owndata-tensorboard/CenterNet/models/py_utils/utils.py", line 15, in forward
        bn = self.bn(conv)
    File "/home/avinashk/miniconda3/envs/CenterNet-PT10-TF/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
        result = self.forward(*input, **kwargs)
    File "/home/avinashk/miniconda3/envs/CenterNet-PT10-TF/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 76, in forward
        exponential_average_factor, self.eps)
    File "/home/avinashk/miniconda3/envs/CenterNet-PT10-TF/lib/python3.6/site-packages/torch/nn/functional.py", line 1623, in batch_norm
        training, momentum, eps, torch.backends.cudnn.enabled
    RuntimeError: CUDA out of memory. Tried to allocate 3.71 GiB (GPU 0; 10.92 GiB total capacity; 7.31 GiB already allocated; 2.67 GiB free; 25.91 MiB cached)

What mainly confuses me is that the error is raised inside batch norm. I'm a TensorFlow user and fairly new to PyTorch, so any help would be appreciated.
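From what I understand (please correct me if I'm wrong), the OOM can surface at whichever allocation happens to push memory over the limit, so the batch-norm line in the trace may not be the real culprit. A minimal sketch of how the memory counters could be dumped at the failure point; `model` and `batch` are placeholders, not names from this codebase:

```python
# Minimal sketch -- `model` and `batch` are placeholders, not names from this
# repo. Catches the OOM RuntimeError and prints the CUDA memory counters so
# the failing step can be compared with the steps that succeeded.
import torch

def forward_with_oom_report(model, batch):
    try:
        return model(batch)                     # hypothetical forward pass
    except RuntimeError as err:
        if "out of memory" in str(err):
            print("allocated: %.2f GiB" % (torch.cuda.memory_allocated() / 1024 ** 3))
            print("cached:    %.2f GiB" % (torch.cuda.memory_cached() / 1024 ** 3))
        raise
```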