NVIDIA / retinanet-examples

Fast and accurate object detection with end-to-end GPU optimization
BSD 3-Clause "New" or "Revised" License

RuntimeError: CUDA error: out of memory on V100 #163

Closed abhaydoke09 closed 4 years ago

abhaydoke09 commented 4 years ago

[Screenshot: traceback ending in `RuntimeError: CUDA error: out of memory`]

ghost commented 4 years ago

This shouldn't happen under normal conditions on a V100. Check whether your GPU memory is already occupied with `nvidia-smi`, kill those processes, and start training again.

If that doesn't help, reduce the batch size.
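A minimal sketch of the check suggested above: `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader` prints one line per process holding GPU memory, which is easy to parse. The helper names here (`gpu_processes`, `parse_csv`) are illustrative, not part of odtk.

```python
import csv
import io
import subprocess

def parse_csv(text):
    """Parse nvidia-smi CSV lines like '12345, 15000 MiB'
    into (pid, used_memory) tuples."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if row:
            rows.append((int(row[0]), row[1].strip()))
    return rows

def gpu_processes():
    """List processes currently holding GPU memory (requires nvidia-smi)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_csv(out)
```

Any PID reported here that isn't yours is a candidate for why training OOMs on an otherwise idle card.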

abhaydoke09 commented 4 years ago

It's interesting. It suddenly fails while the model is being loaded. I have trained RetinaNet in PyTorch on the same machine. Is there a way to restrict training to a single GPU? Three GPUs are completely free; one GPU was in use.

abhaydoke09 commented 4 years ago

```
CUDA_VISIBLE_DEVICES=0 odtk train retinanet_rn50fpn.pth --backbone ResNet50FPN \
    --images /coco/images/train2017/ --annotations /coco/annotations/instances_train2017.json \
    --val-images /coco/images/val2017/ --val-annotations /coco/annotations/instances_val2017.json
```

Worked!!

This means the code forcibly uses all available GPUs on a machine. That is not good for shared GPU machines where multiple people are working at the same time.

You might want to add a flag for selecting a GPU id/number from the available GPUs.
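A sketch of what such a flag could look like: the trick behind the `CUDA_VISIBLE_DEVICES=0` workaround is that CUDA reads this variable once, at first initialization, so a hypothetical `--gpus` option only needs to set it before any CUDA library is loaded. The function name and flag are assumptions, not odtk's actual CLI.

```python
import argparse
import os

def restrict_gpus(argv=None):
    """Hypothetical --gpus flag: limit the process to the given device
    ids by setting CUDA_VISIBLE_DEVICES. This must run before the first
    CUDA initialization (e.g. before importing torch), since the
    variable is read only once."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpus", default=None,
                        help="comma-separated GPU ids, e.g. 0 or 1,3")
    args, _ = parser.parse_known_args(argv)
    if args.gpus is not None:
        os.environ["CUDA_VISIBLE_DEVICES"] = args.gpus
    return os.environ.get("CUDA_VISIBLE_DEVICES")
```

With `--gpus 1,3`, the process would then see exactly two devices, renumbered as 0 and 1.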

james-nvidia commented 4 years ago

You can also specify the devices when starting the container.
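For example, with Docker 19.03+ and the NVIDIA Container Toolkit, the `--gpus` flag controls which devices the container can see at all (the image name `odtk:latest` below is illustrative):

```shell
# Expose only GPU 0 to the container
docker run --gpus device=0 --rm -it odtk:latest

# A multi-GPU subset needs the quoted form
docker run --gpus '"device=0,1"' --rm -it odtk:latest

# Older nvidia-docker 2.x runtime: use NVIDIA_VISIBLE_DEVICES instead
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 --rm -it odtk:latest
```

Inside such a container the training process cannot touch the other GPUs, so no `CUDA_VISIBLE_DEVICES` workaround is needed.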