It shouldn't happen under normal conditions on a V100. Check whether your GPU memory is already occupied using nvidia-smi, kill those processes, and start training again.
If that doesn't help, reduce the batch size.
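For reference, a minimal sequence along these lines should work (the PID shown is only a placeholder):

```
# Show per-GPU memory usage and the processes holding it
nvidia-smi
# Or list just the compute processes with their memory usage
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
# Terminate a stale process by its PID (placeholder value)
kill -9 12345
```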
That's interesting. It suddenly fails while the model is being loaded. I have trained RetinaNet in PyTorch on the same machine. Is there a way to restrict training to a single GPU? Three GPUs were completely free; one GPU was in use.
```
CUDA_VISIBLE_DEVICES=0 odtk train retinanet_rn50fpn.pth --backbone ResNet50FPN \
    --images /coco/images/train2017/ --annotations /coco/annotations/instances_train2017.json \
    --val-images /coco/images/val2017/ --val-annotations /coco/annotations/instances_val2017.json
```
Worked!!
This means the code forcibly uses all available GPUs on the machine. That is not ideal for shared GPU machines where multiple people are training at the same time.
You might want to add a flag for selecting a GPU id/number from the available GPUs.
You can also specify the devices when starting the container.
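For example, with Docker and the NVIDIA Container Toolkit, something like the following should expose only one GPU to the container (the image name is a placeholder):

```
# Expose only GPU 0 inside the container; everything launched in it sees a single device
docker run --gpus '"device=0"' --rm -it <your-retinanet-image>
```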