eleow / tfKerasFRCNN

Faster R-CNN for tensorflow keras, packaged as a library
MIT License

GPU runs OOM when resuming training #6

Open Avani1994 opened 4 years ago

Avani1994 commented 4 years ago

Hi, thanks for this great FRCNN library; I found it after a lot of searching! I am using it to train Faster R-CNN on my custom dataset. Yesterday I successfully trained 22 epochs without any OOM, but today, when I try to resume training, I get an OOM right at the 23rd epoch. I have tried reducing img_size (300 to 150) and num_rois (256 to 128), along with adjusting anchor_box_scales, but still had no luck. Could you please help me figure out how to proceed? I am stuck on this error. I am running this in Google Colab, and the GPU is a Tesla P100-PCIE.

Here is the exception I am getting:

Continuing training based on previous trained model
Loading weights from FRCNN_vgg.hdf5
Already trained 22K batches
Epoch 23/62
Exception: OOM when allocating tensor with shape[25088,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node training_2/Adam/gradients/time_distributed_1/while/MatMul_grad/MatMul_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

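One thing worth checking on Colab before shrinking things further: the same settings trained fine yesterday, so an OOM on resume may come from the GPU already being partly occupied (for example by a stale session in the same runtime) rather than from the model itself. Restarting the runtime, or enabling GPU memory growth so TensorFlow allocates memory on demand instead of reserving the whole card up front, sometimes clears this kind of error; it will not help if the model genuinely no longer fits. A minimal sketch, assuming a TensorFlow 2.x runtime (on 1.x the equivalent is allow_growth on a ConfigProto):

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all at startup.
# This must run before anything creates a GPU context, i.e. at the very
# top of the notebook, before the model is built or weights are loaded.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```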

Avani1994 commented 4 years ago

Okay, training sort of resumed when I further decreased num_rois to 64 (with img_size = 150), though it is still very slow and it still hits OOM partway through an epoch (at batch 295/1000). Yesterday it was very fast with num_rois = 256 and img_size = 300. Can you please tell me if there is something wrong with the way I am training?

eleow commented 4 years ago

You are on the right track. Reducing the number of ROIs and the image size should help the most. Remember to scale your anchor_box_scales accordingly as well.
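For example, if the image size is halved, each anchor scale should be roughly halved too, so the anchors still cover objects of the same relative size. A quick sketch of that bookkeeping; the starting values below are placeholders, not the library's confirmed defaults, so substitute the anchor_box_scales you actually trained with:

```python
# Scale anchor_box_scales in proportion to the change in image size.
old_im_size = 300                        # size used in the original run
new_im_size = 150                        # reduced size to avoid OOM
old_anchor_box_scales = [64, 128, 256]   # placeholder values; use your own config

factor = new_im_size / old_im_size
new_anchor_box_scales = [int(s * factor) for s in old_anchor_box_scales]
print(new_anchor_box_scales)             # [32, 64, 128]
```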

Also, you should periodically save/back up your model after every few epochs. I found that performance sometimes ends up decreasing after too many epochs. Good luck!
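A simple, library-agnostic way to keep such backups is to copy the weights file to a timestamped name every few epochs. This sketch assumes the FRCNN_vgg.hdf5 filename from the log above, so adjust the path to your own setup (and, on Colab, copy into a mounted Google Drive folder so the backups survive a runtime reset):

```python
import shutil
from datetime import datetime

WEIGHTS_PATH = "FRCNN_vgg.hdf5"   # weights file written by training (see log above)

def backup_weights(epoch, every=5):
    """Copy the current weights to a timestamped backup every `every` epochs."""
    if epoch % every == 0:
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        shutil.copy(WEIGHTS_PATH, f"FRCNN_vgg_epoch{epoch:03d}_{stamp}.hdf5")
```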