bethgelab / siamese-mask-rcnn

Siamese Mask R-CNN model for one-shot instance segmentation

Out of memory #3

Closed DL-Alva closed 5 years ago

DL-Alva commented 5 years ago

When I try to run train.ipynb on my server, it reports out of memory and stops running once training reaches epoch 40. My server has about 11 GB of memory. How can I get this program to run successfully on my server?

michaelisc commented 5 years ago

Do you mean 11 GB of GPU memory or RAM?

As training is set up in train.ipynb, the deeper layers start training from epoch 40 onwards, which is why more memory is required from that point on. If you want to reduce memory consumption, you should reduce the batch size and scale the learning rate appropriately:

new_lr = old_lr * (new_batch_size/old_batch_size)

So if you halve the batch size, you should also halve the learning rate.
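
For illustration, here is the same rule as a runnable snippet (the numbers are hypothetical and only demonstrate the arithmetic, not the repo's actual defaults):

old_batch_size = 12  # hypothetical starting point
old_lr = 0.02        # hypothetical starting point

new_batch_size = 6   # reduced to fit into GPU memory
new_lr = old_lr * (new_batch_size / old_batch_size)

print(new_lr)  # 0.01: halving the batch size halves the learning rate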

michaelisc commented 5 years ago

Oh, and I found a small bug: from the second epoch on, training should run on the complete network, not the heads alone. I just fixed this, so please pull the repo again to get the correct schedule.

Also: if you want to reach the same performance we report in the paper, please use the training notebooks from the experiments folder. If you don't have 4 GPUs, just reduce that number to 1 and reduce the learning rate accordingly: new_lr = old_lr / 4. I never tried it, but this should give you the same results we got. As the network is bigger and the images larger, it will however also take longer to train.
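
As a sketch, the single-GPU override would then look roughly like this (the old learning rate here is a placeholder; use whatever value the experiment notebook actually sets):

old_learning_rate = 0.02  # placeholder: substitute the value from the experiment notebook
GPU_COUNT = 1             # down from the 4 GPUs used in the paper
LEARNING_RATE = old_learning_rate / 4  # scale the learning rate with the number of GPUs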

DL-Alva commented 5 years ago

Thank you very much for your answer. I will run the code again following your suggestion.

DL-Alva commented 5 years ago

I reran the fixed code. However, it still reports out of memory, this time while trying to allocate 1.87 GiB (last time it needed about 3 GiB more). Should I reduce the learning rate again?

michaelisc commented 5 years ago

Yes. You should try further reducing the batch size and learning rate. For example:

GPU_COUNT = 1
IMAGES_PER_GPU = 6
LEARNING_RATE = 0.01

That gives you a batch size of 6 with the learning rate halved. If that works, you can try increasing the batch size again.

DL-Alva commented 5 years ago

Thank you for your reply; I will try again following your suggestion. But it seems the batch size in the original code is 1, which can't be reduced further.

michaelisc commented 5 years ago

The default batch size is set to 1 in lib/config.py, but it is overridden in the TrainConfig subclass defined in the train.ipynb notebook. That's where you have to change it.
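
For reference, the override pattern looks roughly like this (a sketch assuming Config is importable from lib/config.py and that, as in the Matterport Mask R-CNN code this repo builds on, the effective batch size is GPU_COUNT * IMAGES_PER_GPU; check the notebook for the exact imports and values):

from lib.config import Config

class TrainConfig(Config):
    # These class attributes shadow the defaults from lib/config.py.
    GPU_COUNT = 1
    IMAGES_PER_GPU = 6   # effective batch size = GPU_COUNT * IMAGES_PER_GPU
    LEARNING_RATE = 0.01

config = TrainConfig()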

DL-Alva commented 5 years ago

I got it! Thank you for the reminder.