dnn: cuda out of memory error in batch mode

ShuangLiu1992 commented

Hello Davis, I'm testing the new dnn face detector on my images and I noticed that for some batch sizes it reports:

Error while calling cudaMalloc(&backward_data_workspace, backward_data_workspace_size_in_bytes) in file dlib/dnn/cudnn_dlibapi.cpp:908. code: 2, reason: out of memory

However, the error goes away if I set the batch size to an even higher number, and the batch sizes that trigger it seem to be random.

Here is my code to reproduce the error with Ubuntu 16, CUDA 8.0, gcc 5.4, OpenCV 3.0, and 640 * 360 images. Batch size 4 leads to out of memory and batch size 16 doesn't. imgs is a std::vector<cv::Mat> holding the RGB versions of the test images.

// Orders detections by bounding-box area so the largest face can be selected.
auto compare_area = [](const dlib::mmod_rect &a, const dlib::mmod_rect &b) { return a.rect.area() < b.rect.area(); };

size_t batch_size = 4;
for (size_t i = 0; i < imgs.size(); i += batch_size) {
   // The last batch may hold fewer than batch_size images.
   std::vector<dlib::matrix<dlib::rgb_pixel>> images(std::min(batch_size, imgs.size() - i));
   for (size_t j = 0; j < images.size(); j++) {
       images[j] = dlib::mat(dlib::cv_image<dlib::rgb_pixel>(imgs[i + j]));
   }
   // Run the detector on the whole batch at once.
   std::vector<std::vector<dlib::mmod_rect>> boxes = net(images);
   for (size_t j = 0; j < images.size(); j++) {
       if (boxes[j].size() != 0) {
           // Keep only the largest detection for each image.
           _bounds[i + j] = std::max_element(boxes[j].begin(), boxes[j].end(), compare_area)->rect;
       }
   }

   progress.show_update("detecting faces");
   progress += images.size();
}

davisking commented

This is just an artifact of how cuDNN allocates memory and picks algorithms to run. You could try calling set_dnn_prefer_smallest_algorithms() which tells cuDNN to use less memory. That might make it behave in a less confusing way.

ShuangLiu1992 commented

hmmmm, that's odd, thank you! I will try set_dnn_prefer_smallest_algorithms()

langheran commented

Hello Davis, I am getting the same message on dnn_semantic_segmentation_train_ex. I tried reducing the crop size from 227x227 to 101x101, but now the loss calculation for the gradient descent step fails with an error. I also tried calling set_dnn_prefer_smallest_algorithms(), with no success. What other options are left?

davisking commented

Make batch sizes smaller or reduce the size of the network. There are a lot of options.

langheran commented

Ok, it is working now :) but the process is taking too long. I reduced the batch size from 30 to 4, which is the largest value I found empirically to be feasible.

Do you have any documentation that gives a hint about the required training time?

How long does dnn_semantic_segmentation_train_ex take normally to train on Pascal VOC?

How long would you say it would take on a Quadro M500M?

How do I know if the network is converging?

Thank you Davis

davisking commented

These things can take several days to train on the fastest GPUs. I don't know how fast your GPU is going to be, probably a lot slower.

The solver does automatic convergence checking so don't worry about it. It's explained here: http://blog.dlib.net/2018/02/automatic-learning-rate-scheduling-that.html
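
If you want a rough manual check alongside the solver's own, one simple option is to fit a straight line to the most recent loss values and look at the sign of its slope. The sketch below does only that. It is a simplified illustration, not dlib's implementation, and the names loss_slope and still_decreasing are hypothetical. It also ignores noise in the loss, so use a reasonably long window.

```cpp
#include <cstddef>
#include <vector>

// Ordinary least-squares slope of loss versus step index (0, 1, 2, ...).
// Returns 0 when there are fewer than two points.
double loss_slope(const std::vector<double>& losses)
{
    const std::size_t n = losses.size();
    if (n < 2) return 0.0;

    const double mean_x = (n - 1) / 2.0;
    double mean_y = 0.0;
    for (double y : losses) mean_y += y;
    mean_y /= n;

    double sxy = 0.0, sxx = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = i - mean_x;
        sxy += dx * (losses[i] - mean_y);
        sxx += dx * dx;
    }
    return sxy / sxx;
}

// Hypothetical check: treat the loss as still improving while the fitted
// slope over the recent window is negative.
bool still_decreasing(const std::vector<double>& recent_losses)
{
    return loss_slope(recent_losses) < 0.0;
}
```

Feeding it the last few hundred average-loss values printed during training gives a quick sanity check on whether the loss is still trending down.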

langheran commented

I am renting a P5000 in Parsec (paperspace), and training is now running with the original mini-batch size of 30. The average loss is consistently falling :D

Do you find it useful to tune the momentum or learning rate?


davisking commented

I usually leave those at their defaults. But you can try changing them to see what happens.