davisking / dlib

A toolkit for making real world machine learning and data analysis applications in C++
http://dlib.net
Boost Software License 1.0

dnn: cuda out of memory error in batch mode #322

Closed ShuangLiu1992 closed 7 years ago

ShuangLiu1992 commented 7 years ago

Hello Davis, I'm testing the new dnn face detector on my images and I noticed that for some batch sizes it reports:

Error while calling cudaMalloc(&backward_data_workspace, backward_data_workspace_size_in_bytes) in file dlib/dnn/cudnn_dlibapi.cpp:908. code: 2, reason: out of memory

However, the error goes away if I set the batch size to an even higher number, and the batch sizes that reproduce it seem to be random.

Please find below my code to reproduce the error with Ubuntu 16, CUDA 8.0, GCC 5.4, OpenCV 3.0, and 640x360 images: batch size 4 leads to out of memory while batch size 16 doesn't. imgs is a std::vector<cv::Mat> holding RGB versions of the test images.

// Orders detections by area, so max_element picks the largest face box.
auto compare_area = [](const dlib::mmod_rect &a, const dlib::mmod_rect &b) { return a.rect.area() < b.rect.area(); };

size_t batch_size = 4;
for (size_t i = 0; i < imgs.size(); i += batch_size) {
    // Copy the next batch of OpenCV images into dlib matrices.
    std::vector<dlib::matrix<dlib::rgb_pixel>> images(std::min(batch_size, imgs.size() - i));
    for (size_t j = 0; j < images.size(); j++) {
        images[j] = dlib::mat(dlib::cv_image<dlib::rgb_pixel>(imgs[i + j]));
    }

    // Run the detector on the whole batch at once.
    std::vector<std::vector<dlib::mmod_rect>> boxes = net(images);
    for (size_t j = 0; j < images.size(); j++) {
        if (boxes[j].size() != 0) {
            // Keep only the largest detection for each image.
            _bounds[i + j] = std::max_element(boxes[j].begin(), boxes[j].end(), compare_area)->rect;
        }
    }

    progress.show_update("detecting faces");
    progress += images.size();
}
davisking commented 7 years ago

This is just an artifact of how cuDNN allocates memory and picks algorithms to run. You could try calling set_dnn_prefer_smallest_algorithms() which tells cuDNN to use less memory. That might make it behave in a less confusing way.
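
A minimal sketch of where that call goes; it only shows the placement, the detector setup and the batched loop are the ones from your snippet above:

#include <dlib/dnn.h>

int main()
{
    // Ask cuDNN to pick the algorithms with the smallest workspace
    // requirements, trading some speed for a smaller and more predictable
    // GPU memory footprint.  Call this once, before the network runs.
    dlib::set_dnn_prefer_smallest_algorithms();

    // ... deserialize the detector and run the batched loop from the
    // snippet above (net(images)) unchanged.
    return 0;
}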

ShuangLiu1992 commented 7 years ago

hmmmm, that's odd, thank you! I will try set_dnn_prefer_smallest_algorithms()

langheran commented 6 years ago

Hello Davis, I am getting the same message in dnn_semantic_segmentation_train_ex. I tried reducing the crop size from 227x227 to 101x101, but then the loss calculation for gradient descent throws an error. I also tried set_dnn_prefer_smallest_algorithms(); with no success. What unexplored options are left?

davisking commented 6 years ago

Make batch sizes smaller or reduce the size of the network. There are a lot of options.
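
As a rough sketch of what "smaller batch sizes" means in code (using a deliberately tiny toy network and hypothetical data, not the actual segmentation net from the example): the mini-batch size is simply the number of samples handed to each train_one_step() call, and the peak GPU memory per step scales with it.

#include <dlib/dnn.h>
#include <vector>

using namespace dlib;

// A deliberately tiny toy network, only here so the sketch compiles; the real
// segmentation network from the example is far larger.
using toy_net = loss_multiclass_log<fc<10, relu<fc<32, input<matrix<float>>>>>>;

int main()
{
    toy_net net;
    dnn_trainer<toy_net> trainer(net);
    trainer.set_learning_rate(0.1);

    std::vector<matrix<float>> samples;   // hypothetical training data
    std::vector<unsigned long> labels;
    // ... fill samples and labels ...

    // The mini-batch size is how many samples go into each training step.
    // Lowering it (e.g. from 30 to 4) directly lowers peak GPU memory use.
    const size_t mini_batch = 4;
    for (size_t i = 0; i + mini_batch <= samples.size(); i += mini_batch)
    {
        std::vector<matrix<float>> batch_samples(samples.begin()+i, samples.begin()+i+mini_batch);
        std::vector<unsigned long> batch_labels(labels.begin()+i, labels.begin()+i+mini_batch);
        trainer.train_one_step(batch_samples, batch_labels);
    }
    return 0;
}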

langheran commented 6 years ago

Ok, it is working now :) but the process is taking a very long time. The batch size was reduced from 30 to 4, that being the largest value I found empirically to be feasible.

Do you have any documentation you have written that could give me a hint about the required training time?

How long does dnn_semantic_segmentation_train_ex take normally to train on Pascal VOC?

How long would you say it would take on a Quadro M500M?

How do I know if the network is converging?

Thank you Davis

langheran commented 6 years ago

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v9.0\bin\win64\Debug\deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro M500M"
  CUDA Driver Version / Runtime Version          9.1 / 9.0
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2048 MBytes (2147483648 bytes)
  ( 3) Multiprocessors, (128) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            1124 MHz (1.12 GHz)
  Memory Clock rate:                             900 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 6 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
davisking commented 6 years ago

These things can take several days to train on the fastest GPUs. I don't know how fast your GPU is going to be, probably a lot slower.

The solver does automatic convergence checking so don't worry about it. It's explained here: http://blog.dlib.net/2018/02/automatic-learning-rate-scheduling-that.html
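Concretely, the usual pattern looks something like the sketch below (toy network again, just so it compiles): the trainer watches the training loss, shrinks the learning rate whenever progress stalls, and you stop once the learning rate has been shrunk below a threshold.

#include <dlib/dnn.h>

using namespace dlib;

// Toy network from the earlier sketch, repeated so this compiles on its own.
using toy_net = loss_multiclass_log<fc<10, relu<fc<32, input<matrix<float>>>>>>;

int main()
{
    toy_net net;
    dnn_trainer<toy_net> trainer(net);
    trainer.be_verbose();                                      // prints steps without apparent progress, etc.
    trainer.set_learning_rate(0.1);
    trainer.set_iterations_without_progress_threshold(2000);   // how long to wait before deciding progress stalled
    trainer.set_learning_rate_shrink_factor(0.1);              // shrink the learning rate 10x each time it stalls

    // Train until the learning rate has been shrunk far enough; at that point
    // the trainer has effectively decided the network has converged.
    while (trainer.get_learning_rate() >= 1e-4)
    {
        // build a mini-batch and call trainer.train_one_step(...) here
        break;  // placeholder so this sketch terminates
    }
    return 0;
}
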

langheran commented 6 years ago

I am renting a P5000 through Parsec (Paperspace); it is now running with the original mini-batch size of 30, and the average loss is consistently falling :D

Do you find it useful to tune the momentum or learning rate?

(screenshot of training output showing the average loss decreasing)

davisking commented 6 years ago

I usually leave those at their defaults. But you can try changing them to see what happens.
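
For completeness, the two knobs live in slightly different places: momentum (and weight decay) are arguments of the sgd solver passed to the trainer, while the learning rate is set on the trainer itself. A minimal sketch with dlib's default values spelled out (toy network again, only so it compiles):

#include <dlib/dnn.h>

using namespace dlib;

// Toy network from the earlier sketch, repeated so this compiles on its own.
using toy_net = loss_multiclass_log<fc<10, relu<fc<32, input<matrix<float>>>>>>;

int main()
{
    toy_net net;

    // sgd's constructor takes (weight_decay, momentum); 0.0005 and 0.9 are
    // dlib's defaults, written out explicitly so they are easy to change.
    dnn_trainer<toy_net> trainer(net, sgd(0.0005, 0.9));

    // The initial learning rate is a property of the trainer.
    trainer.set_learning_rate(0.1);
    return 0;
}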