davisking / dlib

A toolkit for making real world machine learning and data analysis applications in C++
http://dlib.net
Boost Software License 1.0

DNN face detection speed issue #401

Closed pythonanonuser closed 7 years ago

pythonanonuser commented 7 years ago

I've been playing around with the new face detector in /examples/dnn_mmod_face_detection_ex.cpp. I ran the program on the /examples/faces directory exactly as written in the example file itself, but I'm noticing the speed is nowhere near 45-50 ms per image.

I've enabled AVX instructions, compiled in release mode, and I'm testing on an NVIDIA Tesla K80 with CUDA 8.0 and cuDNN 5.1. After commenting out the upsampling lines the speed improves, but it is still nowhere near 50 ms per image.

Is there something missing in the example code that I need? Your help is appreciated, thanks!

davisking commented 7 years ago

What speeds are you getting?

Also, I don't think the K80 is as fast as a Titan X, which is what was used to generate the timing numbers in the dlib blog, which I assume is what you are referring to. And you aren't timing the whole program's execution, are you?

pythonanonuser commented 7 years ago

@davisking I'm just timing the line auto dets = net(img); and I've commented out all the window display code. Here are the times I'm getting (in ms) for the images in the faces folder in the examples directory:

2990, 2502, 2698, 1753, 2385, 2678, 906, 2212, 2836, 900

davisking commented 7 years ago

If those are the times then I don't see how it could possibly be running on a K80. I bet you are actually running on a CPU, and you don't even have any BLAS library installed. How do you know it's using CUDA?

pythonanonuser commented 7 years ago

@davisking Just FYI, for those running times I had not commented out the image upsampling code in the example program. The only thing I changed was commenting out the window display code.

I'm not sure how to verify that the program is using CUDA, but based on the cmake output during the dlib installation I know dlib linked to MKL, CUDA 8.0 and cuDNN 5.1. All the CUDA stuff was built as well.

davisking commented 7 years ago

If you want to compare to the times in the blog post you need to use images of the same size. You can use nvidia-smi to see if you are actually running anything on the gpu.

pythonanonuser commented 7 years ago

@davisking I verified with nvidia-smi that the process is using the GPU.

I took a look at some of the pictures in the faces directory. They seem to be roughly 640x480. I feel like the times I posted are too slow even for slightly larger images and possibly a slightly slower GPU. Is there anything I'm missing? I'd like to run the detector on a couple thousand images.

davisking commented 7 years ago

The example program upsamples the images until they are at least 1800x1800, which is hugely different from 640x480. Also, are you timing multiple calls to the network? The first call has startup overhead because CUDA has a long startup sequence where it hooks into the drivers and sets up the CUDA runtime environment.
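To keep the one-off CUDA startup cost out of the numbers, the first call can be run untimed before measuring. Below is a minimal sketch of that idea in plain C++. The helper names `time_after_warmup` and `mean_ms` are hypothetical and not part of dlib, and the callable stands in for whatever is being measured, such as the `net(img)` call discussed in this thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a dlib function): calls `work` once untimed as a
// warm-up, then times `runs` further calls. Returns one wall-clock duration
// per timed call, in milliseconds.
template <typename Work>
std::vector<double> time_after_warmup(Work&& work, std::size_t runs)
{
    using clock = std::chrono::steady_clock;

    work();  // warm-up call: absorbs one-off startup cost, not recorded

    std::vector<double> ms;
    ms.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i)
    {
        const auto begin = clock::now();
        work();
        const auto end = clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(end - begin).count());
    }
    return ms;
}

// Hypothetical helper: arithmetic mean of the recorded durations.
inline double mean_ms(const std::vector<double>& ms)
{
    assert(!ms.empty());
    double sum = 0.0;
    for (double t : ms)
        sum += t;
    return sum / static_cast<double>(ms.size());
}
```

With the example program, something like `time_after_warmup([&] { net(img); }, 10)` would presumably give steady-state timings for a single image, assuming `net` and `img` are set up as in the example.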

pythonanonuser commented 7 years ago

@davisking Commenting out the upsampling lines led to the following run times in ms:

215, 56, 57, 37, 51, 56, 63, 55, 57, 173

These times seem much more reasonable. However, as expected, without upsampling it only found faces in 4/10 images instead of the actual 9/10. Is upsampling to 1800x1800 really necessary to get the full accuracy of this detector? Is there a smaller size I could upsample to, or is 1800x1800 my safest bet here?

For your reference, I've posted the exact code I ran. Let me know if I've missed something.

// The contents of this file are in the public domain. See LICENSE_FOR_EXAMPLE_PROGRAMS.txt
/*
    This example shows how to run a CNN based face detector using dlib.  The
    example loads a pretrained model and uses it to find faces in images.  The
    CNN model is much more accurate than the HOG based model shown in the
    face_detection_ex.cpp example, but takes much more computational power to
    run, and is meant to be executed on a GPU to attain reasonable speed.  For
    example, on a NVIDIA Titan X GPU, this example program processes images at
    about the same speed as face_detection_ex.cpp.

    Also, users who are just learning about dlib's deep learning API should read
    the dnn_introduction_ex.cpp and dnn_introduction2_ex.cpp examples to learn
    how the API works.  For an introduction to the object detection method you
    should read dnn_mmod_ex.cpp

    TRAINING THE MODEL
        Finally, users interested in how the face detector was trained should
        read the dnn_mmod_ex.cpp example program.  It should be noted that the
        face detector used in this example uses a bigger training dataset and
        larger CNN architecture than what is shown in dnn_mmod_ex.cpp, but
        otherwise training is the same.  If you compare the net_type statements
        in this file and dnn_mmod_ex.cpp you will see that they are very similar
        except that the number of parameters has been increased.

        Additionally, the following training parameters were different during
        training: The following lines in dnn_mmod_ex.cpp were changed from
            mmod_options options(face_boxes_train, 40*40);
            trainer.set_iterations_without_progress_threshold(300);
        to the following when training the model used in this example:
            mmod_options options(face_boxes_train, 80*80);
            trainer.set_iterations_without_progress_threshold(8000);

        Also, the random_cropper was left at its default settings, so we didn't
        call these functions:
            cropper.set_chip_dims(200, 200);
            cropper.set_min_object_height(0.2);

        The training data used to create the model is also available at 
        http://dlib.net/files/data/dlib_face_detection_dataset-2016-09-30.tar.gz
*/

#include <iostream>
#include <dlib/dnn.h>
#include <dlib/data_io.h>
#include <dlib/image_processing.h>
#include <dlib/gui_widgets.h>
#include <stdio.h>
#include <chrono>

using namespace std;
using namespace dlib;

// ----------------------------------------------------------------------------------------

template <long num_filters, typename SUBNET> using con5d = con<num_filters,5,5,2,2,SUBNET>;
template <long num_filters, typename SUBNET> using con5  = con<num_filters,5,5,1,1,SUBNET>;

template <typename SUBNET> using downsampler  = relu<affine<con5d<32, relu<affine<con5d<32, relu<affine<con5d<16,SUBNET>>>>>>>>>;
template <typename SUBNET> using rcon5  = relu<affine<con5<45,SUBNET>>>;

using net_type = loss_mmod<con<1,9,9,1,1,rcon5<rcon5<rcon5<downsampler<input_rgb_image_pyramid<pyramid_down<6>>>>>>>>;

// ----------------------------------------------------------------------------------------

int main(int argc, char** argv) try
{
    if (argc == 1)
    {
        cout << "Call this program like this:" << endl;
        cout << "./dnn_mmod_face_detection_ex mmod_human_face_detector.dat faces/*.jpg" << endl;
        cout << "\nYou can get the mmod_human_face_detector.dat file from:\n";
        cout << "http://dlib.net/files/mmod_human_face_detector.dat.bz2" << endl;
        return 0;
    }

    net_type net;
    deserialize(argv[1]) >> net;  
    int total_files = 0;
    int face_files = 0;
    // image_window win;
    for (int i = 2; i < argc; ++i)
    {
        total_files++;
        matrix<rgb_pixel> img;
        load_image(img, argv[i]);

        // Upsampling the image will allow us to detect smaller faces but will cause the
        // program to use more RAM and run longer.
        // while(img.size() < 1800*1800)
        //     pyramid_up(img);

        // Note that you can process a bunch of images in a std::vector at once and it runs
        // much faster, since this will form mini-batches of images and therefore get
        // better parallelism out of your GPU hardware.  However, all the images must be
        // the same size.  To avoid this requirement on images being the same size we
        // process them individually in this example.
        auto begin = chrono::high_resolution_clock::now();
        auto dets = net(img);
        auto end = chrono::high_resolution_clock::now();    
        auto dur = end - begin;
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(dur).count();
        cout << ms << endl;
        if (dets.size() > 0)
        {
            face_files++;
        }

        if (total_files % 1 == 0)
            cout << total_files << "\n";

        // win.clear_overlay();
        // win.set_image(img);
        // for (auto&& d : dets)
        //     win.add_overlay(d);

        // cout << "Hit enter to process the next image." << endl;
        // cin.get();
    }
    cout<<"Total Images: "<<total_files;
    cout<<"\nFace Images: "<<face_files<<"\n";
}
catch(std::exception& e)
{
    cout << e.what() << endl;
}

davisking commented 7 years ago

It depends on the size of the face you want to find. The detector will only find faces that are bigger than about 80x80 pixels. So if you want to find smaller faces you have to upsample the image. The total image size is irrelevant.
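The arithmetic behind this can be sketched in a few lines of C++. The first function estimates how many upsampling passes are needed for the smallest face of interest to reach the roughly 80-pixel minimum mentioned above. The second mirrors the area-based `while(img.size() < 1800*1800)` loop from the example program. Both function names are hypothetical, and both assume that each `pyramid_up` pass exactly doubles the image in each dimension, which is only an approximation of what dlib does.

```cpp
#include <cassert>

// Hypothetical helper (not a dlib function): number of upsampling passes
// needed before a face of `smallest_face_px` pixels reaches the detector's
// minimum size. ASSUMPTION: each pass doubles every dimension. The default
// of 80 is the approximate figure given in this thread.
inline int upsamples_needed(int smallest_face_px, int min_detectable_px = 80)
{
    assert(smallest_face_px > 0 && min_detectable_px > 0);
    int passes = 0;
    long size = smallest_face_px;
    while (size < min_detectable_px)
    {
        size *= 2;  // one assumed 2x upsampling pass
        ++passes;
    }
    return passes;
}

// Hypothetical helper: number of passes the example's area-based loop would
// run, i.e. keep doubling until rows * cols >= min_area. Same 2x assumption.
inline int upsamples_until_area(long rows, long cols, long min_area)
{
    assert(rows > 0 && cols > 0);
    int passes = 0;
    while (rows * cols < min_area)
    {
        rows *= 2;
        cols *= 2;
        ++passes;
    }
    return passes;
}
```

Under these assumptions, a 640x480 image goes through two passes in the example's loop and ends up around 2560x1920, about 16 times the original pixel count. If the smallest faces of interest are already around 40 pixels, a single pass would be enough by the face-size criterion.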

fenollp commented 7 years ago

@davisking And with the HOG detector, how do I know which minimum size the faces need to be?

davisking commented 7 years ago

It's about the same. Also, when you run it you see what it does :)

fengpingsh commented 6 years ago

Does the DNN face detector work with Intel Movidius? I'd like to use it to accelerate the computation.

davisking commented 6 years ago

Never heard of Movidius. So probably not, unless the acceleration is via a BLAS interface or CUDA.

yerzhik commented 5 years ago

I'm running dlib on the GPU (I can confirm that) using the CNN detector. It takes 40 ms to run the detection code on one image of size 640x480x1 (grayscale): auto dets = net(matrix); No upsampling is done (upsample_num_times is zero).

Is that OK or too slow? I'm using a GTX 1070.

The frontal_face_detector detect method does the detection in 90 ms.

davisking commented 5 years ago

That sounds fine

yerzhik commented 5 years ago

Thank you, I just thought the GPU would be more than 2x faster.