get_frontal_face_detector() is not sped up by CUDA

Hi. First of all, thank you for your library. It is very useful.

Expected Behavior

I would like to do face detection in real time, so I am trying to use dlib with CUDA. I compiled dlib with CUDA and it works fine: computing the face descriptors became about 100 times faster. This operation now takes 5 ms: std::vector<matrix<float,0,1>> faceDescriptor = net(facesfiltered);

So I am surprised that the time taken by get_frontal_face_detector() is not reduced by CUDA. With the same picture, detection takes about 500 ms without CUDA and about 500 ms with CUDA.

Current Behavior

I try to find faces in a picture (1280x720 px) with get_frontal_face_detector().

With dlib and CUDA: 543 ms
With dlib without CUDA: 542 ms

CUDA has no effect on get_frontal_face_detector().

Steps to Reproduce

1 - I compile dlib 19.14 without CUDA.
2 - I try to find faces:

#include <QCoreApplication>
#include <QTime>
#include <QDebug>
#include <dlib/dnn.h>
#include <dlib/gui_widgets.h>
#include <dlib/clustering.h>
#include <dlib/string.h>
#include <dlib/image_io.h>
#include <dlib/image_processing/frontal_face_detector.h>

using namespace dlib;
using namespace std;

template <template <int,template<typename>class,int,typename> class block, int N, template<typename>class BN, typename SUBNET>
using residual = add_prev1<block<N,BN,1,tag1<SUBNET>>>;

template <template <int,template<typename>class,int,typename> class block, int N, template<typename>class BN, typename SUBNET>
using residual_down = add_prev2<avg_pool<2,2,2,2,skip1<tag2<block<N,BN,2,tag1<SUBNET>>>>>>;

template <int N, template <typename> class BN, int stride, typename SUBNET>
using block  = BN<con<N,3,3,1,1,relu<BN<con<N,3,3,stride,stride,SUBNET>>>>>;

template <int N, typename SUBNET> using ares      = relu<residual<block,N,affine,SUBNET>>;
template <int N, typename SUBNET> using ares_down = relu<residual_down<block,N,affine,SUBNET>>;

template <typename SUBNET> using alevel0 = ares_down<256,SUBNET>;
template <typename SUBNET> using alevel1 = ares<256,ares<256,ares_down<256,SUBNET>>>;
template <typename SUBNET> using alevel2 = ares<128,ares<128,ares_down<128,SUBNET>>>;
template <typename SUBNET> using alevel3 = ares<64,ares<64,ares<64,ares_down<64,SUBNET>>>>;
template <typename SUBNET> using alevel4 = ares<32,ares<32,ares<32,SUBNET>>>;

using anet_type = loss_metric<fc_no_bias<128,avg_pool_everything<
                            alevel0<
                            alevel1<
                            alevel2<
                            alevel3<
                            alevel4<
                            max_pool<3,3,2,2,relu<affine<con<32,7,7,2,2,
                            input_rgb_image_sized<150>
                            >>>>>>>>>>>>;

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    frontal_face_detector detector = get_frontal_face_detector();

    shape_predictor sp;
    deserialize("C:/Users/jonas.gaudin/Documents/QTProject/DlibSpeedTest/shape_predictor_68_face_landmarks.dat") >> sp;
    anet_type net;
    deserialize("C:/Users/jonas.gaudin/Documents/QTProject/DlibSpeedTest/dlib_face_recognition_resnet_model_v1.dat") >> net;

    matrix<rgb_pixel> img;
    load_image(img, "C:/Users/jonas.gaudin/Pictures/source.jpg");

    QTime t1;
    t1.start();

    std::vector<matrix<rgb_pixel>> faces;
    auto detections = detector(img);

    qDebug() << t1.elapsed();
    return a.exec();
}

3 - dlib finds 2 faces in 545 ms (net takes 450 ms).
4 - I compile dlib with CUDA.
5 - I run the same code.
6 - dlib finds 2 faces in 543 ms (but dlib is using CUDA, because net now takes 5 ms).
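
The timings in these steps come from a single QTime measurement per run. A single run may include one-time costs (for example lazy initialisation on the first call), so the per-stage numbers could be cross-checked with a small averaging helper. The sketch below is only an illustration: mean_ms is a hypothetical name, it is not part of dlib or Qt, and the default of 10 runs is an arbitrary assumption.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper (not part of dlib or Qt): calls `stage` once as an
// untimed warm-up, then `runs` more times, and returns the mean wall-clock
// time per timed call in milliseconds. Returns 0.0 without calling `stage`
// when `runs` is 0.
template <typename F>
double mean_ms(F&& stage, std::size_t runs = 10)
{
    using clock = std::chrono::steady_clock;
    if (runs == 0)
        return 0.0;

    stage();  // warm-up call, excluded from the measurement

    const auto start = clock::now();
    for (std::size_t i = 0; i < runs; ++i)
        stage();
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;

    return elapsed.count() / static_cast<double>(runs);
}

// Possible use with the objects from the program above (assumed in scope):
//   double detect_ms = mean_ms([&] { detector(img); });
//   qDebug() << "detector:" << detect_ms << "ms";
```

Timing the detector call and the net call separately in this way keeps the two stages from being mixed into one figure when comparing the CUDA and non-CUDA builds.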

Version: 19.14, compiled with CUDA 9.2
Where did you get dlib: From dlib website
Platform: Windows 64-bit
Compiler: "Visual Studio 14 2015 Win64"
