Closed pythonanonuser closed 7 years ago
What speeds are you getting?
Also, I don't think the K80 is as fast as a Titan X, which is what was used to generate the timing numbers in the dlib blog post, which I assume is what you are referring to. Also, you aren't timing the whole program's execution, are you?
@davisking I'm just timing the line auto dets = net(img);
and I've commented out all the window display code. Here are the times I'm getting (in ms) for the images in the faces folder in the examples directory:
2990, 2502, 2698, 1753, 2385, 2678, 906, 2212, 2836, 900
If those are the times then I don't see how it could possibly be running on a K80. I bet you are actually running on a CPU, and you don't even have any BLAS library installed. How do you know it's using CUDA?
@davisking Just FYI for those running times I did not comment out upsampling the image code present in the example program. The only thing I changed was commenting out the window display code.
I'm not sure how to verify that the program is using CUDA, but based on the cmake output during the dlib installation I know dlib linked to mkl and CUDA 8.0 and cuDNN 5.1. All the CUDA stuff was built as well.
If you want to compare to the times in the blog post you need to use images of the same size. You can use nvidia-smi to see if you are actually running anything on the GPU.
@davisking I verified with nvidia-smi that the process is using the GPU.
I took a look at some of the pictures in the faces directory. They seem to be roughly 640x480. I feel like the times I posted are too slow, even for slightly larger images and a slightly slower GPU. Is there anything I'm missing? I'd like to run the detector on a couple thousand images.
The example program upsamples the images until they are at least 1800x1800, which is hugely different from 640x480. Also, are you timing multiple calls to the network? The first call has startup overhead because CUDA has a long startup sequence where it hooks into the drivers and sets up the CUDA runtime environment.
@davisking Commenting out the upsampling lines led to the following run times in ms:
215, 56, 57, 37, 51, 56, 63, 55, 57, 173
These times seem much more reasonable. However, as expected, without upsampling it only found faces in 4/10 images instead of the actual 9/10. Is upsampling to 1800x1800 really necessary for full accuracy with this detector? Is there a smaller size I could upsample to, or is 1800x1800 my safest bet here?
For your reference, I've posted the exact code I ran. Let me know if I've missed something.
// The contents of this file are in the public domain. See LICENSE_FOR_EXAMPLE_PROGRAMS.txt
/*
This example shows how to run a CNN based face detector using dlib. The
example loads a pretrained model and uses it to find faces in images. The
CNN model is much more accurate than the HOG based model shown in the
face_detection_ex.cpp example, but takes much more computational power to
run, and is meant to be executed on a GPU to attain reasonable speed. For
example, on a NVIDIA Titan X GPU, this example program processes images at
about the same speed as face_detection_ex.cpp.
Also, users who are just learning about dlib's deep learning API should read
the dnn_introduction_ex.cpp and dnn_introduction2_ex.cpp examples to learn
how the API works. For an introduction to the object detection method you
should read dnn_mmod_ex.cpp
TRAINING THE MODEL
Finally, users interested in how the face detector was trained should
read the dnn_mmod_ex.cpp example program. It should be noted that the
face detector used in this example uses a bigger training dataset and
larger CNN architecture than what is shown in dnn_mmod_ex.cpp, but
otherwise training is the same. If you compare the net_type statements
in this file and dnn_mmod_ex.cpp you will see that they are very similar
except that the number of parameters has been increased.
Additionally, the following training parameters were different during
training: The following lines in dnn_mmod_ex.cpp were changed from
mmod_options options(face_boxes_train, 40*40);
trainer.set_iterations_without_progress_threshold(300);
to the following when training the model used in this example:
mmod_options options(face_boxes_train, 80*80);
trainer.set_iterations_without_progress_threshold(8000);
Also, the random_cropper was left at its default settings, so we didn't
call these functions:
cropper.set_chip_dims(200, 200);
cropper.set_min_object_height(0.2);
The training data used to create the model is also available at
http://dlib.net/files/data/dlib_face_detection_dataset-2016-09-30.tar.gz
*/
#include <iostream>
#include <dlib/dnn.h>
#include <dlib/data_io.h>
#include <dlib/image_processing.h>
#include <dlib/gui_widgets.h>
#include <stdio.h>
#include <chrono>
using namespace std;
using namespace dlib;
// ----------------------------------------------------------------------------------------
template <long num_filters, typename SUBNET> using con5d = con<num_filters,5,5,2,2,SUBNET>;
template <long num_filters, typename SUBNET> using con5 = con<num_filters,5,5,1,1,SUBNET>;
template <typename SUBNET> using downsampler = relu<affine<con5d<32, relu<affine<con5d<32, relu<affine<con5d<16,SUBNET>>>>>>>>>;
template <typename SUBNET> using rcon5 = relu<affine<con5<45,SUBNET>>>;
using net_type = loss_mmod<con<1,9,9,1,1,rcon5<rcon5<rcon5<downsampler<input_rgb_image_pyramid<pyramid_down<6>>>>>>>>;
// ----------------------------------------------------------------------------------------
int main(int argc, char** argv) try
{
    if (argc == 1)
    {
        cout << "Call this program like this:" << endl;
        cout << "./dnn_mmod_face_detection_ex mmod_human_face_detector.dat faces/*.jpg" << endl;
        cout << "\nYou can get the mmod_human_face_detector.dat file from:\n";
        cout << "http://dlib.net/files/mmod_human_face_detector.dat.bz2" << endl;
        return 0;
    }

    net_type net;
    deserialize(argv[1]) >> net;
    int total_files = 0;
    int face_files = 0;
    // image_window win;
    for (int i = 2; i < argc; ++i)
    {
        total_files++;
        matrix<rgb_pixel> img;
        load_image(img, argv[i]);

        // Upsampling the image will allow us to detect smaller faces but will cause the
        // program to use more RAM and run longer.
        // while(img.size() < 1800*1800)
        //     pyramid_up(img);

        // Note that you can process a bunch of images in a std::vector at once and it runs
        // much faster, since this will form mini-batches of images and therefore get
        // better parallelism out of your GPU hardware. However, all the images must be
        // the same size. To avoid this requirement on images being the same size we
        // process them individually in this example.
        auto begin = chrono::high_resolution_clock::now();
        auto dets = net(img);
        auto end = chrono::high_resolution_clock::now();
        auto dur = end - begin;
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(dur).count();
        cout << ms << endl;

        if (dets.size() > 0)
            face_files++;
        cout << total_files << "\n";

        // win.clear_overlay();
        // win.set_image(img);
        // for (auto&& d : dets)
        //     win.add_overlay(d);
        // cout << "Hit enter to process the next image." << endl;
        // cin.get();
    }
    cout << "Total Images: " << total_files;
    cout << "\nFace Images: " << face_files << "\n";
}
catch (std::exception& e)
{
    cout << e.what() << endl;
}
It depends on the size of the face you want to find. The detector will only find faces that are bigger than about 80x80 pixels. So if you want to find smaller faces you have to upsample the image. The total image size is irrelevant.
@davisking And with the HOG detector, how do I know which minimum size the faces need to be?
It's about the same. Also, when you run it you see what it does :)
Does the DNN face detection work with the Intel Movidius, which I'd like to use to accelerate the computation?
Never heard of Movidius. So probably not, unless the acceleration is via a BLAS interface or CUDA.
I'm running dlib on a GPU (I can confirm that) using the CNN detector. It takes 40 ms to run the detection code for one 640x480x1 (grayscale) image:
auto dets = net(matrix);
No upsampling is done (upsample_num_times is zero).
I wonder if that's OK or too slow? A GTX 1070 is used.
The frontal_face_detector detect method does detection in 90 ms.
That sounds fine
Thank you, I just thought the GPU would be faster by more than 2x.
I've been playing around with the new face detector in /examples/dnn_mmod_face_detection_ex.cpp. I ran the program on the /examples/faces directory exactly as written in the example file itself, but I'm noticing the speed is nowhere near 45-50 ms per image.
I've enabled AVX instructions, compiled in release mode, and I'm testing on an NVIDIA Tesla K80 with CUDA 8.0 and cuDNN 5.1. I noticed that after commenting out the upsampling lines the speed improves, but it's still nowhere near 50 ms per image.
Is there something missing in the example code that I need? Your help is appreciated, thanks!