Are you timing that whole program?
dets = detector(img, 1)
First, try changing this to dets = detector(img, 0).
The next step is to use NEON optimizations; that is discussed here: https://github.com/davisking/dlib/issues/276. Another possibility is to run partial face detection (frontal faces only) - this will make it run about 2x faster while missing some faces. You can try reading this for more info.
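For reference, enabling NEON comes down to passing the right compiler flags when building dlib-based code. This is only a rough sketch (the source file name is a placeholder; -I.. assumes you compile from dlib's examples directory, mirroring the g++ commands quoted further down in this thread):
# build a GUI-free dlib test program with NEON SIMD enabled
g++ -DDLIB_NO_GUI_SUPPORT -DNO_MAKEFILE -std=c++11 -O3 -mfpu=neon -I.. -lpthread my_face_timing.cpp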
The TK1's CPU is quite slow, and the whole idea of the TK1 is to use the GPU for all processing tasks. Dlib does not support FHOG detectors on the GPU, but there are some in OpenCV.
One more problem with the TK1 is its 32-bit architecture, so the maximum CUDA version it supports is 6.5, while dlib requires at least CUDA 7.5.
Switching to a Jetson TX1/TX2 is required to run dlib's DNN algorithms.
@e-fominov thanks for the detailed reply.
we tried dets = detector(img, 0), and yes, it runs about 3x faster (about 1 s per face detection).
But it is still slow: we ran dlib on our PC (CPU at about 4.2 GHz) and it takes about 20 ms per face detection.
The TK1's CPU runs at about 2.3 GHz, yet the speed is so slow (a 50x gap) that we suspect some configuration is wrong, but we don't know how to debug it. Do you have any proposal? Thanks so much~
@e-fominov hello,
We then tried NEON, and it is faster: it takes about 700 ms to detect faces in one 400x600 picture, but I think that is still a little slow.
Our C++ code:
frontal_face_detector detector = get_frontal_face_detector();
load_image(img, argv[i]);
for (int count = 0; count < 100; count++)
{
    double t1 = cv::getTickCount();
    cout << "start to detect..." << endl;
    std::vector<rectangle> dets = detector(img);
    cout << "Number of faces detected: " << dets.size() << endl;
    double t2 = cv::getTickCount();
    std::cout << "Read time: " << (t2 - t1) * 1000 / cv::getTickFrequency() << " ms." << std::endl;
}
The result:
start to detect...
Number of faces detected: 1
Read time: 767.798 ms.
start to detect...
Number of faces detected: 1
Read time: 762.339 ms.
start to detect...
Number of faces detected: 1
Read time: 769.372 ms.
@davisking
Are you timing that whole program?
No, just one loop; it now takes about 1 s to detect one picture (400x600).
400x600 is quite a small resolution; I think there is no need to try smaller images. Next is the NEON question: this is not something officially supported and should be double-checked. Also check this doc for Jetson CPU speed tuning: http://elinux.org/Jetson/Performance
Other possible optimizations are not to use the image pyramid and to use only the frontal detector:
typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>, dlib::fhog_feature_extractor> image_scanner_type;
image_scanner_type scanner;
scanner.copy_configuration(detector.get_scanner());  // start from the default detector's scanner settings
scanner.set_max_pyramid_levels(1);                    // a single pyramid level: no downsampling, so only faces close to the base window size are found
detector = dlib::object_detector<image_scanner_type>(scanner, detector.get_overlap_tester(), detector.get_w());  // rebuild the detector with the reduced scanner configuration
This new detector will work about 4x faster, but it will miss non-frontal faces and will only detect a limited face size range (around 80 pixels).
But these are general optimizations and they will help on a PC too, while the 50x gap is something very different. I assume the TK1 has about half the CPU frequency, which brings the gap down to 25x; SIMD should give about a 2x-4x improvement, and the rest is probably architecture differences, memory speed and bandwidth.
To understand the real situation, I recommend measuring the face detection stages separately. The first stage is FHOG feature extraction:
dlib::array<dlib::array2d<double>> hog;
dlib::impl_fhog::impl_extract_fhog_features(img, hog, 8, 1, 1);
The real way to make face detection work well on the Tegra TK1 is to rewrite the code in CUDA - that is the main idea behind all the Jetsons.
I want to make some optimizations for TK1. @gangm , can you measure the time of this code on your TK1?
matrix<unsigned char, 1536, 2048> img;
dlib::array<array2d<double>> hog;
chrono::system_clock::time_point start = chrono::system_clock::now();
impl_fhog::impl_extract_fhog_features(img, hog, 8, 1, 1);
chrono::system_clock::time_point end = chrono::system_clock::now();
double msec = std::chrono::duration_cast<chrono::milliseconds>(end - start).count();
cout << msec << endl;
My timings:
My full test code and compiler settings are here.
Results with some Intel/AVX for reference:
i7/2.3GHz (Apple LLVM version 8.1.0/macOS 10.12.4)

Run | Flags | Duration (ms) | Notes
---|---|---|---
1. | -O3 | ~59 | Compiled, ran.
2. | -O3 -mavx | ~45 | Compiled, ran.

i7/4.2GHz (g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4))

Run | Flags | Duration (ms) | Notes
---|---|---|---
3. | -O3 | ~32 | Compiled, ran.
4. | -O3 -mavx | ~27 | Compiled, ran.

armv7/1.2GHz (g++ (Raspbian 4.9.2-10/Raspbian))

Run | Flags | Duration (ms) | Notes
---|---|---|---
5. | -O3 | ~2904 | Compiled, ran.
6. | -O3 -mfpu=neon | ~1267 | Compiled, ran.

armv6/0.7GHz (g++ (Raspbian 4.9.2-10)) [no vfpv2/vfpv3/vfpv3-neon support]

Run | Flags | Duration (ms) | Notes
---|---|---|---
7. | -O3 | ~7510.9 | Compiled, ran.
8. | -O3 -mfpu=neon | 👎 | Illegal instruction error @ runtime.
9. | -O3 -mfpu=vfp | ~7550.0 | Compiled, ran.
@bakercp, thanks for the detailed measurements. It looks like NEON support works now. If I compare the Pi 3 (1267 ms) and TK1 (300 ms) timings, the TK1 has about 2x higher CPU frequency (2.3 GHz), which leaves a remaining ~2x performance gap. This can be caused by processor architecture differences, different memory, etc.
And I have successfully improved my results by about 2x with profile-guided optimizations. Can you try them on the Pi 3? Compile with
"-O3 -mfpu=neon -fprofile-generate"
then run the application and re-compile with
"-O3 -mfpu=neon -fprofile-use"
then measure the time.
My measurements on the TK1 show the time dropping from 300 ms to 150 ms, and no visible improvement on the i7.
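To make the workflow explicit: the application has to be run between the two compilations so that gcc can first record and then use the profile data. A rough sketch only (the source file name is a placeholder, and ./a.out is simply g++'s default output name):
g++ -std=c++11 -O3 -mfpu=neon -fprofile-generate my_fhog_test.cpp
./a.out   # run on representative input; this writes the .gcda profile files
g++ -std=c++11 -O3 -mfpu=neon -fprofile-use my_fhog_test.cpp
./a.out   # time this second, profile-optimized binary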
My test code and compiler settings are here.
Updated RPI3 measurements:
armv7/1.2GHz (g++ (Raspbian 4.9.2-10/Raspbian))

Run | Flags | Duration (ms) | Notes
---|---|---|---
5. | -O3 | ~2904 | Compiled, ran.
6. | -O3 -mfpu=neon | ~1267 | Compiled, ran.
10a. | -O3 -mfpu=neon -fprofile-generate | ~5600 | Compiled, ran.
10b. | -O3 -mfpu=neon -fprofile-use | ~444 | Did 10a, then compiled, ran.
Wow! 🥇
Hi! I am also trying to apply profile-guided optimization on an Allwinner H3 (armv7/1.2 GHz), but unfortunately I cannot get the device to generate the .gcda files needed for the second compilation. In my case it also takes around 1250 ms to find a face in a 720x540 image, also using NEON.
These are the flags used in the Android.mk file:
LOCAL_LDFLAGS += --coverage -fprofile-generate=/sdcard/profile
LOCAL_CFLAGS += --coverage -fprofile-generate=/sdcard/profile
The manifest file includes android.permission.ACCESS_SUPERUSER.
The .gcno files are correctly generated in the local application folder.
I am also using NDK r10d and toolchain version 4.8.
Any advice?
@e-fominov
And I have successfully improved my results by about 2x with profile-guided optimizations. Compile with "-O3 -mfpu=neon -fprofile-generate", then run the application and re-compile with "-O3 -mfpu=neon -fprofile-use", then measure the time.
Does this work with the Python API?
Should I compile dlib twice? Something like this?
sudo python setup.py install --compiler-flags "-O3 -mfpu=neon -fprofile-generate"
Run my python code.
sudo python setup.py install --compiler-flags "-O3 -mfpu=neon -fprofile-use"
And re-run my python code again?
Hi, I tried all the options below, but I couldn't get the ~444 ms time:
sudo python setup.py install --compiler-flags "-O3 -mfpu=neon -fprofile-generate"
sudo python setup.py install --compiler-flags "-O3 -mfpu=neon -fprofile-use"
Please help. Thanks & regards, R. Harin
I cannot achieve the performance @bakercp measured. Do I need to remove dlib first? What is the correct command to use when compiling?
I use
sudo python setup.py install --compiler-flags "-mfpu=neon -fprofile-use"
And face_locations in Python takes more than 5 seconds on a picture with 640x480 resolution...
(PS: face_detector = dlib.get_frontal_face_detector())
https://stackoverflow.com/questions/4365980/how-to-use-profile-guided-optimizations-in-g has a discussion of -fprofile-generate and -fprofile-use. It may help.
See also https://en.wikipedia.org/wiki/Profile-guided_optimization
Warning: this issue has been inactive for 257 days and will be automatically closed on 2018-09-07 if there is no further activity.
If you are waiting for a response but haven't received one it's likely your question is somehow inappropriate. E.g. you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's documentation, or a Google search.
Notice: this issue has been closed because it has been inactive for 261 days. You may reopen this issue if it has been closed in error.
Hi, I have the numbers below for the profiling benefits,
on a Raspberry Pi 3 B.
I tested:
g++ -DDLIB_NO_GUI_SUPPORT -DNO_MAKEFILE -std=c++11 -O3 -mfpu=neon -fprofile-generate -I.. -lpthread fhog_simd_ex.cpp
389.39 ms
Then
g++ -DDLIB_NO_GUI_SUPPORT -DNO_MAKEFILE -std=c++11 -O3 -mfpu=neon -fprofile-use -I.. -lpthread fhog_simd_ex.cpp
~378.82 ms
Not a huge improvement.
How can I be sure that -fprofile-generate and -fprofile-use actually took effect?
I am trying to accelerate face detection and Face_ID (128D) extraction speed.
PS: face detection and the Face_descriptor (128D) extraction take almost the same amount of time. Best
Did you run the program between profile generation and use phases? You need to do that.
Did you mean that I need to run the program after:
g++ -DDLIB_NO_GUI_SUPPORT -DNO_MAKEFILE -std=c++11 -O3 -mfpu=neon -fprofile-generate -I.. -lpthread fhog_simd_ex.cpp
and then compile with: g++ -DDLIB_NO_GUI_SUPPORT -DNO_MAKEFILE -std=c++11 -O3 -mfpu=neon -fprofile-use -I.. -lpthread fhog_simd_ex.cpp
Right?
So far I just compiled with profile generation, then with profile use, and only then ran the app.
I will test and post here soon...
Yes. You need to read about what profiling is. The gcc documentation explains it.
Tested on a Raspberry Pi 3 B+:
The result is better with -O3 alone, without profiling.
Profiling didn't improve things.
How do you use NEON on the Raspberry Pi 3 B+? I have the same problem described above (about 700 ms per detection even with NEON).
@nizqsut
Run the program after compiling with: g++ -DDLIB_NO_GUI_SUPPORT -DNO_MAKEFILE -std=c++11 -O3 -mfpu=neon -fprofile-generate -I.. -lpthread fhog_simd_ex.cpp
then compile with: g++ -DDLIB_NO_GUI_SUPPORT -DNO_MAKEFILE -std=c++11 -O3 -mfpu=neon -fprofile-use -I.. -lpthread fhog_simd_ex.cpp
Hello,
Recently we started using dlib on our TK1 (ARM) board, but it seems to take too long (about 3 s) to detect one face in a picture.
We used 'pip install dlib' to install it, and ran a test with the code below:
detector = dlib.get_frontal_face_detector()
img = io.imread("/home/ubuntu/face.jpg")
for i in range(1000):
    dets = detector(img, 1)
    print("Number of faces detected: {}".format(len(dets)))
It takes about 3 s per picture. Do you know what is wrong and how to fix it? Thanks~ Does the BLAS library have that much impact?