davisking / dlib

A toolkit for making real world machine learning and data analysis applications in C++
http://dlib.net
Boost Software License 1.0

detector too slow on the TK1 board #557

Closed gangm closed 6 years ago

gangm commented 7 years ago

hello,

Recently we started using dlib on our TK1 (ARM) board, but it seems to take too long (about 3 s) to detect one face in a picture.

We installed it with 'pip install dlib' and ran a test with the code below:

    detector = dlib.get_frontal_face_detector()
    img = io.imread("/home/ubuntu/face.jpg")
    for i in range(1000):
        dets = detector(img, 1)
        print("Number of faces detected: {}".format(len(dets)))

It takes about 3 s to process one picture. Do you know what is wrong and how to fix it? Thanks. Does the BLAS library have that much of an impact?

davisking commented 7 years ago

Are you timing that whole program?

e-fominov commented 7 years ago

dets = detector(img, 1)

first try changing this to dets = detector(img, 0)
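The second argument is the number of times the image is upsampled before detection, so detector(img, 0) skips upsampling entirely; that is faster, but it can miss small faces. A rough C++ sketch of the same idea (the image path is just an example, and dlib is assumed to be set up as usual):

    #include <dlib/image_processing/frontal_face_detector.h>
    #include <dlib/image_io.h>
    #include <dlib/image_transforms.h>
    #include <dlib/array2d.h>
    #include <iostream>

    int main()
    {
        dlib::frontal_face_detector detector = dlib::get_frontal_face_detector();
        dlib::array2d<unsigned char> img;
        dlib::load_image(img, "face.jpg");     // illustrative path

        // dlib::pyramid_up(img);              // uncomment to mimic detector(img, 1)
        std::vector<dlib::rectangle> dets = detector(img);   // like detector(img, 0)
        std::cout << "Number of faces detected: " << dets.size() << std::endl;
    }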

Next step is to use NEON optimizations; this is discussed here: https://github.com/davisking/dlib/issues/276 Another possibility is to run partial face detection (frontal faces only) - this will make it run about 2x faster while missing some faces. You can try reading this for more info

The TK1's CPU is quite slow, and the whole idea of the TK1 is to use the GPU for all processing tasks. Dlib does not support FHOG detectors on the GPU, but there are some in OpenCV.

One more problem with the TK1 is its 32-bit architecture, so the maximum CUDA version for it is 6.5, while dlib requires at least CUDA 7.5.

Switching to a Jetson TX1/TX2 is required to run dlib's DNN algorithms.

gangm commented 7 years ago

@e-fominov thanks for the detailed reply.

We tried dets = detector(img, 0), and yes, it runs about 3x faster (about 1 s per face detection).

But it is still slow. We also tried running dlib on our PC (CPU at about 4.2 GHz), and there the speed is about 20 ms per face detection.

The TK1's CPU runs at about 2.3 GHz, but it is so much slower (a 50x gap) that we suspect some configuration is wrong. We don't know how to debug it - do you have any proposal? Thanks so much.

gangm commented 7 years ago

@e-fominov hello,

We then tried NEON, and it is faster: it now takes about 700 ms to detect faces in one 400x600 picture, which I think is still a little slow.

our C++ code:

    frontal_face_detector detector = get_frontal_face_detector();
    load_image(img, argv[i]);   // img is a dlib image type declared earlier
    for (int count = 0; count < 100; count++)
    {
        double t1 = cv::getTickCount();
        cout << "start to detect..." << endl;
        std::vector<rectangle> dets = detector(img);
        cout << "Number of faces detected: " << dets.size() << endl;
        double t2 = cv::getTickCount();
        std::cout << "Read time: " << (t2 - t1) * 1000 / cv::getTickFrequency() << " ms." << std::endl;
    }

the result:

    start to detect...
    Number of faces detected: 1
    Read time: 767.798 ms.
    start to detect...
    Number of faces detected: 1
    Read time: 762.339 ms.
    start to detect...
    Number of faces detected: 1
    Read time: 769.372 ms.

gangm commented 7 years ago

@davisking
Are you timing that whole program? -- No, just one loop; it now takes about 1 s to detect one 400x600 picture.

e-fominov commented 7 years ago

400x600 is quite a small resolution, so I think there is no need to try smaller images. Next is the NEON question - this is not something officially supported and should be double-checked. Also check this doc for Jetson CPU speed tuning: http://elinux.org/Jetson/Performance

Other possible optimizations are to not use the image pyramid and to use only the frontal detector:

        // Reconfigure the detector's scanner to use a single pyramid level,
        // so the image is scanned at only one scale.
        typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>, dlib::fhog_feature_extractor> image_scanner_type;
        image_scanner_type scanner;
        scanner.copy_configuration(detector.get_scanner());
        scanner.set_max_pyramid_levels(1);
        detector = dlib::object_detector<image_scanner_type>(scanner, detector.get_overlap_tester(), detector.get_w());

This new detector will work about 4x faster, but it will miss some faces: it only detects faces in a limited size range (around 80 pixels). These are general optimizations, though, and they will help on a PC too, while the 50x gap is something very different. I assume the TK1 has about 2x lower CPU frequency, which brings the gap down to 25x; SIMD should give about a 2x-4x improvement, and the rest is likely architecture differences, memory speed, and bandwidth.

To understand the real situation, I recommend measuring the face detection stages separately. The first stage is FHOG feature extraction:

        // img is the input image loaded earlier (e.g. a dlib::array2d<unsigned char>)
        dlib::array<dlib::array2d<double>> hog;
        dlib::impl_fhog::impl_extract_fhog_features(img, hog, 8, 1, 1);

The real way to make face detection work well on the Tegra TK1 is to rewrite the code in CUDA - that is the main idea of all the Jetsons.

e-fominov commented 7 years ago

I want to make some optimizations for TK1. @gangm , can you measure the time of this code on your TK1?

    matrix<unsigned char, 1536, 2048> img;
    dlib::array<array2d<double>> hog;
    chrono::system_clock::time_point start = chrono::system_clock::now();
    impl_fhog::impl_extract_fhog_features(img, hog, 8, 1, 1);
    chrono::system_clock::time_point end = chrono::system_clock::now();
    double msec = std::chrono::duration_cast<chrono::milliseconds>(end - start).count();
    cout << msec << endl;
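For reference, a self-contained variant of the snippet above - a minimal sketch, assuming dlib is installed as usual; it uses the public dlib::extract_fhog_features() instead of the internal impl_ call, and a heap-allocated array2d instead of the fixed-size matrix, which should not change the timing meaningfully:

    #include <dlib/image_transforms.h>
    #include <dlib/array2d.h>
    #include <dlib/array.h>
    #include <chrono>
    #include <iostream>

    int main()
    {
        // Same 1536x2048 8-bit test image as above, allocated on the heap.
        dlib::array2d<unsigned char> img(1536, 2048);
        dlib::assign_all_pixels(img, 128);

        dlib::array<dlib::array2d<double>> hog;

        auto start = std::chrono::steady_clock::now();
        dlib::extract_fhog_features(img, hog, 8, 1, 1);   // 8x8 cells, 1 pixel filter padding
        auto end = std::chrono::steady_clock::now();

        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
                  << " ms" << std::endl;
    }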

My timings:

  1. i7/2.2 GHz (MinGW 6.3/Windows 7 x64) AVX = ~45 ms
  2. i7/2.2 GHz (MinGW 6.3/Windows 7 x64) SSE2 = ~100 ms
  3. TK1 (GCC 6.2) = ~12300 ms
  4. TK1 (GCC 6.2) -Ofast = ~1100 ms
  5. TK1 (GCC 6.2) + NEON = ~300 ms

bakercp commented 7 years ago

My full test code and compiler settings are here.

Results with some Intel/AVX for reference:

Intel (Ivy Bridge)

i7/2.3GHz (Apple LLVM version 8.1.0/macOS 10.12.4)

Run Flags Duration (ms) Notes
1. -O3 ~59 Compiled, ran.
2. -O3 -mavx ~45 Compiled, ran.

Intel (Kaby Lake)

i7/4.2GHz (g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4))

Run Flags Duration (ms) Notes
3. -O3 ~32 Compiled, ran.
4. -O3 -mavx ~27 Compiled, ran.

Raspberry Pi 3 Model B [rev. a02082] (circa 2016)

armv7/1.2GHz (g++ (Raspbian 4.9.2-10/Raspbian))

Run Flags Duration (ms) Notes
5. -O3 ~2904 Compiled, ran.
6. -O3 -mfpu=neon ~1267 Compiled, ran.

Raspberry Pi 1 Model B [rev. 0002] (circa 2012)

[no vfpv2/vfpv3/vfpv3-neon support] armv6/0.7GHz (g++ (Raspbian 4.9.2-10))

Run Flags Duration (ms) Notes
7. -O3 ~7510.9 Compiled, ran.
8. -O3 -mfpu=neon 👎 Illegal instruction error @ runtime.
9. -O3 -mfpu=vfp ~7550.0 Compiled, ran.

e-fominov commented 7 years ago

@bakercp, thanks for the detailed measurements. It looks like NEON support works now. If I compare the Pi 3 (1267 ms) and TK1 (300 ms) timings, the TK1 has a 2x higher CPU frequency (2.3 GHz), which leaves a remaining 2x performance gap. That can be caused by processor architecture differences, different memory, etc.

I have also successfully improved my results by about 2x with profile-guided optimizations. Can you try them on the Pi 3?

Build with "-O3 -mfpu=neon -fprofile-generate", run the application, then re-compile with "-O3 -mfpu=neon -fprofile-use" and measure the time again.

My measurements on the TK1 show a time reduction from 300 ms to 150 ms, and no visible improvement on the i7.

bakercp commented 7 years ago

My test code and compiler settings are here.

Updated RPI3 measurements:

Raspberry Pi 3 Model B [rev. a02082] (circa 2016)

armv7/1.2GHz (g++ (Raspbian 4.9.2-10/Raspbian))

Run Flags Duration (ms) Notes
5. -O3 ~2904 Compiled, ran.
6. -O3 -mfpu=neon ~1267 Compiled, ran.
10a. -O3 -mfpu=neon -fprofile-generate ~5600 Compiled, ran.
10b. -O3 -mfpu=neon -fprofile-use ~444 Did 10a, then compiled, ran.

Wow! 🥇

jaglanaccess commented 7 years ago

Hi! I am also trying to apply profile-guided optimization, on an Allwinner H3 (armv7/1.2 GHz), but unfortunately I cannot generate on the device the .gcda file needed for the second compilation. In my case it also takes around 1250 ms to find a face in a 720x540 image, also using NEON.

These are the flags used in the Android.mk file:

    LOCAL_LDFLAGS += --coverage -fprofile-generate=/sdcard/profile
    LOCAL_CFLAGS += --coverage -fprofile-generate=/sdcard/profile

The manifest file includes android.permission.ACCESS_SUPERUSER.

The .gcno files are correctly generated in the local application folder.

Also, I am using NDK r10d and toolchain version 4.8.

Any advice?

openedhardware commented 7 years ago

@e-fominov

And I have successfully improved my results about 2x with profile-guided optimizations. "-O3 -mfpu=neon -fprofile-generate" then run application and re-compile with "-O3 -mfpu=neon -fprofile-use" then measure time

Does this work with python API?

Should I compile dlib twice? Something like this?

    sudo python setup.py install --compiler-flags "-O3 -mfpu=neon -fprofile-generate"

Run my python code.

    sudo python setup.py install --compiler-flags "-O3 -mfpu=neon -fprofile-use"

And re-run my python code again?

veeraharin commented 7 years ago

Hi, I tried all the options below, but I couldn't get an output time of ~444 ms:

    sudo python setup.py install --compiler-flags "-O3 -mfpu=neon -fprofile-generate"
    sudo python setup.py install --compiler-flags "-O3 -mfpu=neon -fprofile-use"

Please help. Thanks & regards, R. Harin

jamesweb1 commented 6 years ago

I cannot achieve the performance @bakercp measured. Do I need to remove dlib first? What is the correct command to use when compiling?

I use:

    sudo python setup.py install --compiler-flags "-mfpu=neon -fprofile-use"

And recognizing face_locations in Python on a 640x480 picture takes more than 5 seconds...

(ps: face_detector = dlib.get_frontal_face_detector())

xiongyihui commented 6 years ago

https://stackoverflow.com/questions/4365980/how-to-use-profile-guided-optimizations-in-g has a discussion of -fprofile-generate and -fprofile-use. It may help.

See also https://en.wikipedia.org/wiki/Profile-guided_optimization

dlib-issue-bot commented 6 years ago

Warning: this issue has been inactive for 257 days and will be automatically closed on 2018-09-07 if there is no further activity.

If you are waiting for a response but haven't received one it's likely your question is somehow inappropriate. E.g. you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's documentation, or a Google search.

dlib-issue-bot commented 6 years ago

Notice: this issue has been closed because it has been inactive for 261 days. You may reopen this issue if it has been closed in error.

MyraBaba commented 6 years ago

Hi, I have the numbers below for the profiling benefits,

on a Raspberry Pi 3 B.

I tested :

g++ -DDLIB_NO_GUI_SUPPORT -DNO_MAKEFILE -std=c++11 -O3 -mfpu=neon -fprofile-generate -I.. -lpthread fhog_simd_ex.cpp

389.39 ms

Then

g++ -DDLIB_NO_GUI_SUPPORT -DNO_MAKEFILE -std=c++11 -O3 -mfpu=neon -fprofile-use -I.. -lpthread fhog_simd_ex.cpp

~378.82 ms

Not a huge improvement.

How can I be sure that -fprofile-generate and -fprofile-use took effect?

I am trying to accelerate face detection and Face_ID (128D) descriptor extraction.

PS: face detection and the 128D face descriptor extraction take almost the same amount of time. Best

davisking commented 6 years ago

Did you run the program between profile generation and use phases? You need to do that.

MyraBaba commented 6 years ago

Did you mean that I need to run the program after:

    g++ -DDLIB_NO_GUI_SUPPORT -DNO_MAKEFILE -std=c++11 -O3 -mfpu=neon -fprofile-generate -I.. -lpthread fhog_simd_ex.cpp

and then compile with:

    g++ -DDLIB_NO_GUI_SUPPORT -DNO_MAKEFILE -std=c++11 -O3 -mfpu=neon -fprofile-use -I.. -lpthread fhog_simd_ex.cpp

Right?

I just ran the profiling build, then the profile_use build, and finally ran the app.

I will test and post here soon...

davisking commented 6 years ago

Yes. You need to read about what profiling is. The gcc documentation explains it.

MyraBaba commented 6 years ago

Tested on a Raspberry Pi 3 B+:

The result is better with -O3 alone, without profiling.

Profiling didn't improve it.

nizqsut commented 5 years ago

(quoting @gangm's earlier comment and timings above: with NEON it took about 700 ms to detect faces in one 400x600 picture)

How do you use NEON on the Raspberry Pi 3 B+? I have the same problem.

MyraBaba commented 5 years ago

@nizqsut

Compile with profile generation enabled and run the program once so that it writes a profile:

    g++ -DDLIB_NO_GUI_SUPPORT -DNO_MAKEFILE -std=c++11 -O3 -mfpu=neon -fprofile-generate -I.. -lpthread fhog_simd_ex.cpp
    ./a.out

then re-compile using the recorded profile:

    g++ -DDLIB_NO_GUI_SUPPORT -DNO_MAKEFILE -std=c++11 -O3 -mfpu=neon -fprofile-use -I.. -lpthread fhog_simd_ex.cpp