elucideye / acf

Aggregated Channel Feature object detection in C++ and OpenGL ES 2.0 based on https://github.com/pdollar/toolbox
BSD 3-Clause "New" or "Revised" License

Inference time issue #13

Closed SoonminHwang closed 7 years ago

SoonminHwang commented 7 years ago

Thanks for the quick reply!

I tried to switch to a compiler that supports std::regex, such as gcc-4.9 or clang-3.5 with libcxx, but polly.py doesn't seem to support gcc-4.9 (I can't find gcc-4.9 in the list when I run polly.py --help).

With the libcxx toolchain, the build failed with some error messages. Here is the log file.

Anyway, my first goal is to compare running time against Piotr's MATLAB implementation. I commented out the cxxopts code in acf.cpp and measured inference time using the gettimeofday function.
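
For reference, my measurement looks roughly like this (simplified; the actual detection call from acf.cpp is elided):

```cpp
#include <sys/time.h>
#include <cstdio>

// Wall-clock time in seconds via gettimeofday.
static double now()
{
    struct timeval tv;
    gettimeofday(&tv, nullptr);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// ... inside the benchmark:
const double t0 = now();
// ... run the acf::Detector call from acf.cpp here ...
const double t1 = now();
std::printf("inference time: %.3f ms\n", (t1 - t0) * 1e3);
```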

Even though the inference time of the classifier depends heavily on the image content and cascade threshold, something seems wrong. It takes 54 ms for lena512color.png using drishti_face_gray_80x80.cpb. (As you know, it's ~100 ms in Piotr's MATLAB code for a 640x480 image.)

I would expect <1 ms with my GPU (Titan X Pascal).

I think I have turned on the flag to use the GPU: https://github.com/elucideye/acf/blob/17c26894c8d597bb5a33f816167f16d5abe4fe67/CMakeLists.txt#L91

How about the inference time on your machine?

headupinclouds commented 7 years ago

> But polly.py doesn't seem to support gcc-4.9.

I'm using a gcc 5 toolchain in the Travis tests, which is working fine:

https://github.com/elucideye/acf/blob/17c26894c8d597bb5a33f816167f16d5abe4fe67/.travis.yml#L46

If you need gcc 4.9, it should be easy to create one from this recipe:

https://github.com/ruslo/polly/blob/master/gcc-5-pic-hid-sections-lto.cmake

headupinclouds commented 7 years ago

> With the libcxx toolchain, the build failed with some error messages.

I will try adding libcxx to the CI tests: https://github.com/elucideye/acf/pull/14


UPDATE: Clang 3.8 is now building fine in the Travis Ubuntu Trusty (14.04) image:

https://travis-ci.org/elucideye/acf/jobs/287454354

@SoonminHwang :point_up:

headupinclouds commented 7 years ago

> It takes 54 ms for lena512color.png using drishti_face_gray_80x80.cpb. (As you know, it's ~100 ms in Piotr's MATLAB code for a 640x480 image.) I would expect <1 ms with my GPU (Titan X Pascal).

TL;DR: The shader implementation is geared toward optimized feature computation on mobile GPUs. The detection step itself doesn't map well to simple GLSL processing, so the features must be transferred from GPU to CPU (slow) for CPU-based detection (fast). On a desktop, the full process could be executed on the GPU.

The console app doesn't currently use the OpenGL ES 2.0 shader acceleration, so I'm sure you are running a CPU-only benchmark. I recently migrated this code from drishti for general-purpose use and improvements, and it will be added to the Hunter package manager once it is cleaned up a little more.

I originally needed this for mobile platforms, so OpenGL ES 2.0 was the lowest common denominator that could support both iOS and Android. The main drawback of this approach is the 8-bit channel output limitation, which can be mitigated with 4x8 -> 32-bit packing (see the sketch below). Caveat: due to that limitation, the GLSL output is currently only an approximation of the CPU floating-point output, and improving it will incur a measurable performance hit.

For desktop use, it is probably better to write this in OpenCL or something higher level that doesn't have these limitations. (I recently came across Halide, which seems like an excellent path for cross-platform optimization, but I currently have no experience with it.)
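
To illustrate the packing idea: it amounts to fixed-point encoding of one float across the four 8-bit RGBA channels of an output texel. Conceptually (a CPU-side sketch of the encoding only, not shader code from this repo):

```cpp
#include <cmath>
#include <cstdint>

// Pack a float in [0, 1) into four 8-bit channels (the RGBA bytes of one
// output pixel). Each channel holds the next 8 bits of the fixed-point value.
static void pack(float x, uint8_t rgba[4])
{
    for (int i = 0; i < 4; ++i)
    {
        x *= 256.0f;
        const float v = std::floor(x);
        rgba[i] = static_cast<uint8_t>(v);
        x -= v; // remainder goes into the next, finer channel
    }
}

// Recover the value from the four channels.
static float unpack(const uint8_t rgba[4])
{
    return rgba[0] / 256.0f + rgba[1] / 65536.0f +
           rgba[2] / 16777216.0f + rgba[3] / 4294967296.0f;
}
```

The GLSL equivalent typically uses fract() and a dot product with the matching scale factors.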

The GLSL code is all in this file https://github.com/elucideye/acf/blob/master/src/lib/acf/GPUACF.h, which is currently separate from the ACF detection class. To use that class, you will need to manage your own OpenGL context. It uses https://github.com/hunter-packages/ogles_gpgpu to manage a shader pipeline that computes the features.

The expensive part on mobile platforms is the GPU->CPU transfer, so one frame of latency is added to the pipeline: ACF pyramids are computed on the GPU for frame N ("for free"), and they are available for processing at time N+1 with no added CPU cost. In this workflow, the precomputed ACF pyramid is passed in for detection in place of the RGB image. The face detection/search on the precomputed pyramids then runs in a few milliseconds on an iPhone 7. For pedestrian detection, the extra frame of latency might not be acceptable. The SDK call is shown here:

https://github.com/elucideye/acf/blob/17c26894c8d597bb5a33f816167f16d5abe4fe67/src/lib/acf/ACF.h#L392

There is a small unit test that illustrates what the basic process looks like: 1) compute acf::Detector::Pyramid objects on the GPU, and then 2) feed them to acf::Detector:

https://github.com/elucideye/acf/blob/17c26894c8d597bb5a33f816167f16d5abe4fe67/src/lib/acf/ut/test-acf.cpp#L444-L464

The above test uses the Hunter aglet package to manage the OpenGL context (just glfw for PC builds).
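
In outline, the flow in that test looks something like this (a simplified sketch; the OpenGL context setup, pipeline construction, and exact signatures are elided, so see the linked test for the real calls):

```cpp
#include <opencv2/core.hpp>

#include <acf/ACF.h>    // acf::Detector (header paths approximate)
#include <acf/GPUACF.h> // OpenGL ES 2.0 ACF shader pipeline

#include <vector>

// Sketch only: assumes the shader pipeline has already processed the
// current frame (fed in as a texture/pixel buffer through ogles_gpgpu).
void detectOnGpuPyramid(acf::Detector& detector, ogles_gpgpu::ACF& gpuACF)
{
    // 1) Read the GPU-computed channels back and fill a CPU-side pyramid
    //    (the expensive GPU->CPU step).
    acf::Detector::Pyramid pyramid;
    gpuACF.fill(pyramid);

    // 2) Run the fast CPU detection on the precomputed pyramid, in place
    //    of the RGB image.
    std::vector<cv::Rect> objects;
    detector(pyramid, objects);
}
```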

That test could be used for some initial benchmarks, and perhaps it could be added to the console application for additional testing. I'll try to take a look in the next few days, unless you want to try it sooner.

It would be nice to automate the GPGPU processing at the API level. There was already an issue for this in drishti: https://github.com/elucideye/drishti/issues/373. I'll migrate it to the new repository. A cv::UMat OpenCL interface would be cool, but I'm primarily interested in mobile platforms, where this isn't really an option.

headupinclouds commented 7 years ago

As a temporary GPU benchmark, I've added a timer class that can be enabled with an option in the unit test. This is currently sitting in this PR: https://github.com/elucideye/acf/pull/16

See `#define ACF_LOG_GPU_TIME 1` here: https://github.com/elucideye/acf/pull/16/commits/0b95cb712648ea92a0958dfb82927d503d549429#r144582514

That will print the GPGPU pyramid compute time (shaders, read, and "fill" to memory):

```
ACF::fill(): 512 0.00225127
```

as well as the detection time:

```
acf::Detector::operator():0.00207684
```
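
The timer itself is conceptually just a scoped wall-clock logger, something like this (a sketch; the name and details of the actual class in the PR may differ):

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <utility>

// Prints "<name><elapsed seconds>" when the enclosing scope exits.
class ScopeTimeLogger
{
public:
    explicit ScopeTimeLogger(std::string name)
        : m_name(std::move(name))
        , m_start(std::chrono::high_resolution_clock::now())
    {
    }

    ~ScopeTimeLogger()
    {
        const std::chrono::duration<double> elapsed =
            std::chrono::high_resolution_clock::now() - m_start;
        std::printf("%s%f\n", m_name.c_str(), elapsed.count());
    }

private:
    std::string m_name;
    std::chrono::high_resolution_clock::time_point m_start;
};

// Usage (compiled in when ACF_LOG_GPU_TIME is enabled):
// { ScopeTimeLogger log("acf::Detector::operator():"); /* run detection */ }
```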

On my desktop, these each take about 2 milliseconds (2 + 2 = 4 ms) with a GeForce GTX TITAN X. The detection time is comparable on my 2013 MacBook, but the fill operation is more like 20 ms (nearly all of it spent in the texture read). If we use the PBO reads in ogles_gpgpu in the OpenGL ES 3.0 compatibility mode (or on desktop), that should be faster. I'll add an option for it.
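
For reference, the PBO path issues glReadPixels into a pixel pack buffer so the transfer can complete in the background, and maps the buffer later (a generic OpenGL ES 3.0 sketch, not the ogles_gpgpu implementation):

```cpp
#include <GLES3/gl3.h>

#include <cstring>

// Kick off an asynchronous readback of the current framebuffer into a PBO.
GLuint startAsyncRead(int width, int height)
{
    GLuint pbo = 0;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);
    // With a PIXEL_PACK_BUFFER bound, glReadPixels returns immediately and
    // the driver copies into the PBO asynchronously.
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    return pbo;
}

// Later (e.g., on the next frame), map the PBO and copy the pixels out.
void finishAsyncRead(GLuint pbo, int width, int height, void* dst)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    const void* src = glMapBufferRange(
        GL_PIXEL_PACK_BUFFER, 0, width * height * 4, GL_MAP_READ_BIT);
    if (src != nullptr)
    {
        std::memcpy(dst, src, static_cast<size_t>(width) * height * 4);
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}
```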

This isn't a proper benchmark, but it can provide some info in the short term.

```
1: Test timeout computed to be: 9.99988e+06
1: [==========] Running 7 tests from 1 test case.
1: [----------] Global test environment set-up.
1: [----------] 7 tests from ACFTest
1: [ RUN      ] ACFTest.ACFSerializeCereal
1: [       OK ] ACFTest.ACFSerializeCereal (334 ms)
1: [ RUN      ] ACFTest.ACFDetectionCPUMat
1: [       OK ] ACFTest.ACFDetectionCPUMat (106 ms)
1: [ RUN      ] ACFTest.ACFDetectionCPUMatP
1: [       OK ] ACFTest.ACFDetectionCPUMatP (84 ms)
1: [ RUN      ] ACFTest.ACFChannelsCPU
1: [       OK ] ACFTest.ACFChannelsCPU (89 ms)
1: [ RUN      ] ACFTest.ACFPyramidCPU
1: [       OK ] ACFTest.ACFPyramidCPU (101 ms)
1: [ RUN      ] ACFTest.ACFPyramidGPU10
1: ACF::fill(): 512 0.00246045
1: [       OK ] ACFTest.ACFPyramidGPU10 (305 ms)
1: [ RUN      ] ACFTest.ACFDetectionGPU10
1: ACF::fill(): 512 0.00225127
1: acf::Detector::operator():0.00207684
1: [       OK ] ACFTest.ACFDetectionGPU10 (153 ms)
1: [----------] 7 tests from ACFTest (1173 ms total)
1:
1: [----------] Global test environment tear-down
1: [==========] 7 tests from 1 test case ran. (1173 ms total)
1: [  PASSED  ] 7 tests.
1/1 Test #1: ACFTest ..........................   Passed    1.19 sec
```

headupinclouds commented 7 years ago

@SoonminHwang : I hope this answers your question. I'm going to close this for now. Since one of the strong advantages of this package is size + speed, it might make sense to add some targeted google/benchmark tests.