elucideye / acf

Aggregated Channel Feature object detection in C++ and OpenGL ES 2.0 based on https://github.com/pdollar/toolbox
BSD 3-Clause "New" or "Revised" License

acf performance on android #53

Closed ygyin-ivy closed 6 years ago

ygyin-ivy commented 6 years ago

I analyzed the dependencies of the acf GLDetector with the Android Studio Clang compiler, and succeeded in compiling GLDetector through JNI.

But when I tested GLDetector's performance on lena512color.png, it took 1750.45 ms.

My compile options are:

LOCAL_ARM_NEON := true
LOCAL_CFLAGS += -O3 -mfloat-abi=softfp -mfpu=neon -march=armv7

How can I improve performance to reach 30 fps (about 33 ms per frame)?

headupinclouds commented 6 years ago

I analyzed the dependencies of the acf GLDetector with the Android Studio Clang compiler, and succeeded in compiling GLDetector through JNI.

This project can also be used directly through Hunter in the latest Android Studio, which introduces support for standard CMake based on this sample: https://github.com/forexample/android-studio-with-hunter. Once that is out of beta I will try to add an example.

But when I tested GLDetector's performance on lena512color.png, it took 1750.45 ms.

Wow. That is slow. The GLDetector class is a simple utility to help support end-to-end testing; it will probably be slower than the CPU path if used as is. I'm not sure which Android device you are using, but there are a couple of things you can do to improve throughput in your application.

The main bottleneck when using the GPU-based ACF pyramid is the GPU-to-CPU transfer -- in particular the glFlush() required to synchronize the shader output for retrieval. If you introduce some latency, you can create a pipeline in which one thread is always queueing the current frame for pyramid computation on the GPU while another retrieves the pyramid for the previous frame and runs the multi-scale object detection. This way you can bury the transfer overhead in the frame time.

This requires a little bit of setup. It is probably worth trying to add some high level constructs that help enable this at the application level in this repository. There is one example of this in the drishti repository here:

https://github.com/elucideye/drishti/blob/f5742c5cc737f550e57f5cb832c2ffccfe08123d/src/lib/drishti/hci/FaceFinder.cpp#L441-L523

The other bottleneck on Android is the GPU-to-CPU transfer itself. On iOS there is an efficient texture cache mechanism to support this (which also supports parallel transfers). For Android there isn't anything so nice, but there are a few options. On older devices you can use dlopen/dlsym mappings to the GraphicBuffer. This is all implemented in ogles_gpgpu, but unfortunately recent Android releases have blocked those dlopen calls entirely. If you can use OpenGL ES 3.0 or greater, then you can use glMapBufferRange for this (possibly asynchronously), which is at least faster than glReadPixels. The latter is also supported as an option in the ogles_gpgpu lib via build-time CMake arguments. If you want some pointers for that, let me know.

The other thing you can (and should) do, for "selfie" type applications at least, is to avoid running the detection at scales that aren't needed. If the object detector uses a window of size 64x64, and your input video frames are VGA or higher, then you can significantly reduce the resolution of the lowest pyramid level used in the search -- you most likely aren't interested in finding 64x64 faces in a 640x480 (or higher resolution) frame. This will have a significant impact on the object detection step, and it will also reduce the ACF pyramid computation and transfer overhead.

I'm traveling now and don't have an Android device with me, but I can try to post some benchmarks for a Samsung Galaxy S7 in a few days.

headupinclouds commented 6 years ago

It is probably worth trying to add some high level constructs that help enable this at the application level in this repository

I think this can be supported in the library fairly generically with an acf::AsyncDetector class. I've added #54 to track this and will try to add something soon based on the code I mentioned above.

ygyin-ivy commented 6 years ago

Thanks, I will try those.

headupinclouds commented 6 years ago

Note: a sample application was added in #55; this will likely be merged into the installed library in a follow-up after further tuning (see GPUDetectionPipeline).

headupinclouds commented 6 years ago

How can I improve performance to reach 30 fps (about 33 ms per frame)?

#55 is merged and includes a pipeline sample that illustrates more efficient CPU + GPU scheduling, which should help achieve higher frame rates.

See: https://github.com/elucideye/acf/blob/3862fe398c8acb567d0f0b55bf170c1f17bf65d3/src/app/pipeline/pipeline.cpp#L301

I've added a test using gauze that helps automate the installation and execution of cross platform tests, including iOS and Android devices. If you plug in an Android or iOS device and run a command like this:

polly.py --toolchain android-ndk-r10e-api-19-armeabi-v7a-neon-hid-sections --fwd ACF_BUILD_TESTS=ON HUNTER_CONFIGURATION_TYPES=Release --config Release --test --verbose --config Release --install

It should finish with a log reporting speed in FPS for your device with the 512x512 lena image, a 64x64 classifier, and a minimum face size of 128 pixels.

2: [17:28:47.387 | thread:26420 | acf-pipeline | info]: OBJECTS[57] = 1
2: [17:28:47.432 | thread:26420 | acf-pipeline | info]: OBJECTS[58] = 1
2: [17:28:47.461 | thread:26420 | acf-pipeline | info]: OBJECTS[59] = 1
2: [17:28:47.495 | thread:26420 | acf-pipeline | info]: OBJECTS[60] = 1
2: [17:28:47.527 | thread:26420 | acf-pipeline | info]: OBJECTS[61] = 1
2: [17:28:47.555 | thread:26420 | acf-pipeline | info]: OBJECTS[62] = 1
2: [17:28:47.586 | thread:26420 | acf-pipeline | info]: OBJECTS[63] = 1
2: [17:28:47.621 | thread:26420 | acf-pipeline | info]: OBJECTS[64] = 1
2: [17:28:47.621 | thread:26420 | acf-pipeline | info]: ACF FULL: FPS=30.2556
2: [17:28:47.622 | thread:26420 | acf-pipeline | info]:   ACF STAGE complete = 0.0311455
2: [17:28:47.622 | thread:26420 | acf-pipeline | info]:   ACF STAGE detect = 0.0223435
2: [17:28:47.622 | thread:26420 | acf-pipeline | info]:   ACF STAGE read = 0.0198677
2: 0
2: *** END ***
2: Done
2/2 Test #2: AcfPipelineTest ..................   Passed    3.82 sec

That simply runs the acf-pipeline console app on your device in /data/local/tmp/acf/android-ndk-r10e-api-19-armeabi-v7a-neon-hid-sections/bin/acf-pipeline using the parameters shown here (you can change them to something suitable for your application fairly easily):

https://github.com/elucideye/acf/blob/3862fe398c8acb567d0f0b55bf170c1f17bf65d3/src/app/pipeline/CMakeLists.txt#L57-L66

In this test the performance is about 30 fps on a Samsung Galaxy S7. This simply repeats the same input frame 64 times, but the benchmark is probably fairly reasonable for the GPU pipeline, since the GPU can't really benefit from caching the way a CPU test would. At 30 fps with selfie video you probably don't need to run detection on every frame either. In the Xcode profiler the CPU usage with this pipeline is pretty low. You can also try running with the OpenGL ES 3.0 PBO-based transfers by adding ACF_OPENGL_ES3=ON, which should be more efficient than the glReadPixels() calls in the default OpenGL ES 2.0 builds.

polly.py --toolchain android-ndk-r10e-api-19-armeabi-v7a-neon-hid-sections --fwd ACF_BUILD_TESTS=ON HUNTER_CONFIGURATION_TYPES=Release ACF_OPENGL_ES3=ON --config Release --test --verbose --config Release --install --reconfig
ygyin-ivy commented 6 years ago

Thanks, it's an excellent update. Starred.

ygyin-ivy commented 6 years ago

I tested GPUDetectionPipeline on a mobile device (Samsung S5); the bitmap is lena512color-nodpi.

arg: acf::GPUDetectionPipeline(detector, cv::Size(512,512), 3, 0, 128);

The average cost is about 22 ms, which is quite fast. The pipeline is an excellent idea.

headupinclouds commented 6 years ago

Great, I'm glad it is working. I'll aim to get a first version of GPUDetectionPipeline finalized in the public SDK for general use this weekend. I've parallelized the detection stage, which helps quite a bit on the Samsung S7 (at least), and added a benchmark mode to pipeline.cpp so that the CPU-to-GPU input frame overhead is avoided, since that should be handled more efficiently in an Android application with a SurfaceTexture. With those changes the test achieves 43 fps, and the detection step itself drops to about 13 milliseconds.

2: [11:08:44.436 | thread:29445 | acf-pipeline | info]: OBJECTS[254] = 1
2: [11:08:44.457 | thread:29445 | acf-pipeline | info]: OBJECTS[255] = 1
2: [11:08:44.490 | thread:29445 | acf-pipeline | info]: OBJECTS[256] = 1
2: [11:08:44.490 | thread:29445 | acf-pipeline | info]: ACF FULL: FPS=43.1771
2: [11:08:44.490 | thread:29445 | acf-pipeline | info]:   ACF STAGE complete = 0.0226745
2: [11:08:44.490 | thread:29445 | acf-pipeline | info]:   ACF STAGE detect = 0.01344
2: [11:08:44.490 | thread:29445 | acf-pipeline | info]:   ACF STAGE read = 0.0174438

With that configuration an iPhone7 runs the test at 61 fps with a detection cost of about 4.6 milliseconds:

2: [11:43:54.437 | thread:15933466093621603135 | acf-pipeline | info]: OBJECTS[255] = 1
2: [11:43:54.454 | thread:15933466093621603135 | acf-pipeline | info]: OBJECTS[257] = 1
2: [11:43:54.454 | thread:15933466093621603135 | acf-pipeline | info]: ACF FULL: FPS=61.1927
2: [11:43:54.460 | thread:15933466093621603135 | acf-pipeline | info]:   ACF STAGE complete = 0.0154037
2: [11:43:54.460 | thread:15933466093621603135 | acf-pipeline | info]:   ACF STAGE detect = 0.00455919
2: [11:43:54.461 | thread:15933466093621603135 | acf-pipeline | info]:   ACF STAGE read = 0.00842818