XiaoMi / mace

MACE is a deep learning inference framework optimized for mobile heterogeneous computing platforms.
Apache License 2.0

Thoughts on UI rendering problem when running MACE on Mali GPU #222

Closed fengxueem closed 5 years ago

fengxueem commented 5 years ago

System information

Model deploy file (*.yml)

......

Describe the problem

Hi, MACE contributors

I have been running the MACE Android demo on an rk3288 for a while, and I think this is a very promising mobile inference framework for utilizing the GPU and DSP. Thanks for open-sourcing this incredible work. A natural question arises when we use the GPU for time-consuming operations, the same one you put in the FAQ: "Why is the UI getting poor responsiveness when running model with GPU runtime?". I inevitably ran into it too. Of course, this is a very complex issue involving both hardware and software, but since we can only change the software, I just want to ask once more whether there are any upcoming solutions to this poor-FPS problem. Here are two screenshots I recorded with ARM profiling tools (Streamline and the Mali Graphics Debugger) while the Android demo was running on the rk3288:

Here are my thoughts on possible ways to tackle it.

  1. From my limited understanding of the source code: in mace/kernels/opencl/helper.cc, inside the functions TuningOrRun2DKernel and TuningOrRun3DKernel, the OpenCL kernels are modified while tuning a given model so they do not occupy the full GPU for too long. But as far as I know about OpenCL's design, the driver simply consumes as much compute as possible while the command queue is non-empty. So maybe we could delay the enqueuing in the Run function in mace/core/net.cc to leave the GPU idle for 10 ms in every 30 ms period? Or apply the tuning technique (or something like it) that we currently use in TuningOrRun2DKernel to this enqueuing process?
  2. If the OpenCL platform fully supports OpenCL 1.2, maybe we could create sub-devices instead of taking up the whole GPU when declaring an OpenCL Context in the OpenCLRuntime constructor in mace/core/runtime/opencl/opencl_runtime.cc.
  3. One more thing to take care of is mapping the OpenCL image before inference. Based on some observations I made in Streamline, I believe this mapping stage also uses the GPU. Are there any methods to keep it from taking too much GPU time?

All in all, I only have this rk3288, which was released 4 years ago. I hope other developers won't face this UI problem on other devices.

To Reproduce

Just install the Android demo on an rk3288 and you will see poor FPS when MACE switches to GPU mode.

Error information / logs

NONE

Additional context

NONE

llhe commented 5 years ago

@fengxueem Thanks for pointing out this issue. It makes sense. To avoid affecting UI rendering, we split the OpenCL kernels to below 1 ms and expect the underlying driver to handle the scheduling. For example, Adreno drivers support such options: https://github.com/XiaoMi/mace/blob/master/mace/core/runtime/opencl/opencl_extension.h#L20.

Unfortunately, Mali does not have such extensions to allow better preemption by UI rendering tasks. To throttle the OpenCL workload when there is no system-level preemption, sleeping is a reasonable option. You could give it a try and see whether it works.

llhe commented 5 years ago

We haven't tested sub-devices; you can test whether it works. Contributions are welcome.

fengxueem commented 5 years ago

Thanks for replying. I totally agree with your comments. I did try the sub-device idea on the Mali GPU; nothing has worked so far. It seems that even though createSubDevices does exist in libGLES.so on the rk3288 and is called successfully, no sub-devices are actually created. This might be related to the Mali OpenCL implementation.

fengxueem commented 5 years ago

As for sleeping, I tried it with my model. The good news is that it does work: FPS can be kept at around 25-30. However, what I did is a case-by-case method: I manually measured each layer's inference time and split the layers into small groups whose total inference time does not exceed some threshold. These small groups are then enqueued periodically, say every 30 ms. Nothing fancy here.

But UI rendering is still stalled during the "Map OpenCL Image" stage. I am stuck there, with no idea how to map the image into OpenCL memory space while keeping the FPS at a reasonable rate on the rk3288.

BTW, from our observations, TFLite's quantization scheme does not seem very good. The inference time is good and the model size is good, but we see very large accuracy differences when comparing the outputs of the same model run by TensorFlow vs. TFLITE_8BIT_POST_TRAINING_QUANTIZATION, and by TensorFlow vs. TFLITE_8BIT_TRAINING_AWARE_QUANTIZATION. FYI, the quantization schemes used by TensorRT and NNIE show better accuracy. But it's still good to see MACE with 8-bit inference.

llhe commented 5 years ago

@fengxueem For the throttling, it's possible to make it automatic by leveraging the current auto kernel-split code, which already measures the execution time of each kernel.

For the mapping, is it caused by memory bandwidth consumption? It's indeed unexpected, since there should be no memory copy at this stage.