Closed fengxueem closed 5 years ago
@fengxueem Thanks for pointing out this issue. It makes sense. To avoid affecting UI rendering, we split the OpenCL kernels so each one runs below 1ms and rely on the underlying driver to do the scheduling. For example, Adreno drivers support such options: https://github.com/XiaoMi/mace/blob/master/mace/core/runtime/opencl/opencl_extension.h#L20.
Unfortunately, Mali does not have such extensions to allow better preemption by UI rendering tasks. To throttle the OpenCL workload when there is no system-level preemption, sleeping is a reasonable option. You can try it and see whether it works.
We haven't tested sub-devices; you can run a test to see whether they are valid. Contributions are welcome.
Thanks for replying. I totally agree with your comments. I did try the sub-devices idea on the Mali GPU; nothing has worked so far. Even though libGLES.so on rk3288 does export createSubDevices and the call returns successfully, no sub-devices are actually created. This might be related to the OpenCL implementation on Mali.
As for sleeping, I tried it with my model, and the good news is that it does work: FPS can be kept around 25-30. However, what I have done is a case-by-case method: I manually measured each layer's inference time and split the layers into small groups whose total inference time does not exceed some threshold. These small groups are then enqueued at fixed intervals, say every 30ms. Nothing fancy here.
But UI rendering is still stalled during the "Map OpenCL Image" stage. I am stuck there, with no idea how to map the image into OpenCL memory space while keeping the FPS at a reasonable rate on rk3288.
BTW, from our observation, TFLite's quantization scheme does not seem very accurate. Inference time is good and model size is good, but we do see a very large accuracy difference when comparing the outputs of the same model run with (TensorFlow vs. TFLITE_8BIT_POST_TRAINING_QUANTIZATION) and with (TensorFlow vs. TFLITE_8BIT_TRAINING_AWARE_QUANTIZATION). The quantization schemes used by TensorRT and NNIE show better accuracy, FYI. But it's still good to see MACE with 8-bit inference.
@fengxueem For the throttling, it's possible to make it automatic by leveraging the current auto kernel split code, which already measures the execution time of each kernel.
For the mapping, is it caused by memory bandwidth consumption? That is indeed unexpected, since there should be no memory copy in this stage.
System information
Model deploy file (*.yml)
Describe the problem
Hi, MACE contributors
I have been running the MACE Android demo on rk3288 for a while, and I think this is a very promising mobile inference framework for utilizing the GPU and DSP. Thanks for open-sourcing this incredible work. There is a very natural question to ask when we use the GPU for time-consuming operations, like the one in your FAQ: "Why is the UI getting poor responsiveness when running model with GPU runtime?". I also inevitably ran into it. Of course, this is a very complex issue involving both hardware and software. Since we are dealing with SW only, I just want to ask one more time whether there are any upcoming solutions to this poor-FPS problem. Here are two screenshots I recorded with ARM profiling tools (Streamline and Mali Graphics Debugger) while the Android demo was running on rk3288:
Here are my thoughts on possible ways to tackle it.
All in all, I only have this rk3288, released 4 years ago. Hopefully other developers won't face this UI problem on other devices.
To Reproduce
Just install the Android demo on rk3288 and you will see poor FPS when MACE switches to GPU mode.
Error information / logs
NONE
Additional context
NONE