ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.

How to Reduce CL target run Time at initialisation stage (Prepare time) #977

Closed abhajaswal closed 2 years ago

abhajaswal commented 2 years ago

Hello, I am trying to use ACL 20.02.

As you know, when we run any ACL example in a loop (for example MobileNet SSD v1), the first call to graph_run() takes about 1 minute.

From the second call onwards, the graph run takes about 92 ms.

As I understand it, on the first run ACL creates the pipeline, memory, buffers and so on, so it takes time. Is there any way I can reduce this first-run initialisation time?

abhajaswal commented 2 years ago

Dear team,

Could you let me know what the root cause could be? Is it the pipeline creation that takes the extra time?

I need to review my usage of ARMNN further, but if prepare takes this much time then I need to understand why.

Time taken by ARMNN CPU to prepare: 992 ms; ARMNN GPU: 9607 ms

Time taken by the open-source TFLite CPU plugin: 44 ms

morgolock commented 2 years ago

Hi @abhajaswal

Could you please try with the latest release 22.05? There have been some improvements in the startup time since 20.02.

In general, for both CPU and GPU, the first iteration is slower because during this run ACL performs various transformations on the tensors to make sure the memory is accessed in the best way possible. All this additional work is done by the operators in their corresponding ::prepare() methods. For example, look at ClGemmConv2d: https://github.com/ARM-software/ComputeLibrary/blob/main/src/gpu/cl/operators/ClGemmConv2d.cpp#L617
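
To illustrate why only the first iteration pays this cost, here is a minimal sketch of the prepare-once pattern. It is not ACL code: `SketchOperator`, its members and the transposed-weights layout are hypothetical and only mirror the idea of doing a one-off data transformation before the first run.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical operator (NOT an ACL class): computes y = W * x and
// reshapes W once, on first use, so later runs skip that work.
class SketchOperator
{
public:
    SketchOperator(std::vector<float> weights, std::size_t rows, std::size_t cols)
        : _weights(std::move(weights)), _rows(rows), _cols(cols)
    {
        assert(_weights.size() == _rows * _cols);
    }

    // One-off work: store W column-major so run() walks memory contiguously.
    void prepare()
    {
        if (_is_prepared)
        {
            return; // Already done, later iterations skip the transformation
        }
        _reshaped.resize(_weights.size());
        for (std::size_t r = 0; r < _rows; ++r)
        {
            for (std::size_t c = 0; c < _cols; ++c)
            {
                _reshaped[c * _rows + r] = _weights[r * _cols + c];
            }
        }
        _is_prepared = true;
        ++_prepare_count;
    }

    // Every iteration calls prepare(), but only the first one does the work.
    std::vector<float> run(const std::vector<float> &x)
    {
        assert(x.size() == _cols);
        prepare();
        std::vector<float> y(_rows, 0.f);
        for (std::size_t c = 0; c < _cols; ++c)
        {
            for (std::size_t r = 0; r < _rows; ++r)
            {
                y[r] += _reshaped[c * _rows + r] * x[c];
            }
        }
        return y;
    }

    int prepare_count() const
    {
        return _prepare_count;
    }

private:
    std::vector<float> _weights;
    std::size_t        _rows;
    std::size_t        _cols;
    std::vector<float> _reshaped{};
    bool               _is_prepared{ false };
    int                _prepare_count{ 0 };
};
```

Calling `run()` twice on the same object gives identical results while `prepare_count()` stays at 1, which is the behaviour behind a slow first iteration followed by fast ones.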

For the OpenCL backend you also have to add the time to compile the OpenCL kernels at runtime, which occurs during configuration. To mitigate this problem you can save the compiled kernels to disk and restore them at runtime. For more information please see the example: https://github.com/ARM-software/ComputeLibrary/blob/main/examples/cl_cache.cpp
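
The linked example is the reference for how to do this with ACL itself. As a library-independent illustration of the idea, the sketch below serialises a map of kernel name to compiled binary and reads it back. Everything in it is an assumption for illustration: `ProgramCache`, `save_cache`, `restore_cache` and the file layout are hypothetical. In a real OpenCL program the binaries would typically come from `clGetProgramInfo` with `CL_PROGRAM_BINARIES` and be reloaded with `clCreateProgramWithBinary`.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <ios>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory cache: kernel name -> compiled program binary.
using ProgramCache = std::map<std::string, std::vector<unsigned char>>;

// Assumed file layout, repeated per entry:
// [name length : u64][name bytes][binary length : u64][binary bytes]
bool save_cache(const ProgramCache &cache, const std::string &path)
{
    assert(!path.empty());
    std::ofstream out(path, std::ios::binary);
    if (!out)
    {
        return false;
    }
    for (const auto &entry : cache)
    {
        const std::uint64_t name_len = entry.first.size();
        const std::uint64_t bin_len  = entry.second.size();
        out.write(reinterpret_cast<const char *>(&name_len), sizeof(name_len));
        out.write(entry.first.data(), static_cast<std::streamsize>(name_len));
        out.write(reinterpret_cast<const char *>(&bin_len), sizeof(bin_len));
        out.write(reinterpret_cast<const char *>(entry.second.data()),
                  static_cast<std::streamsize>(bin_len));
    }
    return static_cast<bool>(out);
}

bool restore_cache(ProgramCache &cache, const std::string &path)
{
    assert(!path.empty());
    std::ifstream in(path, std::ios::binary);
    if (!in)
    {
        return false; // No cache yet: the caller compiles from source instead
    }
    std::uint64_t name_len = 0;
    while (in.read(reinterpret_cast<char *>(&name_len), sizeof(name_len)))
    {
        std::string name(name_len, '\0');
        in.read(&name[0], static_cast<std::streamsize>(name_len));
        std::uint64_t bin_len = 0;
        in.read(reinterpret_cast<char *>(&bin_len), sizeof(bin_len));
        std::vector<unsigned char> bin(bin_len);
        in.read(reinterpret_cast<char *>(bin.data()), static_cast<std::streamsize>(bin_len));
        if (!in)
        {
            return false; // Truncated or corrupt entry
        }
        cache[name] = std::move(bin);
    }
    // Success only if the loop stopped exactly at the end of the file.
    return in.eof() && in.gcount() == 0;
}
```

The intended flow is: try `restore_cache` at startup, fall back to compiling from source if it fails, then call `save_cache` once so the next launch skips compilation. A cache like this is generally only valid for the device and driver version that produced it.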

Please also be aware that using the OpenCL tuner in ACL can affect startup time too. For more information please see: https://arm-software.github.io/ComputeLibrary/latest/architecture.xhtml#architecture_opencl_tuner

It would be helpful if you could share the complete command you used to run the example.

abhajaswal commented 2 years ago

Thanks! Using cl_cache.bin I am able to reduce the model load time from 20612 ms to 1379 ms.

After cl_cache.bin restore:

Image read time (From file or camera)
    Min: 11 ms  Max: 11 ms  Avg: 11 ms
Image pre-process time
    Min: 1 ms  Max: 1 ms  Avg: 1 ms
Model inference time
    Min: 70 ms  Max: 70 ms  Avg: 70 ms
Model init/deinit time
    Init : 1379 ms
Info: Shutdown time: 61.85 ms

The initial run, when cl_cache.bin was saved for the first time:

------------ PERFORMANCE ------------------
Image read time (From file or camera)
    Min: 12 ms  Max: 12 ms  Avg: 12 ms
Image pre-process time
    Min: 1 ms  Max: 1 ms  Avg: 1 ms
Model inference time
    Min: 69 ms  Max: 69 ms  Avg: 69 ms
Model init/deinit time
    Init : 20612 ms
Info: Shutdown time: 120.91 ms

-rwxr-xr-x 1 root root  2419612 Jan 2 18:24 armnn_clcahae.bin
-rw-r--r-- 1 root root 23018392 Jul 8 2022 od_tflite_model.tflite

I will have to generate this .bin file for each of N models, so won't it take up more memory? Could we not reduce the load time without cl_cache?

Actually, I tried the TFLite GPU delegate. Its load time is also low, and I didn't have to generate a cl_cache.bin for it. So I wonder how the TFLite team optimised the load time, and why I have to do this with ARMNN/ACL.

morgolock commented 2 years ago

Hi @abhajaswal

Glad to hear you improved the load time using prebuilt OpenCL kernels.

> I will have to generate this .bin file for each of N models, so won't it take up more memory?

Yes, you could easily implement deflating/inflating with something like zlib at runtime to reduce the size on disk if that's a concern.

> Could we not reduce the load time without cl_cache?

Unfortunately not without a major rework of the library. At runtime the OpenCL kernels need to be compiled and that is what requires the additional time.

Hope this helps.