ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

Execution of Inference Workloads on Hikey970 with layer splitting #882

Closed: Shraddhaa1 closed this issue 3 years ago

Shraddhaa1 commented 3 years ago

Hello,

I am currently working on executing inference workloads on the HiKey970. I am trying to split the layers of a network between the CPU and GPU and run the workloads there to reduce inference latency. I am following the repo linked below to run the models with combined CPU and GPU utilization.

https://github.com/adityagupta1089/ComputeLibrary.git

Could you guys help me understand how I can split the layers of the network and assign them to CPU and GPU?

Is there any API in ARM-CL specific to the CPU and GPU backends?

Thanks.

morgolock commented 3 years ago

Hi @Shraddhaa1

The graph API in ACL is experimental and does not support that level of granularity: you cannot specify the backend for each individual layer.

You could experiment with the functions interface, which lets you mix GPU and CPU kernels. Please see the example: https://github.com/ARM-software/ComputeLibrary/blob/master/examples/neoncl_scale_median_gaussian.cpp

Hope this helps.

Shraddhaa1 commented 3 years ago

Hello Sir, Thank you for the reply. That really helped.

Have a good day.

Sincerely, Shraddha Dahal

Shraddhaa1 commented 3 years ago

Hello Sir,

I contacted you a few weeks ago regarding the execution of inference workloads on the HiKey970 with layer splitting. Could you please let me know whether there is any function in ARM-CL that would let me measure the inference time taken by each layer of a neural network?

Thank you.

Sincerely, Shraddha Dahal

developer-compute commented 3 years ago

Hi Shraddha,

If you build ACL with the option benchmark_examples=1, you can then run the network and use the instruments to see how much time each kernel takes:

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./acl_neon+cl_release/ ./benchmark_graph_mobilenet --instruments=OPENCL_TIMESTAMPS_MS --example_args='--target=CL'

This will produce output like below:

OpenCLTimestamps/Now OpenCL: AVG=1121307039 ms
OpenCLTimestamps/Now Wall clock: AVG=1623931494169279 us
OpenCLTimestamps/[end]Conv2d_0+Conv2d_0/BatchNorm/gemm_mm_floating_point_f32_bifrost GWS[8,3136,1] LWS[4,1,1] #2: AVG=1121306920 ms
OpenCLTimestamps/[end]Conv2d_0+Conv2d_0/BatchNorm/im2col3x3_nhwc GWS[2,12544,1] #1: AVG=1121306919 ms
OpenCLTimestamps/[end]Conv2d_0+Conv2d_0/BatchNorm/reshape_to_columns GWS[3,3,3] #0: AVG=1121306918 ms
OpenCLTimestamps/[end]Conv2d_10_depthwise/depthwise+Conv2d_10_depthwise/BatchNorm/depthwise_convolution_3x3_nhwc_stride1 GWS[512,7,7] #36: AVG=1121306986 ms
OpenCLTimestamps/[end]Conv2d_10_pointwise/Conv2D+Conv2d_10_pointwise/BatchNorm/gemm_mm_reshaped_lhs_nt_rhs_t GWS[128,64,1] #40: AVG=1121306995 ms
OpenCLTimestamps/[end]Conv2d_10_pointwise/Conv2D+Conv2d_10_pointwise/BatchNorm/gemm_reshape_lhs_matrix_nt GWS[128,40,1] #39: AVG=1121306989 ms
OpenCLTimestamps/[end]Conv2d_10_pointwise/Conv2D+Conv2d_10_pointwise/BatchNorm/gemm_reshape_rhs_matrix_t GWS[128,128,1] #38: AVG=1121306988 ms
OpenCLTimestamps/[end]Conv2d_10_pointwise/Conv2D+Conv2d_10_pointwise/BatchNorm/reshape_to_columns GWS[512,1,1] #37: AVG=1121306988 ms
OpenCLTimestamps/[end]Conv2d_11_depthwise/depthwise+Conv2d_11_depthwise/BatchNorm/depthwise_convolution_3x3_nhwc_stride1 GWS[512,7,7] #41: AVG=1121306995 ms
OpenCLTimestamps/[end]Conv2d_11_pointwise/Conv2D+Conv2d_11_pointwise/BatchNorm/gemm_mm_reshaped_lhs_nt_rhs_t GWS[128,64,1] #45: AVG=1121307003 ms
OpenCLTimestamps/[end]Conv2d_11_pointwise/Conv2D+Conv2d_11_pointwise/BatchNorm/gemm_reshape_lhs_matrix_nt GWS[128,40,1] #44: AVG=1121306998 ms
OpenCLTimestamps/[end]Conv2d_11_pointwise/Conv2D+Conv2d_11_pointwise/BatchNorm/gemm_reshape_rhs_matrix_t GWS[128,128,1] #43: AVG=1121306997 ms
OpenCLTimestamps/[end]Conv2d_11_pointwise/Conv2D+Conv2d_11_pointwise/BatchNorm/reshape_to_columns GWS[512,1,1] #42: AVG=1121306997 ms
OpenCLTimestamps/[end]Conv2d_12_depthwise/depthwise+Conv2d_12_depthwise/BatchNorm/depthwise_convolution_3x3_nhwc GWS[512,7,7] #46: AVG=1121307004 ms
OpenCLTimestamps/[end]Conv2d_12_pointwise/Conv2D+Conv2d_12_pointwise/BatchNorm/gemm_mm_reshaped_lhs_nt_rhs_t GWS[256,16,1] #50: AVG=1121307012 ms
OpenCLTimestamps/[end]Conv2d_12_pointwise/Conv2D+Conv2d_12_pointwise/BatchNorm/gemm_reshape_lhs_matrix_nt GWS[128,10,1] #49: AVG=1121307008 ms
OpenCLTimestamps/[end]Conv2d_12_pointwise/Conv2D+Conv2d_12_pointwise/BatchNorm/gemm_reshape_rhs_matrix_t GWS[256,128,1] #48: AVG=1121307008 ms
OpenCLTimestamps/[end]Conv2d_12_pointwise/Conv2D+Conv2d_12_pointwise/BatchNorm/reshape_to_columns GWS[512,1,1] #47: AVG=1121307007 ms
OpenCLTimestamps/[end]Conv2d_13_depthwise/depthwise+Conv2d_13_depthwise/BatchNorm/depthwise_convolution_3x3_nhwc_stride1 GWS[1024,4,4] #51: AVG=1121307012 ms
OpenCLTimestamps/[end]Conv2d_13_pointwise/Conv2D+Conv2d_13_pointwise/BatchNorm/gemm_mm_reshaped_lhs_nt_rhs_t GWS[256,16,1] #55: AVG=1121307025 ms
OpenCLTimestamps/[end]Conv2d_13_pointwise/Conv2D+Conv2d_13_pointwise/BatchNorm/gemm_reshape_lhs_matrix_nt GWS[256,10,1] #54: AVG=1121307018 ms
OpenCLTimestamps/[end]Conv2d_13_pointwise/Conv2D+Conv2d_13_pointwise/BatchNorm/gemm_reshape_rhs_matrix_t GWS[256,256,1] #53: AVG=1121307018 ms
OpenCLTimestamps/[end]Conv2d_13_pointwise/Conv2D+Conv2d_13_pointwise/BatchNorm/reshape_to_columns GWS[1024,1,1] #52: AVG=1121307016 ms
OpenCLTimestamps/[end]Conv2d_1_depthwise/depthwise+Conv2d_1_depthwise/BatchNorm/depthwise_convolution_3x3_nhwc_stride1 GWS[32,56,56] #3: AVG=1121306921 ms
OpenCLTimestamps/[end]Conv2d_1_pointwise/Conv2D+Conv2d_1_pointwise/BatchNorm/gemm_mm_floating_point_f32_bifrost GWS[16,28,112] LWS[4,1,1] #5: AVG=1121306924 ms
OpenCLTimestamps/[end]Conv2d_1_pointwise/Conv2D+Conv2d_1_pointwise/BatchNorm/reshape_to_columns GWS[32,1,1] #4: AVG=1121306922 ms
OpenCLTimestamps/[end]Conv2d_2_depthwise/depthwise+Conv2d_2_depthwise/BatchNorm/depthwise_convolution_3x3_nhwc GWS[64,56,56] #6: AVG=1121306926 ms
OpenCLTimestamps/[end]Conv2d_2_pointwise/Conv2D+Conv2d_2_pointwise/BatchNorm/gemm_mm_floating_point_f32_bifrost GWS[32,14,56] LWS[4,1,1] #8: AVG=1121306929 ms
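For reference, a native build that produces these benchmark binaries can be driven by a scons invocation along the following lines. This is only a sketch: the os/arch/build values shown are assumptions matching the arm64 Linux setup discussed in this thread.

```bash
# Sketch: native ACL build with NEON + OpenCL and the benchmark examples enabled.
# benchmark_examples=1 is the option referred to above and must be passed to
# scons itself; the other options are the usual ones for an arm64 Linux build.
scons -j8 os=linux arch=arm64-v8a build=native neon=1 opencl=1 \
      debug=0 benchmark_examples=1 build_dir=release
# The benchmark wrappers (e.g. benchmark_graph_mobilenet) are then emitted
# alongside the test binaries, typically under build/release/tests/.
```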

Another alternative is to use Arm NN's ExecuteNetwork to run a TfLite model with the -e option, which makes the tool output the time consumed by each kernel. For more information about ExecuteNetwork see: https://github.com/ARM-software/armnn/tree/branches/armnn_21_02/tests/ExecuteNetwork
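A minimal invocation might look like the sketch below. Only the -e behaviour is taken from the comment above; the remaining flags and the model name are assumptions, so check ExecuteNetwork --help for the exact options of your Arm NN release.

```bash
# Sketch: profile a TfLite model with Arm NN's ExecuteNetwork.
# -e enables the built-in profiler so per-workload timings are printed;
# the backend list and model path below are placeholders.
./ExecuteNetwork -f tflite-binary \
                 -m mobilenet_v1_1.0_224.tflite \
                 -c GpuAcc -c CpuAcc \
                 -e
```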

Hope this helps.

Shraddhaa1 commented 3 years ago

Hello Sir,

Thank you for the response. I tried the first method you mentioned in the previous email. In the makefile.arm file, I added benchmark_examples:=1 as follows:

BUILD:=native
NEON:=1
OPENCL:=1
ARCH:=arm64-v8a

all: release

CFLAGS:=-std=c++14
benchmark_examples:=1

debug:
    scons -j8 -Q arch=$(ARCH) build=$(BUILD) neon=$(NEON) opencl=$(OPENCL) build_dir=debug debug=1 extra_cxx_flags=$(CFLAGS)

release:
    scons -j8 -Q arch=$(ARCH) build=$(BUILD) neon=$(NEON) opencl=$(OPENCL) build_dir=release debug=0 extra_cxx_flags=$(CFLAGS)

sched:
    g++ -o build/release/examples/graph_temp_scheduler2.o -c -Wno-deprecated-declarations -Wall -DARCH_ARM -Wextra -Wno-unused-parameter -pedantic -Wdisabled-optimization -Wformat=2 -Winit-self -Wstrict-overflow=2 -Wswitch-default -fpermissive -std=gnu++11 -Wno-vla -Woverloaded-virtual -Wctor-dtor-privacy -Wsign-promo -Weffc++ -Wno-format-nonliteral -Wno-overlength-strings -Wno-strict-overflow -Wlogical-op -Wnoexcept -Wstrict-null-sentinel -Wno-implicit-fallthrough -march=armv8-a -Wno-ignored-attributes -Werror -O3 -ftree-vectorize -std=c++14 -D_GLIBCXX_USE_NANOSLEEP -DARM_COMPUTE_CPP_SCHEDULER=1 -DARM_COMPUTE_AARCH64_V8A -DNO_DOT_IN_TOOLCHAIN -DEMBEDDED_KERNELS -Iinclude -I. -I. examples/graph_temp_scheduler2.cpp
    g++ -o build/release/examples/graph_temp_scheduler2 -Wl,--allow-shlib-undefined build/release/examples/graph_temp_scheduler2.o build/release/utils/Utils.o build/release/utils/GraphUtils.o build/release/utils/CommonGraphOptions.o -Lbuild/release -L. -lpthread -larm_compute_graph -larm_compute -larm_compute_core

After that, I gave the commands:

$ make all
$ sudo LD_LIBRARY_PATH=/home/shunya/ComputeLibrary1/build/release ./build/release/examples/graph_mobilenet --instruments=OPENCL_TIMESTAMPS_MS --example_args='--target=CL'

However, I could not see the output with the time taken by each kernel. The library path in the above command contains libarm_compute.so, libarm_compute_core.so, libarm_compute_graph.so, libarm_compute_core-static.a, libarm_compute_graph-static.a and libarm_compute-static.a. Could you please help me see where I am going wrong?

Have a good day.

Sincerely, Shraddha Dahal

developer-compute commented 3 years ago

Hi,

Please try running benchmark_graph_mobilenet instead of graph_mobilenet.
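With the release build discussed above, that would look something like the following sketch; the tests/ location and the library path are assumptions based on the build layout, so adjust them to your tree.

```bash
# Run the benchmark wrapper rather than the plain example binary so that
# --instruments is honoured; paths assume build_dir=release as in the Makefile
# shown earlier.
sudo LD_LIBRARY_PATH=/home/shunya/ComputeLibrary1/build/release \
     ./build/release/tests/benchmark_graph_mobilenet \
     --instruments=OPENCL_TIMESTAMPS_MS --example_args='--target=CL'
```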

Hope this helps.

Shraddhaa1 commented 3 years ago

Hello Sir,

Thank you for the reply. The file benchmark_graph_mobilenet is not generated anywhere inside the Compute Library build. Could you please help me check whether the way I built ARM-CL with benchmark_examples=1 is correct?

Have a good day.

Sincerely, Shraddha Dahal

Shraddhaa1 commented 3 years ago

Hello Sir, Thank you for the reply. I can now observe the time taken by each kernel of the neural networks. Could you please let me know whether there has been any update on how specific layers of a neural network can be assigned to either the CPU or the GPU? With the repo https://github.com/adityagupta1089/ComputeLibrary.git I could mix CPU and GPU usage, but I am trying to pin individual layers to the CPU or the GPU. Could you please help me understand how this can be achieved?

Have a good evening.

Sincerely, Shraddha Dahal

Shraddhaa1 commented 3 years ago

Hello Sir, Thank you for the response. It really helped.

Have a good day.

Sincerely, Shraddha Dahal

Shraddhaa1 commented 3 years ago

Hello Sir,

I was working on obtaining the time taken by each kernel with target CL for the networks available in the development repo, as you suggested in previous emails. Could you please let me know whether there are similar instruments that would report the time taken by each kernel when the benchmark networks are run on the CPU? I am currently using the command below for target CL:

sudo LD_LIBRARY_PATH=/home/shunya/ComputeLibrary/build ./build/tests/benchmark_graph_mobilenet_v2 --instruments=OPENCL_TIMESTAMPS_MS --example_args='--target=CL'

Could you please let me know how I can obtain similar information with target NEON?

Have a good day.

Sincerely, Shraddha Dahal
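The reply that resolved this question is not captured in the thread. A plausible CPU-side equivalent, assuming the test framework's SCHEDULER_TIMER_MS instrument and the --target=NEON example argument, would be:

```bash
# Sketch: per-kernel timing on the CPU path. SCHEDULER_TIMER_MS is assumed to be
# the CPU-side instrument (WALL_CLOCK_TIMER_MS is another framework option);
# the graph target is switched from CL to NEON.
sudo LD_LIBRARY_PATH=/home/shunya/ComputeLibrary/build \
     ./build/tests/benchmark_graph_mobilenet_v2 \
     --instruments=SCHEDULER_TIMER_MS --example_args='--target=NEON'
```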

Shraddhaa1 commented 3 years ago

Hello Sir, Thank you for the response. It worked. I am also currently working on accessing performance monitoring counters such as cache misses, IPC and memory bandwidth, to check how memory-intensive each neural network is. I was using the perf tool for CPU profiling, but the above-mentioned counters are not supported on the HiKey970 board. This can be seen in the image attached below:

[image: perf_result.png]

The parameters were not on the perf list:

[image: perf_list.png]

Could you please let me know whether there are tools that would help me obtain these counters while running a neural network?

Thanks again.

Have a good day.

Sincerely, Shraddha Dahal
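The kind of perf run described above would look roughly like the sketch below. The event names are generic Linux perf aliases rather than anything ACL-specific, and, as the screenshots indicate, they were not exposed by the PMU driver on this HiKey970.

```bash
# Sketch: count cache and instruction events around one benchmark run.
# `perf list` shows which events the board's kernel/PMU actually supports.
sudo perf stat -e cycles,instructions,cache-references,cache-misses \
     ./build/tests/benchmark_graph_mobilenet_v2 --example_args='--target=NEON'
```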

Shraddhaa1 commented 3 years ago

Hello Sir,

Thank you for the reply. I will open an issue on GitHub soon to discuss it. However, I have a question regarding the time taken by each kernel when the benchmark networks are run on the CPU and GPU. Running the benchmarks on both, I observed that the time taken by each kernel is much higher for target CL than for target NEON. I was not expecting such a large difference. Could you please help me understand why the kernels take less time with target NEON?

Thank you.

Sincerely, Shraddha Dahal

HungYangChang commented 3 years ago

@Shraddhaa1 I am also working on using ARM CL with HiKey 970. Would you like to discuss this?

Shraddhaa1 commented 3 years ago

Hello Chang,

Thank you for the reply, and I would like to discuss this further. Have you gone through the combined CPU + GPU utilization work based on ComputeLibrary? I have attached the link to the GitHub repo below:

https://github.com/adityagupta1089/ComputeLibrary.git

Could you please help me understand how the number of images to be processed by the CPU and the GPU is decided? Is it possible to modify the code so that I could assign a specific number of images to each of them?

Have a good weekend.

Sincerely, Shraddha Dahal

HungYangChang commented 3 years ago

Hello @Shraddhaa1

I did go through the GitHub repo you shared above, but my work will focus only on using the CPU. Here is the reference GitHub repo: https://github.com/Ehsan-aghapour/ARMCL-pipe-all.

By the way, I have now moved to Arm NN, because Arm NN is built on top of ARM CL. You can check Arm NN for more info.

Shraddhaa1 commented 3 years ago

Hello Chang,

Thank you for the response. I will go through the GitHub link that you have mentioned, and email you again with some queries.

Have a good day.

Sincerely, Shraddha Dahal

Shraddhaa1 commented 2 years ago

Hello Chang,

I am currently working on the repo that you mentioned in the previous email: https://github.com/Ehsan-aghapour/ARMCL-pipe-all.

I ran the networks separately on the GPU, CPU Big and CPU Little. However, I observed that CPU Big performed better than the GPU on all of the networks. Also, when I split the layers of ResNet50 among the GPU, CPU Big and CPU Little, I could see that the inference time of the GPU is lower than that of CPU Big and CPU Little. The command I used was:

sudo LD_LIBRARY_PATH=/home/shunya/ARMCL-pipe-all-pipe-all/build ./graph_resnet50_all_pipe_sync --threads=4 --threads2=2 --total_cores=6 --partition_point=8 --partition_point2=12 --order=G-L-B --n=50

Could you please help me understand why CPU Big is faster than the GPU when I assign all of the layers to a single device?

Also, for ResNet50, I could see that the total parts are given as 18.

First partition point: 8

Second partition point: 12

Total parts: 18

Should it not be 50 for ResNet50?

Have a good weekend.

Sincerely, Shraddha Dahal