ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

How to generate the graph that ARM-CL can benchmark? #921

Closed HungYangChang closed 2 years ago

HungYangChang commented 3 years ago
**Output of 'strings libarm_compute.so | grep arm_compute_version':**
arm_compute_version=v21.05 Build options: {'arch': 'arm64-v8a', 'build': 'native', 'neon': '1', 'opencl': '1', 'build_dir': 'release', 'debug': '0', 'extra_cxx_flags': '-std=c++14'} Git hash=b'55b5b4b2079e14d190d411c41a8549aa36ee0b77'

**Platform:** HiKey970

**Operating System:** Linux

Problem description:

Hello,

I am currently working on benchmarking the BERT model. I have used TensorFlow Lite before, but it does not seem to support layer-level splitting of the work, so I have switched to Arm-CL. I am wondering how to generate the graph that ARM-CL can benchmark. Is there any way to convert TensorFlow files into such graphs? Or is it possible to use a TensorFlow Lite file (.tflite)? I went over issue #863, but I still have no clue how to do this. Could you please give me more explanation?

Thanks in advance :)

morgolock commented 3 years ago

Hi @HungYangChang

I am wondering how to generate the graph that ARM-CL can benchmark. Is there any way to convert TensorFlow files into such graphs? Or is it possible to use TensorFlow lite file (.tflite)?

ACL does not provide any tools to generate a graph example from a tflite file and there is no way to use a tflite model directly in ACL.

The best approach would be to use armnn's ExecuteNetwork: https://github.com/ARM-software/armnn/tree/branches/armnn_21_05/tests/ExecuteNetwork

Using ExecuteNetwork you can run tflite models as shown below:

```shell
root@odroid:~/pablo# LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./ExecuteNetwork -m mobilenet_v2_1.0_224_quant.tflite -c GpuAcc --iterations=10 -f tflite-binary -i input -o output | grep Inference
Inference time: 132.36 ms
Inference time: 131.83 ms
Inference time: 131.63 ms
Inference time: 131.93 ms
Inference time: 131.73 ms
Inference time: 131.86 ms
Inference time: 131.77 ms
Inference time: 131.76 ms
Inference time: 131.81 ms
Inference time: 132.32 ms
```
HungYangChang commented 3 years ago

Thanks for the instant reply. I will try using ARM NN to run inference with a tflite file. I still have a few questions:

  1. Does ARM NN provide layer-wise granularity inference information?

  2. Can ARM NN split the work by layer and assign layers to different cores? For instance, for a 4-layer neural network, assign the first 2 layers to 3 big cores, the third layer to another big core, and the last layer to the 4 small cores. I am wondering whether ARM NN supports such layer-level splitting settings?

  3. I am still wondering: is it possible to convert a "Tensorflow.py" file into such a graph example?

Thanks for your help in advance :)

HungYangChang commented 3 years ago

@morgolock May I kindly ask whether there is any update on my questions above?

morgolock commented 3 years ago

Hi @HungYangChang

I am still wondering: is it possible to convert a "Tensorflow.py" file into such a graph example?

ACL does not provide any tools to automatically convert this into a graph example. If you have a tflite model, I would suggest using armnn's ExecuteNetwork to run it on CpuAcc and get profiling output.

Can ARM NN split work into different layers, and assign layers to different cores?

Armnn implements its workloads on top of ACL; a workload is divided into multiple smaller workloads and executed on different cores. This is controlled by ACL in https://github.com/ARM-software/ComputeLibrary/blob/master/arm_compute/runtime/CPP/CPPScheduler.h#L59
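As a rough illustration of what that splitting looks like conceptually, here is a minimal, self-contained sketch using plain std::thread. This is illustrative only; the real logic lives in CPPScheduler and is considerably more involved (thread pool reuse, workload hints, etc.).

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Self-contained sketch (plain std::thread, NOT the actual CPPScheduler
// code): a single workload's iteration range is split into per-thread
// chunks that run concurrently on the available CPU cores.
std::vector<float> scale_all(const std::vector<float> &in, float factor,
                             unsigned num_threads)
{
    std::vector<float>       out(in.size());
    std::vector<std::thread> pool;
    const std::size_t        chunk = (in.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t)
    {
        const std::size_t begin = std::min(in.size(), t * chunk);
        const std::size_t end   = std::min(in.size(), begin + chunk);
        pool.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                out[i] = in[i] * factor; // this thread's slice of the workload
        });
    }
    for (auto &th : pool)
        th.join();
    return out;
}
```

Each thread gets a disjoint slice, so no synchronization beyond the final join is needed; that is the same property ACL's kernels rely on when their windows are split.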

Does ARM NN provide layer-wise granularity inference information?

It's possible to do. If you are using the tflite parser, you can use:

    virtual void BackendSelectionHint(Optional<BackendId> backend) = 0;

https://github.com/ARM-software/armnn/blob/8a4bd6671d0106dfb788b8c9019f2f9646770f8d/include/armnn/INetwork.hpp#L99

For more information on this please see: https://github.com/ARM-software/armnn/issues/536

Hope this helps.

HungYangChang commented 3 years ago

@morgolock Thanks for explaining. It really helps :)

HungYangChang commented 3 years ago

Hello @morgolock

Now I was able to use ARM NN + TFlite to run the BERT model. The command I am using is:

```shell
./build/tests/ExecuteNetwork -m ./tflite/mobilebert_1_default_1.tflite -f tflite-binary -c CpuRef \
  -i input_ids,input_mask,segment_ids -o end_logits,start_logits \
  --input-tensor-shape=1,384:1,384:1,384
```

However, I am still wondering how I can split such a BERT.tflite model into different layers and assign the layers to different cores.

Armnn implements its workloads on top of ACL; a workload is divided into multiple smaller workloads and executed on different cores. This is controlled by ACL in https://github.com/ARM-software/ComputeLibrary/blob/master/arm_compute/runtime/CPP/CPPScheduler.h#L59

I have looked at the file you pointed to, but I still don't see exactly how we can divide layers across different cores. Could you elaborate?

I have asked in the ARM NN community, but they suggested I ask here.

Q3: Does Arm NN + ACL have a mechanism to tie layer execution to a NEON core? Short answer is no. You can play around with the graph and scheduler as they did here but this is most definitely not trivial.

I'm sorry I don't have a more detailed answer for you - this stuff is way past my knowledge. You might be better off asking the question in ComputeLibrary

Thanks for your help in advance :)

morgolock commented 3 years ago

Hi @HungYangChang

Now I was able to use ARM NN + TFlite to run the BERT model. The command I am using is: ./build/tests/ExecuteNetwork -m ./tflite/mobilebert_1_default_1.tflite -f tflite-binary -c CpuRef -i input_ids,input_mask,segment_ids -o end_logits,start_logits --input-tensor-shape=1,384:1,384:1,384 However, I am still wondering how I can split such a BERT.tflite model into different layers and assign the layers to different cores.

This is done automatically by Armnn/ACL if you run with the option -c CpuAcc. A workload is computed concurrently across the CPU cores available on the system.

At runtime the scheduler queries the number of cores available on the system in: https://github.com/ARM-software/ComputeLibrary/blob/master/src/runtime/IScheduler.cpp#L37

This is done in the class cpuinfo in https://github.com/ARM-software/ComputeLibrary/blob/8b3fc248b2f0f1873852c97b69f669b5d77cf55e/src/common/cpuinfo/CpuInfo.cpp#L362

Then the scheduler splits the workload into multiple threads, as shown in https://github.com/ARM-software/ComputeLibrary/blob/master/src/runtime/CPP/CPPScheduler.cpp#L501

The ExecuteNetwork program (out of the box) does not provide a way to specify which workloads run on which backends; you have to run the complete model on the same backend. If you choose the CpuAcc backend, then ACL will compute individual workloads concurrently on multiple cores.
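For reference, there is no built-in per-layer core assignment, but the low-level mechanism a custom scheduler would need on Linux is thread affinity. Below is a minimal, Linux-only illustrative sketch using pthreads; this is NOT an existing Arm NN or ACL option, just the primitive such a scheme would be built on.

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>

// Linux-only illustration: pin a worker thread to one chosen CPU core.
// This is the mechanism a custom scheduler would need for per-layer
// core placement; it is NOT an existing Arm NN / ACL option.
// Returns the core the worker actually observed, or -1 on failure.
int run_pinned(int core)
{
    int observed = -1;
    std::thread worker([&] {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        // Restrict this thread to the requested core; a hypothetical
        // "layer" computation would run here, then we record the core.
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0)
            observed = sched_getcpu();
    });
    worker.join();
    return observed;
}
```

A per-layer scheme would create one such pinned worker (or pool) per layer, which is essentially what the paper-style graph/scheduler modifications mentioned in the linked armnn issue amount to.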

Hope this helps.

HungYangChang commented 3 years ago

Hello @morgolock

Thanks for your explanation. Given that the ExecuteNetwork program does not provide a way to specify which workloads run on which backends, I am now trying the TFLite delegate + benchmark_model approach instead. (My reference: the Arm YouTube video "Arm NN TfLite Delegate tutorial".)

For the TFLite delegate + benchmark_model, the command I use is:

```shell
taskset -c 0-3 ./benchmark_model \
  --graph=/home/shunya/Micro_SD_shunya/hungyang/tflite_design_space/Export/H-768_S-128_L-4_A-8_I-3072.tflite \
  --warmup_runs=2 --num_threads=4 --num_runs=3 \
  --external_delegate_path="../armnn/build/delegate/libarmnnDelegate.so" \
  --external_delegate_options="backends:CpuAcc"
```

Say I have 4 CPU cores, and the model is BERT with 4 layers (L=4 in H-768_S-128_L-4_A-8_I-3072.tflite). I want to split the 4-layer BERT model across different cores (i.e. layer 1 -> core 1, layer 2 -> core 2, and so forth). Now my questions are:

  1. How can I assign each layer of my work as one single "workload/node"?
  2. How to assign each workload/node to its corresponding core?

====== Update @ 9/20 24:00

More info: my reference paper, "High-Throughput CNN Inference on Embedded ARM big.LITTLE Multi-Core Processors", states:

"The Graph API accompanying ARM-CL facilitates the creation of complex networks. The network is written with a dedicated API as a graph by the user at the frontend. The execution is automatically handled at the backend. Graph implements the layers as nodes that are connected to other nodes in the CNN sequence as defined by the user. Inside each node, the workload is represented as a series of compute kernels. The runtime scheduler sequentially dispatches the kernels in a node and engages the respective processing unit during execution. ARM-CL implements a convolution node with NEON acceleration using im2col (Image to Column) and GEMM (GEneral Matrix Multiplication) kernels. In addition, the parallel nature of the kernels allows their computations to be distributed across multiple cores. This node-level parallelization is implemented in the form of a thread pool that spawns several new threads and distributes the computation of a kernel among them before the scheduler dispatches them for execution."
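The node-level parallelization the paper describes can be sketched very roughly with plain std::thread (illustrative only, not ACL code): a node is a sequence of kernels, the scheduler dispatches them one at a time, and each kernel's iteration space is divided among a pool of worker threads.

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Rough sketch of the paper's node-level parallelization (plain
// std::thread, NOT ACL code): each call to dispatch() runs one kernel,
// splitting its iteration space across num_threads workers.
using Kernel = std::function<void(std::size_t begin, std::size_t end)>;

void dispatch(const Kernel &kernel, std::size_t total, unsigned num_threads)
{
    std::vector<std::thread> pool;
    const std::size_t        chunk = (total + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t)
    {
        const std::size_t begin = std::min<std::size_t>(total, t * chunk);
        const std::size_t end   = std::min<std::size_t>(total, begin + chunk);
        pool.emplace_back([&kernel, begin, end] { kernel(begin, end); });
    }
    for (auto &th : pool)
        th.join();
}
```

A convolution node would then be, conceptually, a dispatch() of a hypothetical im2col kernel followed by a dispatch() of a GEMM kernel: the kernels run sequentially within the node, but each one fans out across the cores, which matches the paper's description.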

I will also email the authors to ask how they set this up, but is it possible to give more detail on it? Thanks for your great help in advance :)