Closed HungYangChang closed 2 years ago
Hi @HungYangChang
I am wondering how to generate the graph that ARM-CL can benchmark. Is there any way to convert TensorFlow files into such graphs? Or is it possible to use TensorFlow lite file (.tflite)?
ACL does not provide any tools to generate a graph example from a tflite file and there is no way to use a tflite model directly in ACL.
The best approach would be to use armnn's ExecuteNetwork: https://github.com/ARM-software/armnn/tree/branches/armnn_21_05/tests/ExecuteNetwork
Using ExecuteNetwork you can run tflite models as shown below:
root@odroid:~/pablo# LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./ExecuteNetwork -m mobilenet_v2_1.0_224_quant.tflite -c GpuAcc --iterations=10 -f tflite-binary -i input -o output | grep Inference
Inference time: 132.36 ms
Inference time: 131.83 ms
Inference time: 131.63 ms
Inference time: 131.93 ms
Inference time: 131.73 ms
Inference time: 131.86 ms
Inference time: 131.77 ms
Inference time: 131.76 ms
Inference time: 131.81 ms
Inference time: 132.32 ms
Thanks for the instant reply. I will try to use ARM NN to run the simulation with tflite file. I still have a few questions:
Does ARM NN provide layer-wise granularity inference information?
Can ARM NN split work across layers and assign different layers to different cores? For instance, for a 4-layer neural network, assign the first 2 layers to 3 big cores, the third layer to another big core, and the last layer to the 4 small cores. I am wondering if ARM NN supports such layer-level splitting settings?
I am still wondering: is it possible to convert a "Tensorflow.py" file into such a graph example?
Thanks for your help in advance :)
@morgolock May I kindly ask if there is any update on my questions above?
Hi @HungYangChang
I am still wondering: is it possible to convert a "Tensorflow.py" file into such a graph example?
ACL does not provide any tools to automatically convert this to a graph example. If you have a tflite model, I would suggest you use armnn's ExecuteNetwork to run it on CpuAcc and get the profiling output.
Can ARM NN split work into different layers, and assign layers into different cores.
Armnn implements its workloads on top of ACL; each workload is divided into multiple smaller workloads and executed on different cores. This is controlled by ACL's CPPScheduler: https://github.com/ARM-software/ComputeLibrary/blob/master/arm_compute/runtime/CPP/CPPScheduler.h#L59
Does ARM NN provide layer-wise granularity inference information?
It's possible to do. If you are using the tflite parser, you can use:
virtual void BackendSelectionHint(Optional<BackendId> backend) = 0;
For more information on this please see: https://github.com/ARM-software/armnn/issues/536
Hope this helps.
@morgolock Thanks for explaining. It really helps :)
Hello @morgolock
Now I was able to use ARM NN + TFlite to run the BERT model.
The command I am using is:
./build/tests/ExecuteNetwork -m ./tflite/mobilebert_1_default_1.tflite -f tflite-binary -c CpuRef -i input_ids,input_mask,segment_ids -o end_logits,start_logits --input-tensor-shape=1,384:1,384:1,384
However, I am still wondering how I can split such a BERT .tflite model into different layers and assign the layers to different cores.
Armnn implements its workloads on top of ACL; each workload is divided into multiple smaller workloads and executed on different cores. This is controlled by ACL's CPPScheduler: https://github.com/ARM-software/ComputeLibrary/blob/master/arm_compute/runtime/CPP/CPPScheduler.h#L59
I have looked at the file you referenced, but I still don't understand how exactly we can assign layers to different cores. Could you elaborate?
I asked in the Arm NN community, but they suggested I ask here.
Q3: Does Arm NN + ACL have a mechanism to tie layer execution to a NEON core? The short answer is no. You can play around with the graph and scheduler as they did here, but this is most definitely not trivial.
I'm sorry I don't have a more detailed answer for you - this stuff is way past my knowledge. You might be better off asking the question in ComputeLibrary
Thanks for your help in advance :)
Hi @HungYangChang
Now I was able to use ARM NN + TFlite to run the BERT model. The command I am using is: ./build/tests/ExecuteNetwork -m ./tflite/mobilebert_1_default_1.tflite -f tflite-binary -c CpuRef -i input_ids,input_mask,segment_ids -o end_logits,start_logits --input-tensor-shape=1,384:1,384:1,384 However, I am still wondering how can I split such BERT.tflite model into different layers, and assign layers into different cores.
This is done automatically by Armnn/ACL if you run with the option -c CpuAcc. A workload is computed concurrently on the CPU cores available on the system.
At runtime the scheduler queries the number of cores available on the system in: https://github.com/ARM-software/ComputeLibrary/blob/master/src/runtime/IScheduler.cpp#L37
This is done in the CpuInfo class in https://github.com/ARM-software/ComputeLibrary/blob/8b3fc248b2f0f1873852c97b69f669b5d77cf55e/src/common/cpuinfo/CpuInfo.cpp#L362
Then the scheduler splits the workload across multiple threads, as shown in https://github.com/ARM-software/ComputeLibrary/blob/master/src/runtime/CPP/CPPScheduler.cpp#L501
The ExecuteNetwork program (out of the box) does not provide a way to specify which workloads run on which backends; you have to run the complete workload on a single backend. If you choose the CpuAcc backend, ACL will compute individual workloads concurrently on multiple cores.
Hope this helps.
Hello @morgolock
Thanks for your explanation. Given that the "ExecuteNetwork program (the leftmost one)" does not provide a way to specify which workloads run on which backends, I have now tried TFLite Delegate + benchmark (the rightmost one). (My reference: the Arm YouTube video "ARM NN Tflite delegate tutorial".)
For TFLite Delegate + benchmark, the command I use is:
taskset -c 0-3 ./benchmark_model \
  --graph=/home/shunya/Micro_SD_shunya/hungyang/tflite_design_space/Export/H-768_S-128_L-4_A-8_I-3072.tflite \
  --warmup_runs=2 --num_threads=4 --num_runs=3 \
  --external_delegate_path="../armnn/build/delegate/libarmnnDelegate.so" \
  --external_delegate_options="backends:CpuAcc"
Say I have 4 CPU cores. The model is BERT with 4 layers (L=4 in H-768_S-128_L-4_A-8_I-3072.tflite). I want to split the 4-layer BERT model across different cores (i.e. layer 1 -> core 1, layer 2 -> core 2, and so forth). Now my question is:
====== Update @ 9/20 24:00 More info: my reference paper: As the paper High-Throughput CNN Inference on Embedded ARM big.LITTLE Multi-Core Processors states: Graph API accompanying ARM-CL facilitates the creation of complex networks. The network is written with a dedicated API as a graph by the user at the frontend. The execution is automatically handled at the backend. Graph implements the layers as nodes that are connected to other nodes in the CNN sequence as defined by the user. Inside each node, the workload is represented as a series of compute kernels. The runtime scheduler sequentially dispatches the kernels in a node and engages the respective processing unit during execution. ARM-CL implements a convolution node with NEON acceleration using im2col (Image to Column) and GEMM (GEneral Matrix Multiplication) kernels. In addition, the parallel nature of the kernels allows their computations to be distributed across multiple cores. This node-level parallelization is implemented in the form of a thread pool that spawns several new threads and distributes the computation of a kernel among them before the scheduler dispatches them for execution.
I will also write an email to the authors asking how they set it up, but could you give more detail on it? Thanks for your great help in advance :)
Problem description:
Hello,
I am currently working on benchmarking the BERT model. I have used TensorFlow Lite before, but it does not seem to support layer-level splitting of work, so I have switched to Arm-CL. I am wondering how to generate the graph that ARM-CL can benchmark. Is there any way to convert TensorFlow files into such graphs? Or is it possible to use a TensorFlow Lite file (.tflite)? I went over issue 863, but I still have no clue how to do this. Could you please give me more explanation?
Thanks in advance :)