ARM-software / armnn

Arm NN ML Software. The code here is a read-only mirror of https://review.mlplatform.org/admin/repos/ml/armnn
https://developer.arm.com/products/processors/machine-learning/arm-nn
MIT License

Another error on another model and by now it's two out of three :( #762

Open federicoparra opened 3 months ago

federicoparra commented 3 months ago

feather_lite.zip

All three versions of this very simple model (which brightens images) work fine with the default interpreter but fail with the Arm NN delegate (latest version) with the error:

Error: An error occurred when preparing the network workloads: ComparisonQueueDescriptor: Tensors input_0 & input_1 must have the same number of dimensions in order to be broadcasted

The error makes no sense, as the model has only one input (an image, 400x600).

More broadly, I'm a little annoyed with this Arm NN delegate: you advertise the library as falling back to the standard interpreter to support all ops, and yet this is the second model I've run that works fine with the standard interpreter but fails with the Arm NN delegate.

Please make it so that if any part of the model fails under Arm NN, that part falls back to the standard interpreter, so there is genuinely full support!

I attached the models so that you can try them yourself.

federicoparra commented 3 months ago

another model with problems https://github.com/ARM-software/armnn/issues/758

Colm-in-Arm commented 3 months ago

Hello Federico,

Unfortunately I'm not permitted to test using your model without knowing the source and license under which it is shared. However, from your description of the problem it appears that whichever backend you are targeting says the layer is valid but then subsequently prevents a workload being created to execute the layer.

Can you tell me which backend you are trying to use? You could also try using the CpuRef backend. It will not be performant but if it runs the layer it will give a strong indication that it is the backend that's at fault.

Colm.
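For reference, selecting CpuRef through the delegate options would look roughly like the sketch below. This is illustrative only: the library path and model filename are placeholders, and the option names mirror the snippets posted later in this thread.

```python
from tensorflow import lite as tflite

# Placeholder paths: point these at your own Arm NN delegate build and model.
armnn_delegate = tflite.experimental.load_delegate(
    library="libarmnnDelegate.so",
    options={"backends": "CpuRef", "logging-severity": "info"})

interpreter = tflite.Interpreter(
    model_path="feather_lite.tflite",
    experimental_delegates=[armnn_delegate])
interpreter.allocate_tensors()
```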

federicoparra commented 3 months ago

First of all, please be more precise about exactly what you need in order to be able to execute the models I'm sharing here and in https://github.com/ARM-software/armnn/issues/758. In both cases these are open-source models: the one for the current bug report is located at https://github.com/aselsan-research-imaging-team/flight-net and the one for https://github.com/ARM-software/armnn/issues/758 is located at https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b

Of course, those links take you to the PyTorch versions of the models because, as you well know, almost no one works with TFLite (or TensorFlow, for that matter), so the models I'm sharing are my conversions. If you "are not permitted to test using" a user's conversion of a PyTorch model and require officially released TFLite models, then Arm NN is basically useless, since no one provides such models directly for AI development.

Regarding the backend, I tried CpuAcc and GpuAcc, both separately and together (in both orders).

Neither the ARM64 release nor my own compilation of Arm NN supports CpuRef (at least not when run on an Orange Pi 5B), so I can't test that unless you give me a different way.

Please check the links I just shared to confirm that both models (the one from this bug report and the one from my other bug report) are openly licensed, and please try them yourselves.

Arm NN, when it works, is really performant, but if you can't make it work with more models it will fall into oblivion.

Thank you, Federico

Colm-in-Arm commented 3 months ago

Hello Federico,

Thank you for pointing me to the source of the models. I can see they are Creative Commons, which is good. However, your converting them to TFLite is considered a derivative work. Two options:

If you want to try CpuRef you can find a binary release of Arm NN for aarch64 platforms that includes the CpuRef here

Colm.

federicoparra commented 3 months ago

Hi, thank you for the detail.

I want to assert here that the converted tflite model I shared above, which is based on https://github.com/aselsan-research-imaging-team/flight-net, was shared here under license CC BY-NC-SA 4.0

I'll go now to the other bug report (https://github.com/ARM-software/armnn/issues/758) and do the same so that you can test that model as well.

I'll try the CpuRef ASAP and report back as well.

All my best, Federico

Colm-in-Arm commented 3 months ago

Hello Federico,

I tried feather_32bit.tflite and it shows the same ComparisonQueueDescriptor error in CpuRef as you saw with CpuAcc. I'll start investigating it now.

Colm.

Colm-in-Arm commented 3 months ago

Hello Federico,

I have a review up to resolve the first problem I encountered: https://review.mlplatform.org/c/ml/armnn/+/11379

You can cherry pick this patch on top of Arm NN main branch if you want to experiment with it. The fault came down to inconsistent handling of broadcast in the Greater_Equal layer.

I have verified byte level accuracy of results between TfLite runtime and CpuRef backends.

However, there appears to be a further problem with the CpuAcc and GpuAcc backends that I'm investigating now.

Colm.

Colm-in-Arm commented 2 months ago

Hello Federico,

I've resolved the problems in CpuAcc and GpuAcc now. This model contains kTfLiteBool which is a data type we don't often encounter.

I've updated the review, https://review.mlplatform.org/c/ml/armnn/+/11379, to include all the necessary changes.

Colm.

federicoparra commented 2 months ago

Amazing! I'll try this today! Please look at the other case #758 whenever you have the time, I added the license for that model as well in the thread.

federicoparra commented 2 months ago

So weird! I applied the patch to the latest pull of the main branch of Arm NN using git apply 48eefee.diff (after downloading your patch), and I checked the files locally to confirm the patch was indeed applied - it was. I then went to the build tool script and built Arm NN, and... somehow I get exactly the same error!

RuntimeError: TfLiteArmnnDelegate: Exception (TfLiteArmnnDelegate: Network could not be loaded: An error occurred when preparing the network workloads: ComparisonQueueDescriptor: Tensors input_0 & input_1 must have the same number of dimensions in order to be broadcasted) caught from LoadNetwork.

what could it possibly be? I'm sure the build tool is somehow not using the patched files...

Colm-in-Arm commented 2 months ago

You do need to check your build. The change to AddBroadcastReshapeLayer.hpp will resolve that specific error.

Colm.

federicoparra commented 2 months ago

Oh, as I said, I applied the patch and then built, so all the files, including the one you just mentioned, have the changes you made.

Could it be that the build tool uses a specific branch or another source? Otherwise I'm lost :(

Colm-in-Arm commented 2 months ago

Running build-armnn.sh should reuse the version of Arm NN previously cloned by setup-armnn.sh. Any changes you've made to the cloned repository should be built.

I've no idea why you're not seeing the changes. If you deliberately break the code and rebuild does the build fail?

You could also try the --clean option to force a clean build?

Colm.

federicoparra commented 2 months ago

OK, I see the issue now: I was cloning the main branch, applying the patch, then going to its build-tool/script directory and running setup-armnn and build-armnn, with the understanding that it would use the version of the source code that was already cloned. I hadn't noticed that a new source folder was created by setup-armnn and that it was THAT folder that needed patching. I applied the patch to that folder after doing a pull, then built, and... I still have errors :(

I'm using this Python code to call it, with the feather_16bit.tflite model:

armnn_delegate = tflite.experimental.load_delegate(
    library="/home/federico/Documents/code/ARM/aarch64_build/delegate/libarmnnDelegate.so",
    options={"backends": "GpuAcc", "logging-severity": "trace"})

# Delegates/Executes all operations supported by Arm NN to/with Arm NN
interpreter = tflite.Interpreter(
    model_path="../models/feather_16bit.tflite",
    experimental_delegates=[armnn_delegate])

These are the outputs (log level trace) with each backend used independently:


GpuAcc:

Error: ArmNN Failed to visit node with error: in data_size_from_type ./arm_compute/core/utils/DataTypeUtils.h:67: Invalid data type (this error repeats hundreds of times)

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.61 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: ConvertConstDequantisationLayersToConstLayersImpl::ReplaceConstDequantisationLayer() Info: constantInfo datatype:Float16inputDequantizeInfo datatype:Float16outputDequantizeInfo datatype:Float32 Info: ConvertConstDequantisationLayersToConstLayersImpl:: Converting FP16 -> FP32 (this message repeats many times), then: INFO: TfLiteArmnnDelegate: Added backend GpuAcc

and then:

RuntimeError                              Traceback (most recent call last)
Cell In[21], line 11
      7 armnn_delegate = tflite.experimental.load_delegate(
            library="/home/federico/Documents/code/ARM/aarch64_build/delegate/libarmnnDelegate.so",
      8     options={"backends": "GpuAcc", "logging-severity":"trace"})
     10 # Delegates/Executes all operations supported by Arm NN to/with Arm NN
---> 11 interpreter = tflite.Interpreter(model_path="../models/feather_16bit.tflite", experimental_delegates=[armnn_delegate])
     13 interpreter.allocate_tensors()
     15 # Get input and output tensors.

File ~/miniconda3/envs/mlc-chat-venv/lib/python3.11/site-packages/tensorflow/lite/python/interpreter.py:513, in Interpreter.__init__(self, model_path, model_content, experimental_delegates, num_threads, experimental_op_resolver_type, experimental_preserve_all_tensors, experimental_disable_delegate_clustering)
    511 self._delegates = experimental_delegates
    512 for delegate in self._delegates:
--> 513 self._interpreter.ModifyGraphWithDelegate(
    514     delegate._get_native_delegate_pointer())  # pylint: disable=protected-access
    515 self._signature_defs = self.get_signature_list()
    517 self._metrics = metrics.TFLiteMetrics()

RuntimeError: TfLiteArmnnDelegate: Exception (Failed to assign a backend to each layer) caught from optimize.

Info: ConvertConstDequantisationLayersToConstLayersImpl::ReplaceConstDequantisationLayer() Info: constantInfo datatype:Float16inputDequantizeInfo datatype:Float16outputDequantizeInfo datatype:Float32 Info: ConvertConstDequantisationLayersToConstLayersImpl:: Converting FP16 -> FP32 Info: ConvertConstDequantisationLayersToConstLayersImpl::ReplaceConstDequantisationLayer() (this last message repeats a lot), and then:

Info: Optimize ArmnnSubgraph time: 3.50 ms Info: Load ArmnnSubgraph time: 348.06 ms Info: Overall ArmnnSubgraph creation time: 352.47 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.07 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Warning: WARNING: Layer of type Convolution2d is not supported on requested backend GpuAcc for input data type Float32 and output data type Float32 (reason: ArmNN ClDepthwiseConv2dWorkload does not support non constant bias.), falling back to the next backend.
Warning: ERROR: Layer of type Convolution2d is not supported on any preferred backend [GpuAcc ]
Warning: WARNING: Layer of type Convolution2d is not supported on requested backend GpuAcc for input data type Float32 and output data type Float32 (reason: ArmNN ClDepthwiseConv2dWorkload does not support non constant bias.), falling back to the next backend.
Warning: ERROR: Layer of type Convolution2d is not supported on any preferred backend [GpuAcc ]


CpuAcc:

Error: ArmNN Failed to visit node with error: in data_size_from_type ./arm_compute/core/utils/DataTypeUtils.h:67: Invalid data type (this error repeats hundreds of times), then:

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.71 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: ConvertConstDequantisationLayersToConstLayersImpl::ReplaceConstDequantisationLayer() Info: constantInfo datatype:Float16inputDequantizeInfo datatype:Float16outputDequantizeInfo datatype:Float32 Info: ConvertConstDequantisationLayersToConstLayersImpl:: Converting FP16 -> FP32 this last one repeats many many times, then:

INFO: TfLiteArmnnDelegate: Added backend CpuAcc


RuntimeError                              Traceback (most recent call last)
Cell In[22], line 11
      7 armnn_delegate = tflite.experimental.load_delegate(
            library="/home/federico/Documents/code/ARM/aarch64_build/delegate/libarmnnDelegate.so",
      8     options={"backends": "CpuAcc", "logging-severity":"trace"})
     10 # Delegates/Executes all operations supported by Arm NN to/with Arm NN
---> 11 interpreter = tflite.Interpreter(model_path="../models/feather_16bit.tflite", experimental_delegates=[armnn_delegate])
     13 interpreter.allocate_tensors()
     15 # Get input and output tensors.

File ~/miniconda3/envs/mlc-chat-venv/lib/python3.11/site-packages/tensorflow/lite/python/interpreter.py:513, in Interpreter.__init__(self, model_path, model_content, experimental_delegates, num_threads, experimental_op_resolver_type, experimental_preserve_all_tensors, experimental_disable_delegate_clustering)
    511 self._delegates = experimental_delegates
    512 for delegate in self._delegates:
--> 513 self._interpreter.ModifyGraphWithDelegate(
    514     delegate._get_native_delegate_pointer())  # pylint: disable=protected-access
    515 self._signature_defs = self.get_signature_list()
    517 self._metrics = metrics.TFLiteMetrics()

RuntimeError: TfLiteArmnnDelegate: Exception (Failed to assign a backend to each layer) caught from optimize.

type:Float32 Info: ConvertConstDequantisationLayersToConstLayersImpl:: Converting FP16 -> FP32 Info: ConvertConstDequantisationLayersToConstLayersImpl::ReplaceConstDequantisationLayer() Info: constantInfo datatype:Float16inputDequantizeInfo datatype:Float16outputDequantizeInfo datatype:Float32 this last message repeats many many times, then:

Info: Optimize ArmnnSubgraph time: 3.51 ms Info: Load ArmnnSubgraph time: 1.09 ms Info: Overall ArmnnSubgraph creation time: 5.57 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.04 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Warning: WARNING: Layer of type Convolution2d is not supported on requested backend CpuAcc for input data type Float32 and output data type Float32 (reason: in validate src/runtime/NEON/functions/NEConvolutionLayer.cpp:134: Dynamic weights are not supported), falling back to the next backend.
Warning: ERROR: Layer of type Convolution2d is not supported on any preferred backend [CpuAcc ]
Warning: WARNING: Layer of type Convolution2d is not supported on requested backend CpuAcc for input data type Float32 and output data type Float32 (reason: in validate src/runtime/NEON/functions/NEConvolutionLayer.cpp:134: Dynamic weights are not supported), falling back to the next backend.
Warning: ERROR: Layer of type Convolution2d is not supported on any preferred backend [CpuAcc ]
Debug: RuntimeImpl::UnloadNetwork(): Unloaded network with ID: 39
Debug: RuntimeImpl::UnloadNetwork(): Unloaded network with ID: 40


CpuRef:

This one WORKS. Yet there are warnings:

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.67 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: ConvertConstDequantisationLayersToConstLayersImpl::ReplaceConstDequantisationLayer() Info: constantInfo datatype:Float16inputDequantizeInfo datatype:Float16outputDequantizeInfo datatype:Float32 Info: ConvertConstDequantisationLayersToConstLayersImpl:: Converting FP16 -> FP32 this message happens many many times, then:

Info: Optimize ArmnnSubgraph time: 3.32 ms Info: Load ArmnnSubgraph time: 0.40 ms Info: Overall ArmnnSubgraph creation time: 4.73 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.05 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.46 ms Info: Load ArmnnSubgraph time: 0.09 ms Info: Overall ArmnnSubgraph creation time: 0.70 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.09 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.49 ms Info: Load ArmnnSubgraph time: 0.11 ms Info: Overall ArmnnSubgraph creation time: 0.75 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.19 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 1.25 ms Info: Load ArmnnSubgraph time: 0.21 ms Info: Overall ArmnnSubgraph creation time: 1.75 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.04 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.29 ms Info: Load ArmnnSubgraph time: 0.05 ms Info: Overall ArmnnSubgraph creation time: 0.43 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.09 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.58 ms Info: Load ArmnnSubgraph time: 0.11 ms Info: Overall ArmnnSubgraph creation time: 0.84 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.04 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.30 ms Info: Load ArmnnSubgraph time: 0.06 ms Info: Overall ArmnnSubgraph creation time: 0.44 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.07 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.44 ms Info: Load ArmnnSubgraph time: 0.09 ms Info: Overall ArmnnSubgraph creation time: 0.64 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.04 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.31 ms Info: Load ArmnnSubgraph time: 0.06 ms Info: Overall ArmnnSubgraph creation time: 0.45 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.08 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.54 ms Info: Load ArmnnSubgraph time: 0.10 ms Info: Overall ArmnnSubgraph creation time: 0.78 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.06 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.52 ms Info: Load ArmnnSubgraph time: 0.10 ms Info: Overall ArmnnSubgraph creation time: 0.74 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.06 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.45 ms Info: Load ArmnnSubgraph time: 0.10 ms Info: Overall ArmnnSubgraph creation time: 0.67 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.03 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.30 ms Info: Load ArmnnSubgraph time: 0.06 ms Info: Overall ArmnnSubgraph creation time: 0.43 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.09 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.50 ms Info: Load ArmnnSubgraph time: 0.10 ms Info: Overall ArmnnSubgraph creation time: 0.74 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.06 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.53 ms Info: Load ArmnnSubgraph time: 0.10 ms Info: Overall ArmnnSubgraph creation time: 0.75 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.07 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.47 ms Info: Load ArmnnSubgraph time: 0.10 ms Info: Overall ArmnnSubgraph creation time: 0.70 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.03 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.30 ms Info: Load ArmnnSubgraph time: 0.07 ms Info: Overall ArmnnSubgraph creation time: 0.44 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.02 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.21 ms Info: Load ArmnnSubgraph time: 0.04 ms Info: Overall ArmnnSubgraph creation time: 0.31 ms

Debug: RuntimeImpl::UnloadNetwork(): Unloaded network with ID: 19 (and similarly for IDs 20 through 36)
INFO: TfLiteArmnnDelegate: Added backend CpuRef
WARNING: CAST: not supported by armnn: Reference cast: input is not a supported type (this warning repeats many times)

Again, CpuRef, even with all these messages, does work. CpuAcc and GpuAcc do not. I'm building from the main branch after pulling all changes as of today and applying your patch.

federicoparra commented 2 months ago

Using the feather_8bit_dynamic.tflite instead, both CpuAcc and CpuRef work (and indeed CpuAcc is pretty fast on my Orange Pi 5B!). But GpuAcc still has errors:

Error: ArmNN Failed to visit node with error: in data_size_from_type ./arm_compute/core/utils/DataTypeUtils.h:67: Invalid data type (this error appears many times), then:
Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.07 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.51 ms Info: Load ArmnnSubgraph time: 331.17 ms Info: Overall ArmnnSubgraph creation time: 331.88 ms

Info: ArmnnSubgraph creation Info: Parse nodes to ArmNN time: 0.11 ms Debug: OptimizerOptions: ReduceFp32ToFp16: 0 ReduceFp32ToBf16: 0 Debug: 0 Debug to file: 0 ShapeInferenceMethod: ValidateOnly ImportEnabled: 0 ExportEnabled: 0 ProfilingEnabled: 0 AllowExpandedDims: 0 ModelOptions:

Info: Optimize ArmnnSubgraph time: 0.65 ms
Error: An error occurred when preparing the network workloads: Convolution2dQueueDescriptor: input & weight must have identical data types.
Debug: RuntimeImpl::UnloadNetwork(): Unloaded network with ID: 106
Debug: RuntimeImpl::UnloadNetwork(): Unloaded network with ID: 135
INFO: TfLiteArmnnDelegate: Added backend GpuAcc
WARNING: FULLY_CONNECTED: not supported by armnn: in validate src/gpu/cl/operators/ClFullyConnected.cpp:473: Tensors have different data types


RuntimeError                              Traceback (most recent call last)
Cell In[55], line 11
      7 armnn_delegate = tflite.experimental.load_delegate(
            library="/home/federico/Documents/code/ARM/aarch64_build/delegate/libarmnnDelegate.so",
      8     options={"backends": "GpuAcc", "logging-severity":"trace"})
     10 # Delegates/Executes all operations supported by Arm NN to/with Arm NN
---> 11 interpreter = tflite.Interpreter(model_path="../models/feather_8bit_dynamic.tflite", experimental_delegates=[armnn_delegate])
     13 interpreter.allocate_tensors()
     15 # Get input and output tensors.

File ~/miniconda3/envs/mlc-chat-venv/lib/python3.11/site-packages/tensorflow/lite/python/interpreter.py:513, in Interpreter.__init__(self, model_path, model_content, experimental_delegates, num_threads, experimental_op_resolver_type, experimental_preserve_all_tensors, experimental_disable_delegate_clustering)
    511 self._delegates = experimental_delegates
    512 for delegate in self._delegates:
--> 513 self._interpreter.ModifyGraphWithDelegate(
    514     delegate._get_native_delegate_pointer())  # pylint: disable=protected-access
    515 self._signature_defs = self.get_signature_list()
    517 self._metrics = metrics.TFLiteMetrics()

RuntimeError: TfLiteArmnnDelegate: Exception (TfLiteArmnnDelegate: Network could not be loaded: An error occurred when preparing the network workloads: Convolution2dQueueDescriptor: input & weight must have identical data types.) caught from LoadNetwork.

federicoparra commented 2 months ago

OK, the feather_32bit.tflite does work with GpuAcc, CpuAcc and CpuRef - great work!

However, comparing the inference speed of the 32-, 16- and 8-bit versions of the model using CpuRef (the only backend on which all three work), one can see that the 8-bit version is much, much faster - so presumably the 16- and 8-bit versions would also be faster if they ran (instead of crashing) on the CpuAcc and GpuAcc backends.

So if you could check the errors, particularly those related to the GpuAcc backend with the feather_8bit_dynamic.tflite version, that would be so helpful!

Thank you! Federico

Colm-in-Arm commented 2 months ago

Hello Federico,

Just a point on feather_16bit.tflite: this is not an FP16 model. The input is FP32, and the weights are quantized to FP16, with the first operator on them being a dequantize back to FP32. TfLite has the concept of post-training quantization; it is up to the backend/accelerator to identify this structure and fully rewrite the model to use FP16 kernels. I believe the TfLite GPU backend does support this, but Arm NN does not.

The result is that you will not see any performance increase between FP16 and FP32 with this model on Arm NN.

Colm.
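As an aside, one way to see the structure described above is to inspect the tensor types from Python. This is a minimal sketch using standard tf.lite Interpreter calls, with the filename taken from this thread:

```python
import numpy as np
from tensorflow import lite as tflite

interpreter = tflite.Interpreter(model_path="feather_16bit.tflite")
interpreter.allocate_tensors()

# The graph input reports float32 even in the "16-bit" file.
print(interpreter.get_input_details()[0]["dtype"])

# The FP16 tensors are the quantized weights that feed DEQUANTIZE ops.
fp16_tensors = [t["name"] for t in interpreter.get_tensor_details()
                if t["dtype"] == np.float16]
print(len(fp16_tensors), "float16 weight tensors")
```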

federicoparra commented 2 months ago

Thanks @Colm-in-Arm. Three follow-ups:

1) How about 8-bit quantized TFLite models? Does Arm NN accelerate those with GpuAcc?
2) Would it make sense to transform the model entirely to FP16 prior to the conversion to TFLite? Would the TFLite models in that case really use FP16 kernels?
3) Any idea about the errors I'm having with the FP16 and 8-bit versions of the model?

Thank you!!!

Federico

Colm-in-Arm commented 2 months ago

1: Yes, Arm NN will accelerate an 8-bit quantized model using the GpuAcc and CpuAcc backends. The usual restrictions on supported layers apply.

2: I've never attempted the kinds of conversions you're doing. I would hope that TensorFlow would honour an FP16 input model and create a native FP16 TFLite model. I would be happy to try it if you get the conversion to work. (A contrasting sketch of the standard post-training float16 recipe follows this comment.)

3: feather_8bit_dynamic.tflite on GpuAcc

Something is going wrong with the ACL layer validation here. It is returning that this CONV2D layer is supported but in Arm NN we have 3 different reasons why this workload should not be created. When I remove these restrictions the layer fails in ACL. I'll have to check with @morgolock

feather_16bit.tflite on CpuAcc and GpuAcc

I can see the model is executing but the results are garbage. This will require a layer by layer comparison which will take some time.

Colm.
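For contrast with point 2: the standard TfLite post-training float16 recipe below is what typically produces the FP16-weights-plus-DEQUANTIZE structure described earlier in this thread, rather than a native FP16 model. This is a hedged sketch with a hypothetical SavedModel path, not the conversion actually used for the attached files:

```python
import tensorflow as tf

# Post-training float16 quantization: weights are stored as FP16 and
# dequantized back to FP32 at runtime, so compute stays FP32.
# "feather_saved_model" is a hypothetical export path.
converter = tf.lite.TFLiteConverter.from_saved_model("feather_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

with open("feather_16bit.tflite", "wb") as f:
    f.write(converter.convert())
```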

federicoparra commented 2 months ago

Thank you! Given that the 16-bit model will not improve speed relative to the 32-bit model, the most important version to make work is the 8-bit one.

As a general rule, what is faster in Arm NN (on Mali with GpuAcc): 8-bit models or 16/32-bit models?

Thanks! Federico

federicoparra commented 2 months ago

Hey @Colm-in-Arm, good news: I created a new version of the converted model using these guidelines, https://www.tensorflow.org/lite/performance/post_training_integer_quant#convert_using_integer-only_quantization (except that I did not quantize the inputs and outputs), leading to this model: feather_8bit_inside.zip. This form of conversion ensures every single operation inside the model is an int8 operator/kernel (with the exception of the input/output casting operations, since the inputs/outputs are floats).
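For anyone trying to reproduce this kind of conversion, the recipe from that guide with float inputs/outputs looks roughly like the following. This is a sketch: the SavedModel path and calibration data are hypothetical, and real sample images should be used for calibration.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Hypothetical calibration loop; substitute real input images.
    for _ in range(100):
        yield [np.random.rand(1, 400, 600, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("feather_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force int8 kernels for every internal op; inputs/outputs stay float32
# by default, matching the description above.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

with open("feather_8bit_inside.tflite", "wb") as f:
    f.write(converter.convert())
```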

The good news: Arm NN with the patch you shared in this bug report does work with this 8-bit model! It reports several errors with both GpuAcc and CpuAcc, but it works anyhow (errors below).

Now the bad news: it's significantly slower than the float version, and it is slower with GpuAcc than with CpuAcc. I have made this observation before: several times I have found that float models run faster than int models in Arm NN, and that int models run faster on CpuAcc than on GpuAcc, which I took to mean that int models were useless in the sense that Mali is not prepared to accelerate them, or at least not as much as float operations. Can you confirm this? I'm trying to focus my effort on creating the fastest possible models for the Mali GPU.

Here are the errors I see with this model when loading with CpuAcc (first) and GpuAcc (second):

CpuAcc errors:
Error: ArmNN Failed to visit node with error: in data_size_from_type ./arm_compute/core/utils/DataTypeUtils.h:67: Invalid data type (that error repeats hundreds of times)

GpuAcc errors:
Error: ArmNN Failed to visit node with error: in data_size_from_type ./arm_compute/core/utils/DataTypeUtils.h:67: Invalid data type
WARNING: ELEMENTWISE_UNARY: not supported by armnn: in validate_arguments src/gpu/cl/kernels/ClElementwiseUnaryKernel.cpp:65: ITensor data type QASYMM8_SIGNED not supported by this kernel
(the error above repeats many times), then:
Error: ArmNN Failed to visit node with error: in data_size_from_type ./arm_compute/core/utils/DataTypeUtils.h:67: Invalid data type (repeats dozens of times), then:
WARNING: ELEMENTWISE_UNARY: not supported by armnn: in validate_arguments src/gpu/cl/kernels/ClElementwiseUnaryKernel.cpp:65: ITensor data type QASYMM8_SIGNED not supported by this kernel (repeats around 6 times)

I repeat, both still work; they are just slower than the float versions on either backend.

Colm-in-Arm commented 2 months ago

Hello Federico,

It's never as simple as "INT8 is slower on GpuAcc than FP32"; there are a multitude of factors involved. Something to consider: the first time a GPU inference happens, the kernels are compiled, so you should probably disregard the first inference. If you run 10 iterations, ignore the first, and compare the execution to CpuAcc, you'll get a better impression of the relative speeds. You can avoid this initial overhead by caching the previously tuned network data (see the save-cached-network, cached-network-filepath, tuning-level and tuning-path options).
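A rough way to measure this along those lines is sketched below. It assumes `interpreter` was already created with the Arm NN delegate, as in the earlier snippets in this thread:

```python
import time
import numpy as np

# Feed a dummy input with the model's expected shape and dtype.
inp = interpreter.get_input_details()[0]
dummy = np.random.rand(*inp["shape"]).astype(inp["dtype"])
interpreter.set_tensor(inp["index"], dummy)

interpreter.invoke()  # warm-up: on GpuAcc the OpenCL kernels are compiled here

times = []
for _ in range(10):
    start = time.perf_counter()
    interpreter.invoke()
    times.append(time.perf_counter() - start)
print(f"mean inference time: {1000 * sum(times) / len(times):.2f} ms")
```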

There is a script delivered with ExecuteNetwork, evaluate_network.sh, that will try to find the fastest way to run a model inference. From memory, it requires the model to be supported by the parser, so it's not much use for this model.

The most likely cause of the errors from ./arm_compute/core/utils/DataTypeUtils.h:67 is a Boolean data type. Part of the review was to allow the Boolean data type to propagate down to ACL for it to be rejected by the validate method. This is the main reason I've not progressed this review; I need to work on a better way to do this. Once the layer is rejected by ACL it will fall back to the TfLite runtime.

Colm.