VeriSilicon / tflite-vx-delegate

Tensorflow Lite external delegate based on TIM-VX
MIT License

Mismatch between CPU and NPU in simple Conv2D #210

Open deneriz-veridas opened 3 weeks ago

deneriz-veridas commented 3 weeks ago

Hi,

We have been using the VX delegate to run TFLite models on the NPU of the i.MX 8M Plus, a VeriSilicon VIPNano-SI+. In doing so, we have found mismatches between CPU and NPU execution of the same model, even for a model consisting of a single Conv2D with a 3x3 kernel and 'same' padding. This plot shows the distribution of the mismatch.

image
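
For readers who want to reproduce this kind of distribution, here is a minimal sketch, assuming the quantized output tensors from the CPU run and the NPU run have already been flattened into two equal-length lists of integers. The function name `mismatch_histogram` and the sample values are hypothetical and are not taken from the issue's data.

```python
from collections import Counter


def mismatch_histogram(cpu_out, npu_out):
    """Count how often each signed difference (npu - cpu) occurs."""
    if len(cpu_out) != len(npu_out):
        raise ValueError("outputs must have the same number of elements")
    return Counter(n - c for c, n in zip(cpu_out, npu_out))


# Hypothetical flattened int8 outputs of the same Conv2D on each backend.
cpu = [12, -7, 0, 55, 127, -128, 3, 3]
npu = [12, -6, 0, 54, 127, -128, 3, 4]

hist = mismatch_histogram(cpu, npu)
# Fraction of elements whose value differs between the two backends.
mismatch_rate = 1 - hist[0] / len(cpu)

for diff in sorted(hist):
    print(f"diff {diff:+d}: {hist[diff]} elements")
print(f"mismatch rate: {mismatch_rate:.3f}")
```

Comparing the raw quantized integers, rather than the dequantized floats, keeps the differences in units of one quantization step.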

Moreover, these mismatch errors propagate through successive layers of the model. This file (conv-sequence.zip) contains the decomposition of a model with 20 Conv2D layers into 20 models, each adding one layer to the previous one, which allows the mismatch to be measured after each layer. The following plot shows this propagation across the model.

image
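
The per-layer measurement described above could be summarized along these lines. This is only a sketch: it assumes each truncated model has already been run on both backends and its quantized output flattened into a list of integers. `error_by_depth` is a hypothetical helper and the numbers are made up for illustration.

```python
def error_by_depth(pairs):
    """Summarize CPU/NPU error per truncated model.

    pairs[i] holds (cpu_out, npu_out) for the model with i + 1 layers.
    Returns a list of (depth, max_abs_error, mean_abs_error) tuples.
    """
    rows = []
    for depth, (cpu_out, npu_out) in enumerate(pairs, start=1):
        diffs = [abs(n - c) for c, n in zip(cpu_out, npu_out)]
        rows.append((depth, max(diffs), sum(diffs) / len(diffs)))
    return rows


# Made-up outputs for three truncated models, for illustration only.
pairs = [
    ([10, 20, 30, 40], [10, 21, 30, 40]),
    ([5, -5, 15, 25], [6, -5, 13, 25]),
    ([0, 8, -9, 2], [2, 8, -12, 1]),
]

rows = error_by_depth(pairs)
for depth, max_err, mean_err in rows:
    print(f"layers={depth} max_abs={max_err} mean_abs={mean_err:.2f}")
```

Reporting both the maximum and the mean helps separate a few large outliers from a general drift across all elements.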

Is there a way to avoid this mismatch? Is this a known issue with this NPU?

We are using TFLite Runtime 2.9.1.1 and the forked iMX delegate under version lf-5.15.71_2.2.0.

jetxeberria commented 3 weeks ago

I'm also seeing mismatch between CPU and NPU executions. This is very annoying! Help please!

sunshinemyson commented 2 weeks ago

@deneriz-veridas @jetxeberria,

Thanks for your feedback, and for the very nice data analysis. Our NPU integer math is not bit-accurate compared to the TFLite CPU implementation: for a single layer, the distance is 1 bit.
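
As general background, and not a statement about what this NPU actually does, one common way two integer implementations end up 1 bit apart is a different rounding rule when the 32-bit accumulator is scaled back to 8 bits. The sketch below compares round-half-up with plain truncation of the fractional bits. The functions, the multiplier and the shift are all hypothetical.

```python
def requantize_round(acc, mult, shift):
    """Scale acc by mult / 2**shift in fixed point, rounding half up."""
    return (acc * mult + (1 << (shift - 1))) >> shift


def requantize_floor(acc, mult, shift):
    """Same scaling, but the fractional bits are simply discarded (floor)."""
    return (acc * mult) >> shift


# Hypothetical fixed-point multiplier: 77 / 2**10 is about 0.0752.
MULT, SHIFT = 77, 10
accs = [-400, -93, 0, 7, 120, 333, 1000]

pairs = [
    (requantize_round(a, MULT, SHIFT), requantize_floor(a, MULT, SHIFT))
    for a in accs
]
max_gap = max(abs(r - f) for r, f in pairs)

for a, (r, f) in zip(accs, pairs):
    print(f"acc={a:5d} round={r:4d} floor={f:4d}")
print("largest gap:", max_gap)
```

Under these assumptions the two results never differ by more than one quantization step within a single layer, which matches the magnitude described above, though the actual cause on the NPU may be different.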

In our experience, the difference doesn't impact top-1 accuracy on MobileNet-v1. We usually check the result from the application point of view, such as labels/boxes, rather than comparing the absolute error between CPU and NPU.
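
An application-level check of this kind might look like the following sketch, which measures how often the CPU and NPU agree on the top-1 label. It assumes each backend returns one list of class scores per input. `top1_agreement` and the scores are hypothetical.

```python
def top1(scores):
    """Index of the highest score (the first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def top1_agreement(cpu_batch, npu_batch):
    """Fraction of inputs for which both backends pick the same label."""
    matches = sum(top1(c) == top1(n) for c, n in zip(cpu_batch, npu_batch))
    return matches / len(cpu_batch)


# Hypothetical quantized class scores for three inputs.
cpu_batch = [[3, 90, 7], [50, 49, 1], [0, 2, 120]]
npu_batch = [[4, 89, 7], [49, 50, 1], [0, 3, 119]]

agreement = top1_agreement(cpu_batch, npu_batch)
print(f"top-1 agreement: {agreement:.3f}")
```

In this made-up data the second input is a near tie, so a one-step difference flips its label, while inputs with a clear winner are unaffected.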

deneriz-veridas commented 1 week ago

Hi @sunshinemyson,

Thanks for having a look at this issue. I understand these errors can have minimal impact in classification applications. However, we are working with a model that generates embeddings, which we use to compute the distances between them. In this application, the errors matter much more.
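
To make the embedding case concrete, here is a small sketch that measures how far an NPU embedding drifts from the CPU embedding of the same input, using cosine distance. The choice of cosine distance, the quantization parameters and the 8-element vectors are assumptions for illustration; the thread does not state which metric or embedding size is used.

```python
import math


def dequantize(q, scale, zero_point):
    """Map quantized integers back to real values."""
    return [scale * (v - zero_point) for v in q]


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical output quantization parameters and embeddings.
SCALE, ZERO_POINT = 0.05, 0
cpu_q = [10, -4, 7, 0, 3, -9, 2, 5]
npu_q = [11, -4, 6, 0, 3, -8, 2, 5]

cpu_emb = dequantize(cpu_q, SCALE, ZERO_POINT)
npu_emb = dequantize(npu_q, SCALE, ZERO_POINT)

# Distance between two embeddings of the SAME input on different backends.
drift = cosine_distance(cpu_emb, npu_emb)
print(f"CPU vs NPU drift: {drift:.6f}")
```

A drift like this only matters relative to the margin between matching and non-matching pairs, so comparing it against the application's decision threshold gives a more direct measure of impact than the raw element-wise error.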

Could you elaborate on the bit-accuracy of the NPU integer math? Have you characterized when this happens? We are looking for a way to avoid or mitigate this. Thanks in advance!