VeriSilicon / TIM-VX

VeriSilicon Tensor Interface Module

Multiple downstream outputs bug #226

Closed: bkovalenkocomp closed this issue 2 years ago

bkovalenkocomp commented 2 years ago

Sorry for making duplicates. I'm not sure whether the bug is in vx-delegate or in TIM-VX: https://github.com/VeriSilicon/tflite-vx-delegate/issues/32

Hi, I think I found a bug in the vx-delegate runtime.

setup: A311D + Android 9 + TensorFlow Lite with vx-delegate

Model: a detector that outputs multiple tensors: bounding boxes, landmarks, probability scores, and feature vectors.

Problem: the landmarks outputs are garbage. How the model produces landmarks:

input image -> backbone -> FPN -> Conv layers that produces features (OUTPUT 1) -> Conv layers that produces landmarks (OUTPUT 2)

So if I have two outputs that sit downstream one after another, the second output is not calculated and I get garbage. The problem occurs only with the INT8 graph; the FP32 graph works fine with vx-delegate.

On x86 with standard TFLite (and xnnpack) everything works fine with both INT8 and FP32 graphs.

Update: the downstream arrangement is not the issue. Even if the landmarks branch of the graph has only the landmarks as outputs, I still get garbage. For some reason the landmarks part of the graph is not calculated on the NPU.

What could be the problem? Thanks.
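For triage, a tensor-by-tensor comparison between the CPU and delegate runs shows exactly which outputs diverge. A minimal numpy sketch (the helper name and the sample numbers are illustrative, taken loosely from the values quoted later in this thread, not from the actual program):

```python
import numpy as np

def diff_outputs(cpu_outputs, npu_outputs, atol):
    """Max absolute difference per output tensor (name -> ndarray),
    plus whether it is within tolerance."""
    report = {}
    for name, cpu in cpu_outputs.items():
        diff = float(np.max(np.abs(cpu.astype(np.float32)
                                   - npu_outputs[name].astype(np.float32))))
        report[name] = (diff, diff <= atol)
    return report

# Illustrative values: bboxes roughly agree, landmarks do not.
cpu = {"bbox": np.array([144.2, 180.5]), "landmarks": np.array([147.5, 267.9])}
npu = {"bbox": np.array([142.7, 185.5]), "landmarks": np.array([-319.0, -199.0])}
report = diff_outputs(cpu, npu, atol=10.0)
```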

sunshinemyson commented 2 years ago

I tried yolo-v3-tiny (two outputs) on imx8plus and it passes. Would you mind sharing the model with us?

bkovalenkocomp commented 2 years ago

> I tried yolo-v3-tiny (two outputs) on imx8plus and it passes. Would you mind sharing the model with us?

I can't share the model with trained weights, but I can share the model filled with ones. The architecture will be the same, but I'm not sure the problem reproduces there. Could you give me a VeriSilicon email, please?

Or here is a link: https://disk.yandex.ru/d/4j1PSSpVqe0VLg. It's an 8-bit tensor-quantized model, originally filled with ones.

The total number of outputs in my graph is 6 (two sets of anchors: probs, bboxes, landmarks), but with the feature vectors the number of outputs is 10.

sunshinemyson commented 2 years ago

It runs on my imx8plus with a 6 ms inference time. I think you may need to update your low-level driver on A311D + Android 9. We don't have an Android prebuilt for A311D, only the Linux version.

You can try it and export VIV_VX_DEBUG_LEVEL=1 and VSI_NN_LOG_LEVEL=5 if it still fails.
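As a sketch, enabling the suggested debug logging looks like this (the application name below is a placeholder, not from this thread):

```shell
# Turn on verbose logging in the VeriSilicon driver stack
# before launching the TFLite application.
export VIV_VX_DEBUG_LEVEL=1
export VSI_NN_LOG_LEVEL=5
# ./detector_app model_int8.tflite   # placeholder for your own binary
```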

bkovalenkocomp commented 2 years ago

> It runs on my imx8plus with a 6 ms inference time. I think you may need to update your low-level driver on A311D + Android 9. We don't have an Android prebuilt for A311D, only the Linux version.
>
> You can try it and export VIV_VX_DEBUG_LEVEL=1 and VSI_NN_LOG_LEVEL=5 if it still fails.

On A311D + Linux I observed the same problem: with driver version 6.4.8 on Linux and 6.4.3 on Android.

bkovalenkocomp commented 2 years ago

On Android, export VIV_VX_DEBUG_LEVEL=1 && export VSI_NN_LOG_LEVEL=5 didn't change the output log. Everything looks fine; probability and bboxes look fine, but the landmarks are not calculated or not copied from the NPU.

...
Creating Conv2d op
Creating Concatenation op
Create Transpose op
Creating Reshape op
Create Transpose op
Creating softmax op
Create Transpose op
Creating Reshape op
Creating Dequantize op
Create Transpose op
Creating Reshape op
Create Transpose op
Creating softmax op
Create Transpose op
Creating Reshape op
Creating Dequantize op
Create Transpose op
Creating Dequantize op
Create Transpose op
Creating Dequantize op
Create Transpose op
Creating Dequantize op
Create Transpose op
Creating Dequantize op
...
Verifying graph
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
Verified graph
...
Delegate::Invoke node:0x9eb22b80
Copying input 0:data:0
Invoking graph
Copying output 529:face_rpn_cls_score_stride32_out:0
Copying output 536:face_rpn_cls_score_stride16_out:0
Copying output 538:transpose_5:0
Copying output 540:transpose_6:0
Copying output 542:transpose_7:0
Copying output 544:transpose_8:0
...
bkovalenkocomp commented 2 years ago

It seems that the landmarks branch of the graph is calculated, because the features from that branch look OK. The downstream landmarks tensors do not. Maybe there is some kind of limit on depth?

sunshinemyson commented 2 years ago

@bkovalenkocomp, no, we don't have such a limitation. Is it possible to create a smaller graph that includes only the landmark branch? I think this could help narrow down the issue. Maybe we have a mistake in a special layer mapping.

BTW, on Android you need to use setprop instead of export.
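A hedged sketch of the Android side (whether the driver reads these exact property names is an assumption; adjust to your BSP):

```shell
# Hypothetical helper: set the VeriSilicon debug properties on an attached
# device via adb. Property names assumed to mirror the env var names.
enable_vsi_debug() {
  adb shell setprop VIV_VX_DEBUG_LEVEL "$1"
  adb shell setprop VSI_NN_LOG_LEVEL "$2"
}
# Usage: enable_vsi_debug 1 5
```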

bkovalenkocomp commented 2 years ago

> @bkovalenkocomp, no, we don't have such a limitation. Is it possible to create a smaller graph that includes only the landmark branch? I think this could help narrow down the issue. Maybe we have a mistake in a special layer mapping.
>
> BTW, on Android you need to use setprop instead of export.

Here is a model that outputs the 2 landmarks tensors: https://disk.yandex.ru/d/eNPrMgv8zyoCrA

Here is the full debug log: https://disk.yandex.ru/d/toigPqKSUuIsqA

sunshinemyson commented 2 years ago

@bkovalenkocomp ,

We will check your model. Thanks

sunshinemyson commented 2 years ago

@bkovalenkocomp, we tried your model from https://disk.yandex.ru/d/eNPrMgv8zyoCrA on CPU; the output appears to be all zeros. I'll upload a test script soon. Could you double-check it on CPU then?

bkovalenkocomp commented 2 years ago

> @bkovalenkocomp, we tried your model from https://disk.yandex.ru/d/eNPrMgv8zyoCrA on CPU; the output appears to be all zeros. I'll upload a test script soon. Could you double-check it on CPU then?

Yes, an all-zeros output is OK; it's a fake model with the same architecture but filled with zeros/small random numbers.

sunshinemyson commented 2 years ago

I created a run_model.py in PR https://github.com/VeriSilicon/tflite-vx-delegate/pull/37/files. You can try it.

sunshinemyson commented 2 years ago

Maybe you can try your real model with my tool. I suppose it could make debugging easier.

bkovalenkocomp commented 2 years ago

> Maybe you can try your real model with my tool. I suppose it could make debugging easier.

@sunshinemyson

I have a C++ program with TFLite and vx-delegate that runs the model on the A311D. The same program works fine on x86 with TFLite + XNNPACK and on Android Pixel phones with TFLite + NNAPI.

bkovalenkocomp commented 2 years ago

@sunshinemyson

Output of my model on A311D, TFLite + XNNPACK on CPU:

Score: 0.996094
BBox out: 144.205 180.517 289.564 393.582
LM: 147.492 267.938
LM: 147.492 286.323
LM: 147.492 304.708
LM: 151.169 323.092
LM: 158.523 341.477
LM: 165.877 356.185
LM: 176.907 370.893
LM: 187.938 381.924
LM: 206.323 385.601
LM: 228.385 385.601
LM: 246.769 374.57
LM: 265.154 363.539
LM: 283.539 345.154
LM: 290.893 326.769
LM: 298.247 304.708
LM: 298.247 282.646
LM: 298.247 260.584
LM: 143.815 253.231
LM: 151.169 242.2
LM: 162.2 238.523
...

Output of my model on A311D, TFLite + vx-delegate on NPU:

Score: 0.996094
BBox out: 142.731 185.545 289.715 390.521
LM: -319.034 -199.034
LM: -319.034 -199.034
LM: -319.034 -199.034
LM: -319.034 -199.034
LM: -319.034 -199.034
LM: -319.034 -199.034
LM: -319.034 -199.034
LM: -319.034 -199.034
LM: -319.034 -199.034
LM: -319.034 -199.034
LM: -319.034 -184.326
LM: -319.034 -199.034
LM: -319.034 -199.034
LM: -319.034 -199.034
LM: -319.034 -199.034
LM: -319.034 -199.034
LM: -319.034 -199.034
LM: -319.034 -195.357
LM: -319.034 -199.034
LM: -308.003 -199.034

The model is INT8 tensor-quantized and outputs bbox detections + landmarks + probability score.

bkovalenkocomp commented 2 years ago

@sunshinemyson

Here is a sample (not a 1-to-1 correspondence) from the output tensor of my model (the raw tensor directly out of the model, without any post-processing):

A311D TFLite + VX-Delegate on NPU:

Landmarks tensor: -2.61472
Landmarks tensor: -2.69643
Landmarks tensor: -2.61472
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.65557
Landmarks tensor: -2.65557
Landmarks tensor: -2.55344
Landmarks tensor: -2.69643
Landmarks tensor: -2.69643
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.676
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.47173
Landmarks tensor: -2.53301
Landmarks tensor: -2.51258
Landmarks tensor: -2.53301
Landmarks tensor: -2.51258
Landmarks tensor: -2.53301
Landmarks tensor: -2.63514
Landmarks tensor: -2.55344
Landmarks tensor: -2.57386
Landmarks tensor: -2.53301
Landmarks tensor: -2.49215
Landmarks tensor: -2.53301
Landmarks tensor: -2.41044
Landmarks tensor: -2.55344
Landmarks tensor: -2.55344
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.43087
Landmarks tensor: -2.71685
Landmarks tensor: -2.55344
Landmarks tensor: -2.71685
Landmarks tensor: -2.59429
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.63514
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.63514
Landmarks tensor: -2.71685
Landmarks tensor: -2.69643
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.676
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.61472
Landmarks tensor: -2.71685
Landmarks tensor: -2.57386
Landmarks tensor: -2.71685
Landmarks tensor: -2.69643
Landmarks tensor: -2.676
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.59429
Landmarks tensor: -2.69643
Landmarks tensor: -2.71685
Landmarks tensor: -2.676
Landmarks tensor: -2.71685
Landmarks tensor: -2.69643
Landmarks tensor: -2.16531
Landmarks tensor: -2.71685
Landmarks tensor: -2.65557
Landmarks tensor: -2.676
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.676
Landmarks tensor: -2.0836
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.69643
Landmarks tensor: -2.71685
Landmarks tensor: -2.65557
Landmarks tensor: -2.71685
Landmarks tensor: -2.71685
Landmarks tensor: -2.69643

A311D TFLite + XNNPACK on CPU:

Landmarks tensor: 0.49026
Landmarks tensor: -0.367695
Landmarks tensor: 0.469832
Landmarks tensor: -0.285985
Landmarks tensor: 0.40855
Landmarks tensor: -0.142992
Landmarks tensor: 0.40855
Landmarks tensor: -0.0612824
Landmarks tensor: 0.40855
Landmarks tensor: 0.0204275
Landmarks tensor: 0.428977
Landmarks tensor: -0.040855
Landmarks tensor: 0.49026
Landmarks tensor: -0.142992
Landmarks tensor: 0.510687
Landmarks tensor: -0.24513
Landmarks tensor: 0.510687
Landmarks tensor: -0.531114
Landmarks tensor: -0.102137
Landmarks tensor: -0.531114
Landmarks tensor: 0.0204275
Landmarks tensor: -0.510687
Landmarks tensor: 0.142992
Landmarks tensor: -0.510687
Landmarks tensor: 0.24513
Landmarks tensor: -0.469832
Landmarks tensor: 0.367695
Landmarks tensor: -0.428977
Landmarks tensor: 0.469832
Landmarks tensor: -0.347267
Landmarks tensor: 0.571969
Landmarks tensor: -0.24513
Landmarks tensor: 0.633252
Landmarks tensor: -0.122565
Landmarks tensor: 0.633252

Unfortunately the fake zero model doesn't reproduce the problem.
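As a side observation, both dumps above are integer multiples of one step, which suggests an affine int8 scheme real = scale * (q - zero_point). A small sketch (scale ≈ 0.0204275 and zero_point = 5 are inferred from the printed values, so treat both as assumptions): under those numbers the repeated -2.71685 maps back to q = -128, the int8 minimum, i.e. the NPU tensor looks saturated or never written rather than merely inaccurate.

```python
# Sketch: invert an assumed affine int8 quantization to see which
# quantized value the repeated NPU output corresponds to.
SCALE = 0.0204275   # assumption: CPU values are integer multiples of this
ZERO_POINT = 5      # assumption: chosen so q stays inside the int8 range

def dequantize(q, scale=SCALE, zero_point=ZERO_POINT):
    return scale * (q - zero_point)

def quantize(real, scale=SCALE, zero_point=ZERO_POINT):
    return round(real / scale) + zero_point

q = quantize(-2.71685)
# q == -128, the int8 lower bound
```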

sunshinemyson commented 2 years ago

https://github.com/VeriSilicon/TIM-VX/releases/download/v1.1.34.fix/aarch64_A311D_6.4.8.tgz Just to double-confirm: are you using the 6.4.8 driver?

bkovalenkocomp commented 2 years ago

> https://github.com/VeriSilicon/TIM-VX/releases/download/v1.1.34.fix/aarch64_A311D_6.4.8.tgz Just to double-confirm: are you using the 6.4.8 driver?

@sunshinemyson

I observe the problem both with arm_android9_A311D 6.4.3 and aarch64_A311D 6.4.8.

For aarch64_A311D 6.4.8, the VERSION file contains: REL/6.4.8. For arm_android9_A311D 6.4.3, the VERSION file contains: Release ID: 6.4.3+1.

sunshinemyson commented 2 years ago

@bkovalenkocomp ,

We have a layer dump that may help you debug this issue: export VIV_VX_DEBUG_LEVEL=1 and export NN_LAYER_DUMP=1.

The dump files are saved as readable text in the same directory as your executable.

bkovalenkocomp commented 2 years ago

> @bkovalenkocomp,
>
> We have a layer dump that may help you debug this issue: export VIV_VX_DEBUG_LEVEL=1 and export NN_LAYER_DUMP=1.
>
> The dump files are saved as readable text in the same directory as your executable.

Here is the dump: https://disk.yandex.ru/d/jcx9JsUf_15DaQ (I removed the first 50 layers).

244_TensorCopy_operation_244.txt and 245_TensorCopy_operation_245.txt are filled with constants
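A quick way to triage a directory of those layer dumps is to flag files whose values are constant, the symptom seen in the TensorCopy dumps. The file format is assumed here to be whitespace-separated numbers, one text file per layer; the helper name is hypothetical:

```python
import glob
import os

def constant_dumps(dump_dir, tol=0.0):
    """Return dump files in dump_dir whose values all lie within tol
    of each other, i.e. suspiciously constant layer outputs."""
    flagged = []
    for path in sorted(glob.glob(os.path.join(dump_dir, "*.txt"))):
        with open(path) as f:
            values = [float(tok) for tok in f.read().split()]
        if values and max(values) - min(values) <= tol:
            flagged.append(os.path.basename(path))
    return flagged
```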

sunshinemyson commented 2 years ago

@bkovalenkocomp ,

Thanks. We are also working on a layer-dump tool at the tflite-vx-delegate level, which could make debugging easier. Hopefully we can have this tool this week; then we can reach some conclusion on your issue.

Thanks for your patience.

sunshinemyson commented 2 years ago

@bkovalenkocomp ,

We have disabled "resize bilinear to transpose conv" in the latest tflite-vx-delegate commit. Would you let us know whether it works for your model? Thanks.

bkovalenkocomp commented 2 years ago

> @bkovalenkocomp,
>
> We have disabled "resize bilinear to transpose conv" in the latest tflite-vx-delegate commit. Would you let us know whether it works for your model? Thanks.

I tried with bool can_resize_to_transposeconv = false, but unfortunately that didn't help ;-/

https://github.com/VeriSilicon/tflite-vx-delegate/blob/main/op_map.cc#L1020

bkovalenkocomp commented 2 years ago

@sunshinemyson We need that fix very much ;-). If you have an NDA with Amlogic and a contact there, they can provide additional info under SH-8759.

sunshinemyson commented 2 years ago

@bkovalenkocomp ,

Got it. Let me check it internally.

sunshinemyson commented 2 years ago

@bkovalenkocomp ,

I got the model from AML; we confirmed there is a software issue in TIM-VX and will fix it ASAP. Until then, you can remove the transpose + dequantize operators from the incorrect branch of your model, if that is feasible for you.

sunshinemyson commented 2 years ago

@bkovalenkocomp, the fix is under review now: https://github.com/VeriSilicon/TIM-VX/pull/250

bkovalenkocomp commented 2 years ago

> @bkovalenkocomp,
>
> #250: the fix is under review now.

Thank you very much for your help. It works now, with awesome speed.

But ;-) the output is still a little bit different. I will double-check and create a new issue, or it is probably related to https://github.com/VeriSilicon/tflite-vx-delegate/issues/35

bkovalenkocomp commented 2 years ago

thanks!

bkovalenkocomp commented 2 years ago

FYI: I used the latest master for both TIM-VX and vx-delegate.

bkovalenkocomp commented 2 years ago

FYI, the problem looks like this:

There are 2 faces in the image, a big one and a small one.

On x86 the probability scores were 0.99 for the big one and 0.98 for the small one; on the A311D NPU the scores are 0.74 for the big one and 0.98 for the small one.

Other outputs look fine, but the difference in the scores is suspicious.