PaddlePaddle / FastDeploy

⚡️An Easy-to-use and Fast Deep Learning Model Deployment Toolkit for ☁️Cloud 📱Mobile and 📹Edge. Including Image, Video, Text and Audio 20+ main stream scenarios and 150+ SOTA models with end-to-end optimization, multi-platform and multi-framework support.
https://www.paddlepaddle.org.cn/fastdeploy
Apache License 2.0
3.01k stars 465 forks

Testing the official example on rv1126 runs very slowly #1689

Closed 16lwzheng closed 2 months ago

16lwzheng commented 1 year ago

D [vsi_nn_kernel_selector:1265]Instance OPENVX node with kernel "conv2d"
D [compute_node:377]Instance node[135] "SWISH" ...
D [vsi_nn_kernel_selector:1265]Instance OPENVX node with kernel "swish"
D [compute_node:377]Instance node[136] "CONV2D" ...
D [vsi_nn_kernel_selector:1265]Instance OPENVX node with kernel "conv2d"
D [compute_node:377]Instance node[137] "SWISH" ...
D [vsi_nn_kernel_selector:1265]Instance OPENVX node with kernel "swish"
D [compute_node:377]Instance node[138] "CONCAT" ...
D [compute_node:377]Instance node[139] "CONV2D" ...
D [vsi_nn_kernel_selector:1265]Instance OPENVX node with kernel "conv2d"
D [compute_node:377]Instance node[140] "SWISH" ...
D [vsi_nn_kernel_selector:1265]Instance OPENVX node with kernel "swish"
D [compute_node:377]Instance node[141] "CONV2D" ...
D [vsi_nn_kernel_selector:1265]Instance OPENVX node with kernel "conv2d"
[3 3/23 3: 8:50.156 ...r/src/driver/verisilicon_timvx/engine.cc:191 Build] Build the tim-vx graph success.
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 3: 8:50.221 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 59198 us
[3 3/23 3: 8:50.235 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 78457 us
DetectionResult: [xmin, ymin, xmax, ymax, score, label_id]
103.668625,45.436493, 128.724014, 95.971954, 0.871636, 0
265.735657,81.108414, 300.017883, 173.137909, 0.839909, 0
156.409607,81.675583, 198.970245, 165.704285, 0.836028, 0
379.059906,40.282700, 395.753448, 82.530655, 0.829662, 0
361.590851,56.148071, 382.801788, 115.459198, 0.769370, 0
503.740662,117.393051, 593.712830, 270.002625, 0.755905, 0
328.305634,40.057297, 344.999176, 78.257736, 0.735399, 0
415.107574,88.961212, 505.079742, 286.959686, 0.706305, 0
580.929382,111.280991, 615.211609, 203.310486, 0.697555, 0
3.109375,149.812500, 39.140625, 172.250000, 0.573872, 24
351.475525,44.292389, 368.169067, 94.520966, 0.534139, 0
186.300156,45.818710, 200.014877, 60.994644, 0.510922, 0
54.859375,154.390625, 98.328125, 174.109375, 0.506969, 24
167.062500,84.718750, 395.984375, 343.468750, 0.498233, 33
168.894302,47.093414, 177.465988, 62.093231, 0.445955, 0
25.062500,117.843750, 58.796875, 153.296875, 0.405319, 24
68.062500,124.187500, 104.093750, 154.468750, 0.388880, 56
3.796875,134.453125, 41.875000, 154.171875, 0.304446, 24
65.421875,133.500000, 88.937500, 155.093750, 0.297584, 24

Preprocess time: 21782.973000 ms
Postprocess time: 44861.748000 ms

Visualized result saved in ./vis_result.jpg
[I 3/23 3: 8:50.615 ...r/src/driver/verisilicon_timvx/driver.cc:89 DestroyProgram] Destroy program for verisilicon_timvx.

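The test log is above. I added a "Postprocess time" printout to check the inference run time, and it is very slow: the whole example takes almost a minute to finish. I don't know which step I got wrong; I deployed everything by following the official tutorial.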

DefTruth commented 1 year ago

You could write a loop to measure it, for example warmup 20 and repeat 100, and collect the inference times. The first inference is usually slower.
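The warmup/repeat measurement suggested here can be sketched as a small helper. This is only an illustration under stated assumptions: `Benchmark`, `BenchStats`, and the `infer` callable are hypothetical names, not part of FastDeploy, and `infer` stands in for whatever single inference call the example makes.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <numeric>
#include <vector>

// Summary of the timed runs, in milliseconds.
struct BenchStats {
  double mean_ms;
  double min_ms;
  double max_ms;
};

// Calls `infer` `warmup` times without timing, then `repeat` times with
// timing, and returns mean/min/max latency of the timed calls only.
// `infer` is a hypothetical stand-in for one inference call.
BenchStats Benchmark(const std::function<void()>& infer, int warmup = 20,
                     int repeat = 100) {
  assert(warmup >= 0 && repeat > 0);
  for (int i = 0; i < warmup; ++i) infer();  // discard first-run overhead

  std::vector<double> ms;
  ms.reserve(repeat);
  for (int i = 0; i < repeat; ++i) {
    auto t0 = std::chrono::steady_clock::now();
    infer();
    auto t1 = std::chrono::steady_clock::now();
    ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
  }
  double sum = std::accumulate(ms.begin(), ms.end(), 0.0);
  return {sum / ms.size(), *std::min_element(ms.begin(), ms.end()),
          *std::max_element(ms.begin(), ms.end())};
}
```

Passing a lambda that wraps only the prediction call keeps image loading and visualization out of the measurement, so the reported mean reflects steady-state inference rather than one-time setup.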

16lwzheng commented 1 year ago

Preprocess time: 347.448000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:20.110 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58794 us
[3 3/23 6:56:20.121 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 75316 us
Preprocess time: 347.680000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:20.457 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58654 us
[3 3/23 6:56:20.468 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 74938 us
Preprocess time: 346.606000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:20.804 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58663 us
[3 3/23 6:56:20.816 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 76359 us
Preprocess time: 347.260000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:21.153 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58754 us
[3 3/23 6:56:21.166 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 76760 us
Preprocess time: 350.354000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:21.503 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 59709 us
[3 3/23 6:56:21.515 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 77896 us
Preprocess time: 348.635000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:21.850 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58644 us
[3 3/23 6:56:21.862 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 75740 us
Preprocess time: 347.640000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:22.198 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58844 us
[3 3/23 6:56:22.209 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 75207 us
Preprocess time: 346.896000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:22.545 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58569 us
[3 3/23 6:56:22.557 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 75530 us
Preprocess time: 346.378000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:22.892 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58806 us
[3 3/23 6:56:22.904 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 76556 us
Preprocess time: 347.785000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:23.239 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58516 us
[3 3/23 6:56:23.251 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 75502 us
Preprocess time: 345.799000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:23.585 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58626 us
[3 3/23 6:56:23.599 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 77895 us
Preprocess time: 347.877000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:23.934 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 59112 us
[3 3/23 6:56:23.945 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 75732 us
Preprocess time: 348.837000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:24.286 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 59149 us
[3 3/23 6:56:24.297 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 75350 us
Preprocess time: 350.275000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:24.633 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58723 us
[3 3/23 6:56:24.647 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 78683 us
Preprocess time: 349.131000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:24.982 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58657 us
[3 3/23 6:56:24.993 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 75305 us
Preprocess time: 346.633000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:25.328 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58747 us
[3 3/23 6:56:25.340 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 75857 us
Preprocess time: 348.879000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:25.678 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 59322 us
[3 3/23 6:56:25.690 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 76882 us
Preprocess time: 347.191000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:26. 25 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58687 us
[3 3/23 6:56:26. 36 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 74896 us
Preprocess time: 348.030000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:26.373 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58632 us
[3 3/23 6:56:26.385 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 76085 us
Preprocess time: 348.515000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:26.721 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58794 us
[3 3/23 6:56:26.734 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 77664 us
Preprocess time: 348.599000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:27. 70 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58699 us
[3 3/23 6:56:27. 82 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 75679 us
Preprocess time: 348.175000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:27.418 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 58650 us
[3 3/23 6:56:27.431 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 76620 us
Preprocess time: 346.909000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:27.766 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 59384 us
[3 3/23 6:56:27.779 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 77642 us
Preprocess time: 349.944000 ms
D [_check_swapped_tensors:114]Check swapped tensors
[3 3/23 6:56:28.116 ...r/src/driver/verisilicon_timvx/engine.cc:258 Execute] Process cost 59297 us
[3 3/23 6:56:28.129 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 77735 us

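I wrapped inference in a loop and kept running it; each run takes a bit over 300 ms. Is that normal? With the plain Paddle Lite example it takes only about 20 ms.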
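The log in this comment carries two kinds of "Process cost" entries per iteration, one from verisilicon_timvx/engine.cc (around 59 ms) and one from nnadapter/engine.cc (around 75 to 79 ms), next to the poster's own timer of roughly 347 ms. A small parser can average the two backend figures for comparison with that timer. This is a rough sketch: the function and struct names are hypothetical, and the regular expression is an assumption based only on the log lines quoted in this thread.

```cpp
#include <cassert>
#include <map>
#include <regex>
#include <string>

// Accumulated "Process cost" entries for one source file.
struct CostStats {
  long long total_us = 0;
  int count = 0;
  double MeanMs() const {
    assert(count > 0);
    return total_us / 1000.0 / count;
  }
};

// Groups "Process cost N us" entries by the file that logged them, keyed as
// "verisilicon_timvx/engine.cc" or "nnadapter/engine.cc". Works on both
// line-separated and flattened log text.
std::map<std::string, CostStats> SummarizeProcessCost(const std::string& log) {
  static const std::regex kEntry(
      R"((\w+/engine\.cc):\d+ Execute\] Process cost (\d+) us)");
  std::map<std::string, CostStats> stats;
  for (std::sregex_iterator it(log.begin(), log.end(), kEntry), end;
       it != end; ++it) {
    CostStats& s = stats[(*it)[1].str()];
    s.total_us += std::stoll((*it)[2].str());
    ++s.count;
  }
  return stats;
}
```

If the averaged backend cost stays far below the wall-clock time per iteration, the remaining time is being spent outside the two logged `Execute` calls, which narrows down where to look next.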

16lwzheng commented 1 year ago

I'm using the model provided on the official site:
wget https://bj.bcebos.com/fastdeploy/models/yolov5s_ptq_model.tar.gz

yeliang2258 commented 1 year ago

Did you run it over adb or over ssh? Did you use the run_with_adb.sh script provided by FastDeploy?

16lwzheng commented 1 year ago

I ran it directly on the board, executing the commands from the run_with_adb.sh script by hand on the board.

yeliang2258 commented 1 year ago

Did you set the following environment variables?
export VIV_VX_ENABLE_GRAPH_TRANSFORM=-pcq:1
export VIV_VX_SET_PER_CHANNEL_ENTROPY=100
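Because the demo is a C++ program, one option is to check these two variables from inside the process before the model is created, so a missing export is caught at startup. The sketch below is only illustrative: `MismatchedEnv` and `kTimVxEnv` are hypothetical names, and the expected values are simply the ones quoted in this comment.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Returns the names of variables in `expected` that are unset, or whose
// value differs from the expected one, in the current process environment.
std::vector<std::string> MismatchedEnv(
    const std::vector<std::pair<std::string, std::string>>& expected) {
  std::vector<std::string> bad;
  for (const auto& kv : expected) {
    assert(!kv.first.empty());  // a variable name must not be empty
    const char* actual = std::getenv(kv.first.c_str());
    if (actual == nullptr || kv.second != actual) bad.push_back(kv.first);
  }
  return bad;
}

// The two variables and values quoted in this thread.
const std::vector<std::pair<std::string, std::string>> kTimVxEnv = {
    {"VIV_VX_ENABLE_GRAPH_TRANSFORM", "-pcq:1"},
    {"VIV_VX_SET_PER_CHANNEL_ENTROPY", "100"},
};
```

Calling `MismatchedEnv(kTimVxEnv)` at the top of the program and printing any returned names gives the same information as `echo $VAR` in the shell, but for the exact process that runs inference.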

16lwzheng commented 1 year ago

> Did you set the following environment variables?
> export VIV_VX_ENABLE_GRAPH_TRANSFORM=-pcq:1
> export VIV_VX_SET_PER_CHANNEL_ENTROPY=100

root@firefly:/home/firefly/FastDeploy/examples/vision/detection/yolov5/rv1126/cpp/build/install# echo $VIV_VX_ENABLE_GRAPH_TRANSFORM
-pcq:1
root@firefly:/home/firefly/FastDeploy/examples/vision/detection/yolov5/rv1126/cpp/build/install# echo $VIV_VX_SET_PER_CHANNEL_ENTROPY
100

They are already set.

yeliang2258 commented 1 year ago

On RV1126, FastDeploy runs on the Paddle Lite backend, so the speed should be the same. Could you try swapping in the model provided by Paddle Lite and test with that?

16lwzheng commented 1 year ago

auto model_file = model_dir + sep + "model.pdmodel";
auto params_file = model_dir + sep + "model.pdiparams";
auto subgraph_file = model_dir + sep + "subgraph.txt";
I'm not very familiar with this model format in FastDeploy; with Paddle Lite I use .nb models directly. Could you provide a test model I can run?

yeliang2258 commented 1 year ago

https://www.paddlepaddle.org.cn/lite/v2.12/demo_guides/verisilicon_timvx.html
For this YOLOv5 model, the figure Paddle Lite reports is 200 ms. You can download the model here: https://paddlelite-demo.bj.bcebos.com/models/yolov5s_int8_640_per_channel.tar.gz

16lwzheng commented 1 year ago

That figure looks rather slow and doesn't really meet our business requirements. Can the model be converted to something like int8?

yeliang2258 commented 1 year ago

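What you are running is already the int8-quantized model. See whether PicoDet meets your requirements; you can also compress the model further with methods such as pruning and distillation.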

16lwzheng commented 1 year ago

[3 3/23 10:23:29. 45 ...le-Lite/lite/kernels/nnadapter/engine.cc:248 Execute] Process cost 76681 us
Preprocess time: 1176.785000 ms
DetectionResult: [xmin, ymin, xmax, ymax, score, label_id]
With the model above, the run time is 1176 ms, and that is after 100 warmup runs. It looks like inference is not using the NPU, only the CPU.

16lwzheng commented 1 year ago

[3 3/23 10:21:31.957 ...r/src/driver/verisilicon_timvx/engine.cc:191 Build] Build the tim-vx graph success.
[I 3/23 10:23:30.766 ...r/src/driver/verisilicon_timvx/driver.cc:89 DestroyProgram] Destroy program for verisilicon_timvx.

erroot commented 1 year ago

I reproduced this with the FastDeploy yolov5 rv1126 demo. Why can't I reach the 150 ms level shown in the video tutorial? I don't know whether it is a difference in precision; with the rv1126 toolkit, yolov5s at 960*960 resolution takes only 100 ms.
[I 9/24 13:55: 4. 37 ...re/optimizer/mir/generate_program_pass.h:41 GenProgram] insts.size: 1
[INFO] fastdeploy/runtime/runtime.cc(354)::CreateLiteBackend Runtime initialized with Backend::PDLITE in Device::TIMVX.
[I 9/24 13:55: 4.270 ...r/src/driver/verisilicon_timvx/engine.cc:45 Context] properties:
[I 9/24 13:55: 4.271 ...r/src/driver/verisilicon_timvx/engine.cc:57 Context] bn_fusion_max_allowed_quant_scale_deviation: -1
[W 9/24 13:55: 4.271 ...ter/nnadapter/src/runtime/compilation.cc:334 Finish] Warning: Failed to create a program, No model and cache is provided.
[W 9/24 13:55: 4.271 ...le-Lite/lite/kernels/nnadapter/engine.cc:149 LoadFromCache] Warning: Build model failed(3) !
[W 9/24 13:55: 4.420 ...nnadapter/nnadapter/src/runtime/model.cc:86 GetSupportedOperations] Warning: Failed to get the supported operations for device 'verisilicon_timvx', because the HAL interface 'validate_program' is not implemented!
[W 9/24 13:55: 4.420 ...kernels/nnadapter/converter/converter.cc:171 Apply] Warning: Failed to get the supported operations for the selected devices, one or more of the selected devices are not supported!
[I 9/24 13:55: 4.420 ...r/src/driver/verisilicon_timvx/driver.cc:70 CreateProgram] Create program for verisilicon_timvx.
Predict 0 run time is: 43.7804s
Predict 1 run time is: 0.269886s
Predict 2 run time is: 0.272425s
Predict 3 run time is: 0.273191s
Predict 4 run time is: 0.270363s
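In the timings at the end of this log, the first call takes 43.7804 s while the next four take about 0.27 s each, so a fair comparison against the 150 ms figure should exclude the first call. The sketch below parses the "Predict N run time is" lines and averages everything after the first run; for the four later values quoted here that mean is about 0.2715 s. The function names are hypothetical and the line format is assumed from this log only.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Extracts the values from "Predict <i> run time is: <t>s" entries, in
// seconds, in the order they appear in the log text.
std::vector<double> ParsePredictTimes(const std::string& log) {
  static const std::regex kLine(R"(Predict \d+ run time is: ([0-9.]+)s)");
  std::vector<double> out;
  for (std::sregex_iterator it(log.begin(), log.end(), kLine), end;
       it != end; ++it) {
    out.push_back(std::stod((*it)[1].str()));
  }
  return out;
}

// Mean of all runs after the first `skip` runs, in seconds.
double SteadyStateMean(const std::vector<double>& t, std::size_t skip = 1) {
  assert(t.size() > skip);
  double sum = 0.0;
  for (std::size_t i = skip; i < t.size(); ++i) sum += t[i];
  return sum / (t.size() - skip);
}
```

Reporting the first-run time and the steady-state mean separately makes it clear that the 43.78 s is one-time startup cost, and that the remaining gap to investigate is roughly 270 ms against the 150 ms shown in the tutorial.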