Tencent / ncnn

ncnn is a high-performance neural network inference framework optimized for the mobile platform

benchncnn: GPU is slower than CPU #2070

Open qinb opened 4 years ago

qinb commented 4 years ago

Hello, I ran into some speed issues while benchmarking models with benchncnn.

Test platforms: Linux server, Snapdragon 845
ncnn version: 20200727
Build instructions followed: https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-linux-x86 and https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-android
Both Vulkan and ncnn were built.

Model speeds measured with benchncnn on the Linux server CPU:

./benchncnn 4 8
loop_count = 4
num_threads = 8
powersave = 0
gpu_device = -1
cooling_down = 1
squeezenet  min = 8.18  max = 8.46  avg = 8.34
squeezenet_int8  min = 44.71  max = 44.98  avg = 44.83
mobilenet  min = 8.87  max = 9.01  avg = 8.93
mobilenet_int8  min = 74.36  max = 74.71  avg = 74.55
mobilenet_v2  min = 8.80  max = 8.92  avg = 8.88
mobilenet_v3  min = 8.04  max = 8.21  avg = 8.13
shufflenet  min = 13.26  max = 13.48  avg = 13.34
shufflenet_v2  min = 8.65  max = 8.97  avg = 8.81
mnasnet  min = 8.80  max = 8.97  avg = 8.90
proxylessnasnet  min = 9.63  max = 9.68  avg = 9.65
efficientnet_b0  min = 12.54  max = 18.29  avg = 14.16
regnety_400m  min = 31.89  max = 32.09  avg = 31.99
blazeface  min = 4.31  max = 4.49  avg = 4.40
googlenet  min = 27.15  max = 27.25  avg = 27.19
googlenet_int8  min = 124.22  max = 124.68  avg = 124.43
resnet18  min = 23.90  max = 41.61  avg = 28.61
resnet18_int8  min = 56.79  max = 57.23  avg = 57.01
alexnet  min = 20.25  max = 20.30  avg = 20.27
vgg16  min = 72.27  max = 73.28  avg = 72.84
vgg16_int8  min = 245.54  max = 284.55  avg = 256.44
resnet50  min = 46.32  max = 49.89  avg = 47.43
resnet50_int8  min = 258.61  max = 302.29  avg = 270.11

Model speeds measured with benchncnn on the Linux server GPU:

./benchncnn 4 8 0 0 user_vulkan_compute
[0 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[0 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[0 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[1 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[1 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[1 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[2 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[2 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[2 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[3 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[3 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[3 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[4 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[4 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[4 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[5 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[5 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[5 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[6 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[6 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[6 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[7 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[7 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[7 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
loop_count = 4
num_threads = 8
powersave = 0
gpu_device = 0
cooling_down = 1
mnasnet_075_1  min = 0.01  max = 0.03  avg = 0.02
squeezenet  min = 5.05  max = 9.40  avg = 6.67
squeezenet_int8  min = 45.19  max = 45.58  avg = 45.40
mobilenet  min = 5.83  max = 8.32  avg = 6.69
mobilenet_int8  min = 74.67  max = 75.06  avg = 74.83
mobilenet_v2  min = 8.76  max = 12.21  avg = 10.38
mobilenet_v3  min = 12.07  max = 13.68  avg = 13.09
shufflenet  min = 6.23  max = 6.41  avg = 6.31
shufflenet_v2  min = 7.49  max = 26.27  avg = 16.05
mnasnet  min = 9.24  max = 12.31  avg = 10.17
proxylessnasnet  min = 8.65  max = 12.14  avg = 10.46
efficientnet_b0  min = 13.99  max = 15.68  avg = 14.98
regnety_400m  min = 9.20  max = 11.58  avg = 10.65
blazeface  min = 6.01  max = 6.12  avg = 6.06
googlenet  min = 13.36  max = 16.57  avg = 15.34
googlenet_int8  min = 124.65  max = 126.07  avg = 125.29
resnet18  min = 8.36  max = 10.43  avg = 9.70
resnet18_int8  min = 57.23  max = 57.77  avg = 57.45
alexnet  min = 9.91  max = 11.22  avg = 10.25
vgg16  min = 15.13  max = 17.91  avg = 16.38
vgg16_int8  min = 251.91  max = 281.56  avg = 266.15
resnet50  min = 13.68  max = 16.11  avg = 14.72
resnet50_int8  min = 260.55  max = 385.56  avg = 319.25

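Overall, the GPU is slower than or on par with the CPU, and in some cases about twice as slow, for example shufflenet_v2. Why do these results differ so much from the GPU/CPU speed comparison published by ncnn? @nihui @cook could you please help explain? Thanks!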

nihui commented 4 years ago

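The loop count is too low for a GPU benchmark; try setting it to 100. On a PC or server I also recommend running the test with cooldown turned off.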
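A minimal sketch of how that advice maps onto the command line, assuming the positional order that benchncnn echoes back in its output header (loop_count, num_threads, powersave, gpu_device, cooling_down). The helper name `benchncnn_argv` and its defaults are hypothetical, chosen only for illustration.

```python
def benchncnn_argv(loop_count=100, num_threads=8, powersave=0,
                   gpu_device=-1, cooling_down=0, binary="./benchncnn"):
    """Build the argument list for one benchncnn run.

    Assumption: the positional order matches the header benchncnn prints
    (loop_count, num_threads, powersave, gpu_device, cooling_down).
    """
    values = (loop_count, num_threads, powersave, gpu_device, cooling_down)
    return [binary] + [str(v) for v in values]


# gpu_device = -1 selects the CPU path, 0 selects the first Vulkan device.
cpu_cmd = benchncnn_argv(gpu_device=-1)
gpu_cmd = benchncnn_argv(gpu_device=0)

print(" ".join(cpu_cmd))
print(" ".join(gpu_cmd))
```

The resulting lists can be passed to `subprocess.run` as they are; keeping every argument explicit makes it harder to compare a CPU run and a GPU run that differ in loop count or cooldown setting.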

qinb commented 4 years ago

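> The loop count is too low for a GPU benchmark; try setting it to 100. On a PC or server I also recommend running the test with cooldown turned off.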

Thanks for the reply. I re-ran the benchmark following your suggestion. Results on Linux:

CPU:

./benchncnn 1000 8 0 -1 0
loop_count = 1000
num_threads = 8
powersave = 0
gpu_device = -1
cooling_down = 0
mnasnet_075_1  min = 1.79  max = 64.12  avg = 3.21
squeezenet  min = 7.92  max = 198.55  avg = 11.83
squeezenet_int8  min = 44.55  max = 263.16  avg = 71.62
mobilenet  min = 8.82  max = 204.38  avg = 14.27
mobilenet_int8  min = 74.37  max = 387.81  avg = 140.53
mobilenet_v2  min = 8.00  max = 135.77  avg = 12.72
mobilenet_v3  min = 7.19  max = 91.01  avg = 11.23
shufflenet  min = 10.35  max = 181.38  avg = 16.81
shufflenet_v2  min = 6.49  max = 96.28  avg = 9.47
mnasnet  min = 7.73  max = 166.82  avg = 12.00
proxylessnasnet  min = 8.42  max = 168.83  avg = 13.90
efficientnet_b0  min = 11.22  max = 117.04  avg = 18.09
regnety_400m  min = 25.36  max = 199.28  avg = 36.95
blazeface  min = 2.75  max = 206.53  avg = 4.30
googlenet  min = 23.78  max = 245.20  avg = 34.18
googlenet_int8  min = 119.44  max = 436.06  avg = 209.70
resnet18  min = 20.97  max = 229.74  avg = 30.29
resnet18_int8  min = 52.97  max = 352.52  avg = 97.98
alexnet  min = 20.51  max = 223.52  avg = 28.42
vgg16  min = 63.91  max = 315.63  avg = 102.57
vgg16_int8  min = 236.62  max = 597.59  avg = 395.14
resnet50  min = 40.71  max = 268.15  avg = 68.79
resnet50_int8  min = 268.45  max = 650.54  avg = 477.58

GPU:

./benchncnn 1000 8 0 0 0
[0 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[0 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[0 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[1 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[1 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[1 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[2 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[2 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[2 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[3 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[3 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[3 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[4 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[4 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[4 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[5 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[5 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[5 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[6 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[6 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[6 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[7 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[7 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[7 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
loop_count = 1000
num_threads = 8
powersave = 0
gpu_device = 0
cooling_down = 0
mnasnet_075_1  min = 8.84  max = 194.57  avg = 39.61
squeezenet  min = 7.13  max = 195.07  avg = 32.47
squeezenet_int8  min = 42.70  max = 294.39  avg = 72.15
mobilenet  min = 7.33  max = 136.11  avg = 32.82
mobilenet_int8  min = 73.48  max = 327.52  avg = 124.04
mobilenet_v2  min = 8.41  max = 217.73  avg = 39.64
mobilenet_v3  min = 8.99  max = 347.23  avg = 46.29
shufflenet  min = 7.80  max = 309.16  avg = 40.67
shufflenet_v2  min = 6.73  max = 387.88  avg = 45.32
mnasnet  min = 8.42  max = 218.01  avg = 36.34
proxylessnasnet  min = 8.40  max = 245.90  avg = 39.90
efficientnet_b0  min = 10.91  max = 457.91  avg = 55.53
regnety_400m  min = 9.49  max = 409.92  avg = 45.78
blazeface  min = 8.55  max = 247.95  avg = 41.98
googlenet  min = 11.69  max = 409.37  avg = 53.85
googlenet_int8  min = 119.76  max = 470.34  avg = 216.06
resnet18  min = 8.37  max = 235.66  avg = 39.50
resnet18_int8  min = 52.83  max = 300.64  avg = 98.56
alexnet  min = 9.03  max = 102.21  avg = 33.46
vgg16  min = 16.08  max = 183.23  avg = 59.31
vgg16_int8  min = 255.26  max = 652.32  avg = 404.95
resnet50  min = 11.73  max = 353.59  avg = 51.70
resnet50_int8  min = 271.83  max = 739.32  avg = 504.47

It looks like the conclusion is still the same: the GPU is slower than the CPU. Looking forward to your answer!

nihui commented 4 years ago

These timings are very unstable. Was another program competing for GPU resources while the benchmark was running?
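One rough way to spot this kind of instability is to compare the average against the minimum: if the average sits far above the best run, many iterations were slow, not just one outlier. The sketch below is only a heuristic; the function name `looks_contended` and the 2.0 threshold are assumptions, not anything benchncnn provides. The sample numbers are the GPU squeezenet and resnet18 rows from the 1000-loop run and from the 4-loop run in the first post.

```python
def looks_contended(min_ms, avg_ms, avg_over_min=2.0):
    """Flag a benchmark row whose average is far above its best time.

    The threshold is an arbitrary illustrative choice, not an ncnn setting.
    """
    return avg_ms / min_ms > avg_over_min


# (model, min, avg) rows copied from the GPU logs earlier in this thread.
rows_1000_loops = [("squeezenet", 7.13, 32.47), ("resnet18", 8.37, 39.50)]
rows_4_loops = [("squeezenet", 5.05, 6.67), ("resnet18", 8.36, 9.70)]

for label, rows in (("1000 loops", rows_1000_loops), ("4 loops", rows_4_loops)):
    for name, min_ms, avg_ms in rows:
        flag = "unstable" if looks_contended(min_ms, avg_ms) else "steady"
        print(f"{label:10s} {name:12s} avg/min = {avg_ms / min_ms:.2f} -> {flag}")
```

With these rows, both models are flagged in the 1000-loop run and neither is flagged in the 4-loop run, which matches the suspicion that something else was using the GPU during the longer test.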

qinb commented 4 years ago

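Sorry, I forgot that a background job was using the GPU. It has finished now, so I ran the benchmark again. Could you take a look at the new timings? Thanks.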

./benchncnn 1000 8 0 -1 0
loop_count = 1000
num_threads = 8
powersave = 0
gpu_device = -1
cooling_down = 0
mnasnet_075_1  min = 2.39  max = 7.43  avg = 2.51
squeezenet  min = 7.74  max = 24.00  avg = 8.01
squeezenet_int8  min = 43.79  max = 67.56  avg = 45.03
mobilenet  min = 8.86  max = 11.07  avg = 8.99
mobilenet_int8  min = 73.61  max = 105.05  avg = 74.65
mobilenet_v2  min = 8.74  max = 16.97  avg = 9.08
mobilenet_v3  min = 7.65  max = 20.16  avg = 7.90
shufflenet  min = 12.90  max = 14.62  avg = 13.13
shufflenet_v2  min = 8.17  max = 10.44  avg = 8.38
mnasnet  min = 8.60  max = 9.95  avg = 8.72
proxylessnasnet  min = 9.40  max = 11.96  avg = 9.74
efficientnet_b0  min = 12.31  max = 14.00  avg = 12.43
regnety_400m  min = 31.64  max = 49.67  avg = 31.94
blazeface  min = 3.29  max = 4.37  avg = 3.39
googlenet  min = 26.73  max = 64.05  avg = 27.29
googlenet_int8  min = 122.73  max = 200.78  avg = 125.63
resnet18  min = 22.69  max = 24.24  avg = 22.89
resnet18_int8  min = 56.64  max = 81.41  avg = 57.45
alexnet  min = 21.55  max = 22.79  avg = 21.63
vgg16  min = 63.53  max = 208.52  avg = 69.16
vgg16_int8  min = 240.76  max = 380.84  avg = 247.08
resnet50  min = 45.06  max = 63.16  avg = 45.50
resnet50_int8  min = 257.54  max = 377.70  avg = 260.97

./benchncnn 1000 8 0 0 0
[0 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[0 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[0 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[1 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[1 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[1 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[2 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[2 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[2 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[3 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[3 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[3 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[4 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[4 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[4 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[5 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[5 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[5 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[6 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[6 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[6 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
[7 TITAN Xp] queueC=2[8] queueG=0[16] queueT=1[1]
[7 TITAN Xp] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[7 TITAN Xp] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
loop_count = 1000
num_threads = 8
powersave = 0
gpu_device = 0
cooling_down = 0
mnasnet_075_1  min = 5.86  max = 18.05  avg = 8.64
squeezenet  min = 3.85  max = 14.30  avg = 6.18
squeezenet_int8  min = 44.15  max = 61.06  avg = 45.09
mobilenet  min = 4.61  max = 17.02  avg = 6.99
mobilenet_int8  min = 73.88  max = 111.23  avg = 75.46
mobilenet_v2  min = 6.00  max = 14.70  avg = 8.95
mobilenet_v3  min = 7.21  max = 17.62  avg = 10.54
shufflenet  min = 4.97  max = 18.81  avg = 7.75
shufflenet_v2  min = 6.90  max = 15.06  avg = 10.18
mnasnet  min = 6.68  max = 16.01  avg = 10.22
proxylessnasnet  min = 6.58  max = 15.79  avg = 9.66
efficientnet_b0  min = 7.48  max = 17.33  avg = 11.14
regnety_400m  min = 6.11  max = 18.44  avg = 9.80
blazeface  min = 3.56  max = 14.03  avg = 7.15
googlenet  min = 7.65  max = 16.96  avg = 11.25
googlenet_int8  min = 122.71  max = 163.65  avg = 126.10
resnet18  min = 5.07  max = 21.60  avg = 8.22
resnet18_int8  min = 56.06  max = 100.09  avg = 60.23
alexnet  min = 5.15  max = 15.68  avg = 7.55
vgg16  min = 11.84  max = 23.80  avg = 15.32
vgg16_int8  min = 242.02  max = 290.89  avg = 248.77
resnet50  min = 7.72  max = 17.52  avg = 11.21
resnet50_int8  min = 257.47  max = 379.46  avg = 263.37

Overall, on small models the GPU is not faster than the CPU.
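To put numbers on that comparison, here is a minimal sketch that parses a few rows copied from the two final logs above and prints the CPU avg / GPU avg ratio per model. The regular expression and the function names `parse_avg` and `speedup` are illustrative assumptions, not part of ncnn.

```python
import re

# Matches rows such as "resnet18 min = 5.07 max = 21.60 avg = 8.22".
ROW = re.compile(
    r"^\s*(\S+)\s+min\s*=\s*([\d.]+)\s+max\s*=\s*([\d.]+)\s+avg\s*=\s*([\d.]+)"
)


def parse_avg(log):
    """Map model name -> average time in ms for every benchmark row in log."""
    result = {}
    for line in log.splitlines():
        match = ROW.match(line)
        if match:
            result[match.group(1)] = float(match.group(4))
    return result


def speedup(cpu_log, gpu_log):
    """Return CPU avg / GPU avg per model; values above 1 mean the GPU is faster."""
    cpu, gpu = parse_avg(cpu_log), parse_avg(gpu_log)
    return {name: cpu[name] / gpu[name] for name in cpu if name in gpu}


CPU_LOG = """
squeezenet min = 7.74 max = 24.00 avg = 8.01
mobilenet_v3 min = 7.65 max = 20.16 avg = 7.90
blazeface min = 3.29 max = 4.37 avg = 3.39
resnet18 min = 22.69 max = 24.24 avg = 22.89
vgg16 min = 63.53 max = 208.52 avg = 69.16
"""

GPU_LOG = """
squeezenet min = 3.85 max = 14.30 avg = 6.18
mobilenet_v3 min = 7.21 max = 17.62 avg = 10.54
blazeface min = 3.56 max = 14.03 avg = 7.15
resnet18 min = 5.07 max = 21.60 avg = 8.22
vgg16 min = 11.84 max = 23.80 avg = 15.32
"""

ratios = speedup(CPU_LOG, GPU_LOG)
for name, ratio in ratios.items():
    print(f"{name:14s} CPU/GPU avg ratio = {ratio:.2f}")
```

On these rows the ratio is below 1 for blazeface and mobilenet_v3 and above 2 for resnet18 and vgg16, so the GPU only pulls ahead on the heavier networks in this sample.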

CsVeryLoveXieWenLi commented 4 months ago

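I came across this issue by chance while searching. Has it been resolved? If possible, please test again with a newer version, and close the issue if there is no problem.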