alibaba / MNN

MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba
http://www.mnn.zone/

With multiple sessions (multiple algorithms) in CPU compute scenarios, the internal thread pool performs 50% worse than the openMP thread pool #2854

Open zhenjing opened 1 month ago

zhenjing commented 1 month ago

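MNN internal thread pool: `MNN_THREAD_POOL_MAX_TASKS 2` limits the thread pool to at most 2 algorithms at a time.

Shortcomings of the original MNN thread pool:
1) Concurrent tasks are always assigned to the low-numbered threads, so the high-numbered threads never do any computation.
2) When concurrent tasks are computed, all threads are woken up. Because the threads use spin locks, every thread beyond the concurrency count spins without doing useful work.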
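To make the second point concrete, here is a minimal sketch of the alternative behaviour: idle workers sleep on a condition variable, only as many sleepers as there are tasks get a wake-up, and whichever worker is free takes the next task. `BlockingPool` and `parallelFor` are hypothetical names for illustration; this is not MNN's thread pool code, and a mutex plus condition variable is only one way to avoid spinning.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical pool for illustration; this is not MNN's ThreadPool.
// Idle workers block on a condition variable instead of spinning, and
// any free worker may take the next task.
class BlockingPool {
public:
    explicit BlockingPool(int workers) {
        assert(workers > 0);
        for (int i = 0; i < workers; ++i) {
            mThreads.emplace_back([this] { workerLoop(); });
        }
    }

    ~BlockingPool() {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mStop = true;
        }
        mWake.notify_all();
        for (auto& t : mThreads) t.join();
    }

    // Calls fn(0) ... fn(count - 1) on the pool and returns when all are done.
    void parallelFor(int count, const std::function<void(int)>& fn) {
        if (count <= 0) return;
        int remaining = count;  // guarded by mMutex
        std::condition_variable done;
        {
            std::lock_guard<std::mutex> lock(mMutex);
            for (int i = 0; i < count; ++i) {
                mTasks.push([this, &fn, &remaining, &done, i] {
                    fn(i);
                    std::lock_guard<std::mutex> finished(mMutex);
                    if (--remaining == 0) done.notify_one();
                });
            }
        }
        // Send at most `count` wake-ups; the other sleepers stay asleep.
        for (int i = 0; i < count; ++i) mWake.notify_one();
        std::unique_lock<std::mutex> lock(mMutex);
        done.wait(lock, [&remaining] { return remaining == 0; });
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mMutex);
                mWake.wait(lock, [this] { return mStop || !mTasks.empty(); });
                if (mTasks.empty()) return;  // stopping and queue drained
                task = std::move(mTasks.front());
                mTasks.pop();
            }
            task();  // run outside the lock
        }
    }

    std::mutex mMutex;
    std::condition_variable mWake;
    std::queue<std::function<void()>> mTasks;
    bool mStop = false;
    std::vector<std::thread> mThreads;
};
```

Blocking wake-ups trade some latency per task for lower CPU usage, so for very short tasks a spinning design can still be faster when only one or two instances are running.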

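I tested the yolov8n.mnn model through the Session API, with all handles sharing one input image, and compared the internal thread pool against the openMP thread pool.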

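Test conclusions:
1. The openMP thread pool performs best: with 6 algorithm handles, throughput is 90 and average latency is 65 ms. That is about 80% higher than the MNN internal thread pool's maximum throughput of 51. With the same 6 handles, the MNN internal thread pool's average latency is 176 ms.
2. The multiple sub-thread-pool scheme reaches a throughput of 73 with an average latency of 95 ms at 7 handles, about 40% higher than the MNN internal thread pool's maximum throughput of 51. With the same 7 handles, the MNN internal thread pool's average latency is 193 ms.
3. For the yolov8 model, the compute time of a concurrent task depends on the number of handles: with 1 handle a concurrent task takes 0.1 ms on average, and with 15 handles it takes 0.6 ms.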
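The percentages quoted above can be checked with a few lines of arithmetic. The sketch below computes the relative gain from the stated throughputs, and also estimates throughput from handle count and average latency. The second function rests on an assumption that the thread does not state: each handle completes one inference per average latency, and throughput is counted per second. Both function names are hypothetical.

```cpp
#include <cassert>

// Relative gain of `candidate` over `baseline`, in percent.
double gainPercent(double candidate, double baseline) {
    assert(baseline > 0.0);
    return (candidate / baseline - 1.0) * 100.0;
}

// Assumed relation (not stated in the thread): every handle completes one
// inference per average latency, so throughput is handles * 1000 / avgMs.
double estimatedThroughput(int handles, double avgMs) {
    assert(handles > 0 && avgMs > 0.0);
    return handles * 1000.0 / avgMs;
}
```

With the figures above, `gainPercent(90, 51)` is about 76 and `gainPercent(73, 51)` is about 43, so the quoted 80% and 40% are rounded. `estimatedThroughput(6, 65)` is about 92 and `estimatedThroughput(7, 95)` is about 74, in line with the reported 90 and 73.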

Why is the internal thread pool the default thread pool option?

zhenjing commented 1 month ago

MNN build option: MNN_ARM82. Tested yolov8n.mnn through the Session API, with a shared input image.

Test data on Kunpeng 920:

Internal thread pool:
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1           | 1         | 302.21   | 299.49   | 307.17   | 101    | 100    | 119           | 99            | 3.31       |
| 1           | 2         | 163.55   | 160.16   | 328.65   | 3080   | 2989   | 122           | 29            | 6.11       |
| 1           | 3         | 133.54   | 131.09   | 153.16   | 3057   | 3026   | 119           | 28            | 7.49       |
| 1           | 4         | 104.17   | 103.15   | 108.18   | 3011   | 2988   | 115           | 29            | 9.60       |
| 1           | 5         | 89.63    | 88.48    | 100.63   | 2982   | 2958   | 122           | 29            | 11.15      |
| 1           | 6         | 84.53    | 83.17    | 144.62   | 2927   | 2897   | 115           | 30            | 11.82      |
| 1           | 7         | 73.92    | 72.51    | 121.87   | 2933   | 2901   | 116           | 31            | 13.52      |
| 1           | 8         | 68.15    | 66.50    | 98.63    | 2910   | 2875   | 115           | 31            | 14.67      |
| 1           | 9         | 69.13    | 67.43    | 78.88    | 2873   | 2841   | 120           | 32            | 14.45      |
| 1           | 10        | 65.51    | 62.18    | 109.37   | 2861   | 2816   | 115           | 33            | 15.25      |
| 1           | 11        | 66.33    | 64.78    | 104.34   | 2842   | 2810   | 114           | 33            | 15.06      |
| 1           | 12        | 64.23    | 62.21    | 104.18   | 2829   | 2794   | 112           | 34            | 15.56      |
| 1           | 13        | 60.36    | 57.25    | 91.09    | 2862   | 2807   | 125           | 34            | 16.55      |
| 1           | 14        | 57.46    | 53.00    | 86.55    | 2825   | 2764   | 125           | 34            | 17.39      |
| 1           | 15        | 57.06    | 54.58    | 79.47    | 2810   | 2763   | 119           | 35            | 17.51      |
| 1           | 16        | 53.06    | 51.46    | 84.32    | 2791   | 2757   | 117           | 35            | 18.83      |
| 1           | 17        | 56.50    | 53.74    | 75.90    | 2826   | 2773   | 124           | 36            | 17.69      |
| 1           | 18        | 58.38    | 53.27    | 105.62   | 2826   | 2785   | 124           | 36            | 17.11      |
| 1           | 19        | 58.60    | 56.58    | 78.08    | 2798   | 2767   | 117           | 36            | 17.05      |
| 1           | 20        | 57.50    | 55.27    | 100.63   | 2794   | 2757   | 117           | 37            | 17.38      |
| 1           | 21        | 55.87    | 54.27    | 66.56    | 2778   | 2744   | 115           | 37            | 17.89      |
| 1           | 22        | 55.07    | 53.46    | 69.17    | 2773   | 2738   | 117           | 37            | 18.15      |
| 1           | 23        | 51.62    | 49.48    | 70.91    | 2771   | 2732   | 116           | 37            | 19.36      |
| 1           | 24        | 50.34    | 48.55    | 97.78    | 2762   | 2720   | 117           | 38            | 19.85      |
| 1           | 25        | 49.48    | 47.93    | 81.96    | 2756   | 2717   | 117           | 38            | 20.20      |
| 1           | 26        | 49.00    | 47.16    | 62.74    | 2748   | 2706   | 116           | 38            | 20.40      |
| 1           | 27        | 45.74    | 43.68    | 83.62    | 2736   | 2701   | 116           | 39            | 21.85      |
| 1           | 28        | 44.77    | 42.97    | 91.32    | 2726   | 2692   | 119           | 39            | 22.32      |
| 1           | 29        | 43.99    | 42.50    | 83.11    | 2715   | 2681   | 119           | 39            | 22.71      |
| 1           | 30        | 43.39    | 41.85    | 56.30    | 2703   | 2666   | 117           | 39            | 23.03      |
| 1           | 31        | 42.99    | 41.19    | 60.30    | 2709   | 2670   | 117           | 40            | 23.24      |
| 1           | 32        | 44.53    | 41.08    | 78.13    | 2691   | 2658   | 125           | 40            | 22.43      |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+

+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1           | 4         | 92.00    | 91.06    | 106.52   | 2993   | 2954   | 118           | 27            | 10.86      |
| 2           | 4         | 309.84   | 91.53    | 550.67   | 1843   | 1102   | 203           | 29            | 6.43       |
| 3           | 4         | 311.59   | 91.94    | 518.50   | 1389   | 1201   | 287           | 31            | 9.59       |
| 4           | 4         | 316.22   | 93.55    | 520.95   | 1577   | 1186   | 371           | 33            | 12.58      |
| 5           | 4         | 320.57   | 92.92    | 534.82   | 1677   | 1227   | 455           | 34            | 15.44      |
| 6           | 4         | 322.07   | 95.07    | 482.06   | 1653   | 1377   | 539           | 36            | 18.54      |
| 7           | 4         | 327.64   | 94.54    | 587.61   | 1420   | 1364   | 620           | 38            | 21.24      |
| 8           | 4         | 338.09   | 136.99   | 542.82   | 2291   | 1820   | 707           | 39            | 23.29      |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+

+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1           | 2         | 161.70   | 159.80   | 180.62   | 3081   | 3031   | 668           | 33            | 6.18       |
| 2           | 2         | 307.31   | 159.33   | 468.00   | 2619   | 1820   | 584           | 32            | 6.48       |
| 3           | 2         | 311.39   | 160.73   | 467.74   | 2181   | 1872   | 584           | 32            | 9.58       |
| 4           | 2         | 313.20   | 161.55   | 457.25   | 2222   | 1893   | 584           | 32            | 12.72      |
| 5           | 2         | 322.94   | 160.54   | 471.00   | 2519   | 2034   | 586           | 32            | 15.35      |
| 6           | 2         | 336.08   | 161.32   | 657.59   | 2477   | 2194   | 586           | 32            | 17.53      |
| 7           | 2         | 341.25   | 163.78   | 696.82   | 2339   | 2000   | 636           | 33            | 20.35      |
| 8           | 2         | 338.36   | 163.89   | 472.60   | 2075   | 1976   | 709           | 33            | 23.28      |
| 9           | 2         | 349.00   | 174.64   | 659.26   | 2844   | 2261   | 793           | 34            | 25.46      |
| 10          | 2         | 358.57   | 168.42   | 659.79   | 2969   | 2197   | 876           | 34            | 27.12      |
| 11          | 2         | 368.14   | 167.50   | 749.81   | 2890   | 2392   | 958           | 35            | 29.22      |
| 12          | 2         | 379.36   | 169.97   | 700.04   | 2888   | 2167   | 1040          | 35            | 30.91      |
| 13          | 2         | 391.24   | 168.41   | 1212.57  | 2845   | 2205   | 1119          | 36            | 32.12      |
| 14          | 2         | 431.56   | 165.49   | 1474.95  | 2701   | 2385   | 1208          | 36            | 30.71      |
| 15          | 2         | 398.22   | 175.14   | 859.35   | 3111   | 2128   | 1295          | 37            | 36.19      |
| 16          | 2         | 429.22   | 173.18   | 1837.27  | 3070   | 2747   | 1379          | 37            | 36.14      |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+

openMP thread pool:
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1           | 1         | 302.93   | 298.91   | 322.26   | 101    | 100    | 120           | 99            | 3.30       |
| 1           | 2         | 167.06   | 165.37   | 170.31   | 189    | 185    | 121           | 99            | 5.98       |
| 1           | 3         | 126.03   | 122.64   | 157.94   | 262    | 256    | 129           | 99            | 7.93       |
| 1           | 4         | 100.89   | 96.24    | 127.97   | 335    | 324    | 143           | 98            | 9.91       |
| 1           | 5         | 109.94   | 87.66    | 170.26   | 357    | 345    | 152           | 98            | 9.09       |
| 1           | 6         | 98.22    | 78.51    | 112.05   | 403    | 385    | 149           | 97            | 10.18      |
| 1           | 7         | 96.84    | 94.59    | 127.38   | 414    | 409    | 158           | 97            | 10.32      |
| 1           | 8         | 90.21    | 70.48    | 109.52   | 451    | 446    | 152           | 97            | 11.08      |
| 1           | 9         | 93.04    | 90.55    | 125.76   | 442    | 432    | 163           | 96            | 10.74      |
| 1           | 10        | 86.95    | 85.53    | 98.38    | 470    | 463    | 161           | 96            | 11.50      |
| 1           | 11        | 95.68    | 93.90    | 130.65   | 450    | 446    | 174           | 95            | 10.45      |
| 1           | 12        | 91.33    | 84.73    | 115.54   | 485    | 466    | 182           | 95            | 10.94      |
| 1           | 13        | 92.05    | 90.02    | 131.43   | 488    | 479    | 182           | 94            | 10.86      |
| 1           | 14        | 90.80    | 88.67    | 131.91   | 492    | 484    | 186           | 94            | 11.01      |
| 1           | 15        | 88.31    | 86.11    | 126.08   | 516    | 506    | 190           | 94            | 11.32      |
| 1           | 16        | 86.64    | 83.26    | 126.51   | 537    | 527    | 184           | 93            | 11.54      |
| 1           | 17        | 98.74    | 94.63    | 200.81   | 506    | 492    | 186           | 93            | 10.12      |
| 1           | 18        | 98.36    | 94.58    | 131.67   | 506    | 496    | 185           | 93            | 10.16      |
| 1           | 19        | 99.25    | 96.09    | 129.91   | 511    | 498    | 201           | 92            | 10.07      |
| 1           | 20        | 102.10   | 97.94    | 127.53   | 505    | 497    | 196           | 92            | 9.79       |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+

+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1           | 4         | 98.97    | 98.10    | 105.72   | 365    | 336    | 195           | 92            | 10.10      |
| 2           | 4         | 111.05   | 99.38    | 125.07   | 638    | 614    | 282           | 93            | 17.95      |
| 3           | 4         | 125.96   | 102.84   | 146.91   | 931    | 924    | 350           | 93            | 23.78      |
| 4           | 4         | 129.78   | 106.55   | 184.36   | 1230   | 1136   | 438           | 93            | 30.68      |
| 5           | 4         | 121.97   | 103.38   | 160.35   | 1603   | 1451   | 456           | 97            | 40.64      |
| 6           | 4         | 129.07   | 105.19   | 161.60   | 1875   | 1592   | 586           | 97            | 46.11      |
| 7           | 4         | 132.34   | 108.91   | 180.89   | 2182   | 2021   | 695           | 97            | 52.15      |
| 8           | 4         | 134.77   | 109.17   | 212.90   | 2461   | 1950   | 815           | 97            | 58.69      |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+

+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1           | 2         | 167.21   | 165.35   | 171.93   | 991    | 220    | 498           | 93            | 5.98       |
| 2           | 2         | 189.64   | 169.03   | 211.02   | 376    | 355    | 504           | 93            | 10.53      |
| 3           | 2         | 184.25   | 166.91   | 214.19   | 554    | 535    | 507           | 93            | 16.24      |
| 4           | 2         | 194.37   | 170.20   | 223.24   | 743    | 692    | 507           | 94            | 20.39      |
| 5           | 2         | 191.27   | 172.01   | 225.70   | 924    | 850    | 563           | 94            | 26.02      |
| 6           | 2         | 202.51   | 169.93   | 225.91   | 1096   | 990    | 653           | 94            | 29.38      |
| 7           | 2         | 208.42   | 173.66   | 239.81   | 1268   | 1144   | 723           | 94            | 33.17      |
| 8           | 2         | 198.68   | 172.22   | 239.96   | 1469   | 1294   | 806           | 94            | 39.72      |
| 9           | 2         | 202.30   | 174.92   | 235.47   | 1643   | 1366   | 896           | 94            | 43.80      |
| 10          | 2         | 207.00   | 180.15   | 252.07   | 1839   | 1526   | 1010          | 97            | 47.65      |
| 11          | 2         | 206.64   | 178.10   | 267.44   | 2026   | 1519   | 1137          | 97            | 52.55      |
| 12          | 2         | 210.54   | 180.79   | 266.52   | 2204   | 1881   | 1224          | 97            | 56.25      |
| 13          | 2         | 218.15   | 186.11   | 261.90   | 2349   | 1654   | 1306          | 97            | 58.45      |
| 14          | 2         | 220.81   | 183.64   | 277.31   | 2555   | 1935   | 1380          | 97            | 62.21      |
| 15          | 2         | 234.77   | 194.08   | 282.70   | 2697   | 1904   | 1456          | 97            | 62.86      |
| 16          | 2         | 231.98   | 184.98   | 328.54   | 2828   | 2338   | 1530          | 97            | 67.51      |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+

zhenjing commented 1 month ago

Can the internal thread pool be optimized to perform as well as, or better than, the openMP thread pool?

zhenjing commented 1 month ago

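I tested replacing the task queue with a lock-free queue, https://github.com/cameron314/concurrentqueue. The data is below.

Thread pool:
1. Multiple sub-thread-pools, each with 4 concurrent threads; the task queue is a lock-free blocking queue.
2. Each algorithm handle is bound to a specific thread pool.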

+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1           | 4         | 52.11    | 43.25    | 91.51    | 310    | 299    | 127           | 95            | 19.18      |
| 2           | 4         | 73.77    | 45.71    | 145.01   | 587    | 565    | 220           | 95            | 27.06      |
| 3           | 4         | 76.27    | 48.91    | 132.90   | 859    | 811    | 313           | 95            | 39.23      |
| 4           | 4         | 81.51    | 48.48    | 139.89   | 1136   | 1070   | 406           | 95            | 48.85      |
| 5           | 4         | 87.28    | 53.61    | 136.57   | 1440   | 1295   | 499           | 95            | 57.06      |
| 6           | 4         | 90.79    | 54.60    | 150.20   | 1694   | 1509   | 592           | 95            | 65.70      |
| 7           | 4         | 94.99    | 56.78    | 154.01   | 1901   | 1691   | 685           | 95            | 73.11      |
| 8           | 4         | 119.03   | 59.02    | 213.66   | 2135   | 1873   | 778           | 95            | 66.81      |
| 9           | 4         | 145.49   | 66.99    | 245.22   | 2397   | 2065   | 871           | 95            | 61.40      |
| 10          | 4         | 166.84   | 70.86    | 305.54   | 2639   | 2322   | 965           | 95            | 59.49      |
| 11          | 4         | 184.74   | 80.89    | 350.71   | 2878   | 2659   | 1057          | 95            | 59.23      |
| 12          | 4         | 229.78   | 81.02    | 329.65   | 3128   | 2964   | 1150          | 95            | 52.00      |
| 13          | 4         | 230.75   | 103.11   | 336.60   | 3344   | 2895   | 1243          | 95            | 56.08      |
| 14          | 4         | 268.87   | 93.64    | 431.35   | 3531   | 2992   | 1335          | 95            | 51.82      |
| 15          | 4         | 262.94   | 72.90    | 2979.28  | 3176   | 2362   | 1429          | 95            | 53.75      |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
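The multi-sub-pool layout measured above (several small pools, each handle bound to one of them) could be sketched roughly as follows. This is not the code that produced the numbers: `SubPool`, `BoundPools`, and the modulo binding rule are hypothetical, and a mutex-guarded `std::queue` stands in for the lock-free blocking queue so that the example has no external dependency.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One sub-pool: a fixed set of workers draining a private task queue.
// A mutex-guarded std::queue stands in for the lock-free blocking queue.
class SubPool {
public:
    explicit SubPool(int workers) {
        assert(workers > 0);
        for (int i = 0; i < workers; ++i) {
            mThreads.emplace_back([this] { run(); });
        }
    }
    ~SubPool() {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mStop = true;
        }
        mWake.notify_all();
        for (auto& t : mThreads) t.join();
    }
    // Queues fn and returns a future that becomes ready once fn has run.
    std::future<void> submit(std::function<void()> fn) {
        auto task = std::make_shared<std::packaged_task<void()>>(std::move(fn));
        std::future<void> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mTasks.push([task] { (*task)(); });
        }
        mWake.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mMutex);
                mWake.wait(lock, [this] { return mStop || !mTasks.empty(); });
                if (mTasks.empty()) return;  // stopping and queue drained
                task = std::move(mTasks.front());
                mTasks.pop();
            }
            task();
        }
    }
    std::mutex mMutex;
    std::condition_variable mWake;
    std::queue<std::function<void()>> mTasks;
    bool mStop = false;
    std::vector<std::thread> mThreads;
};

// Hypothetical binding rule: handle i always uses sub-pool i % poolCount,
// so handles bound to different pools never share a queue or a worker.
class BoundPools {
public:
    BoundPools(int poolCount, int workersPerPool) {
        assert(poolCount > 0);
        for (int i = 0; i < poolCount; ++i) {
            mPools.push_back(std::unique_ptr<SubPool>(new SubPool(workersPerPool)));
        }
    }
    std::size_t indexFor(std::size_t handleId) const { return handleId % mPools.size(); }
    SubPool& poolFor(std::size_t handleId) { return *mPools[indexFor(handleId)]; }

private:
    std::vector<std::unique_ptr<SubPool>> mPools;
};
```

A handle would call `pools.poolFor(handleId).submit(...)` for each of its parallel chunks and wait on the returned futures. Handles mapped to the same sub-pool still contend with each other, so the pool count has to be chosen against the core count.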

Thread pool:
1. A single thread pool; the task queue is a lock-free blocking queue.

+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1           | 4         | 72.99    | 55.21    | 112.32   | 304    | 294    | 127           | 94            | 13.70      |
| 2           | 4         | 84.10    | 77.08    | 123.71   | 591    | 574    | 221           | 94            | 23.73      |
| 3           | 4         | 93.11    | 84.60    | 152.11   | 887    | 858    | 313           | 94            | 32.19      |
| 4           | 4         | 97.14    | 86.18    | 135.40   | 1173   | 1123   | 407           | 94            | 41.06      |
| 5           | 4         | 105.34   | 74.98    | 143.07   | 1458   | 1354   | 500           | 95            | 47.25      |
| 6           | 4         | 113.20   | 98.05    | 143.42   | 1745   | 1631   | 593           | 95            | 52.93      |
| 7           | 4         | 145.99   | 112.52   | 189.18   | 2033   | 1854   | 685           | 95            | 47.81      |
| 8           | 4         | 164.72   | 134.42   | 201.22   | 2288   | 2230   | 778           | 95            | 48.45      |
| 9           | 4         | 167.02   | 118.06   | 213.49   | 2529   | 2265   | 863           | 95            | 53.73      |
| 10          | 4         | 196.22   | 101.76   | 256.27   | 2821   | 2496   | 963           | 95            | 50.69      |
| 11          | 4         | 241.54   | 143.93   | 293.38   | 3027   | 2773   | 1057          | 95            | 45.31      |
| 12          | 4         | 251.07   | 190.37   | 297.89   | 3262   | 2887   | 1150          | 95            | 47.72      |
| 13          | 4         | 283.10   | 168.88   | 371.04   | 3471   | 3106   | 1242          | 95            | 45.72      |
| 14          | 4         | 310.47   | 214.52   | 369.59   | 3665   | 3216   | 1335          | 95            | 44.92      |
| 15          | 4         | 359.19   | 172.35   | 458.20   | 3826   | 3775   | 1429          | 95            | 41.52      |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+

jxt1234 commented 3 weeks ago

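The internal thread pool is mainly designed to accelerate a small number of instances (fewer than 2). With many instances, we generally recommend running every instance single-threaded and using a thread pool on the outside; you can also switch to openmp yourself.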
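A rough sketch of that recommendation: parallelism lives outside the framework, one worker thread per instance, and each instance runs its inference single-threaded. `InferFn` and `runInstances` are hypothetical stand-ins so the example stays self-contained. In MNN the callback would presumably wrap a session created with `numThread` set to 1 in its `ScheduleConfig`, which is an assumption and is not shown here.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for one single-threaded inference call on a given instance.
// Returns an arbitrary per-call value so results can be checked.
using InferFn = std::function<int(int instanceId, int frame)>;

// Runs `instances` independent workers. Each worker owns one instance and
// processes `frames` inputs one after another, accumulating the results.
std::vector<int> runInstances(int instances, int frames, const InferFn& infer) {
    assert(instances > 0 && frames >= 0);
    std::vector<int> totals(instances, 0);
    std::vector<std::thread> workers;
    for (int id = 0; id < instances; ++id) {
        workers.emplace_back([&totals, &infer, frames, id] {
            for (int f = 0; f < frames; ++f) {
                totals[id] += infer(id, f);  // each worker writes only its own slot
            }
        });
    }
    for (auto& w : workers) w.join();
    return totals;
}
```

Because no instance spawns threads of its own, the total thread count equals the instance count, which avoids the oversubscription that shows up when many instances each request a multi-threaded pool.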