PaddlePaddle / benchmark


Optimize inference performance of ERNIE on P40 GPU #165

Open Xreki opened 5 years ago

Xreki commented 5 years ago

Owners

@Xreki @zhaoyuchen2018

Initial performance

NVIDIA has open-sourced Faster Transformer, its BERT inference solution.

Xreki commented 5 years ago

Profiling results

------------------------->     Profiling Report     <-------------------------

Place: All
Time unit: ms
Sorted by total time in descending order in the same thread

Event                               Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::mul                        370666      32332.5     12646.616698 (0.391142) 19685.930916 (0.608858) 0.026226    95.3523     0.0872283   0.341415
thread0::elementwise_add            561008      16561.7     14064.876774 (0.849241) 2496.820884 (0.150759)  0.017456    9.10075     0.0295213   0.174883
thread0::softmax                    60108       12311.9     4285.399399 (0.348069)  8026.520592 (0.651931)  0.18358     6.97239     0.20483     0.130007
thread0::matmul                     125225      7621.9      4999.121478 (0.655889)  2622.776986 (0.344111)  0.026823    6.93527     0.0608656   0.0804832
thread0::layer_norm                 125225      6890.62     3494.377954 (0.507121)  3396.237224 (0.492879)  0.046834    6.54613     0.0550259   0.0727612
thread0::transpose2                 240432      6706.17     4940.736832 (0.736745)  1765.436248 (0.263255)  0.022342    6.87583     0.0278922   0.0708136
thread0::dropout                    190342      4844.09     4380.131806 (0.904222)  463.956297 (0.095778)   0.019757    1.91613     0.0254494   0.051151
thread0::reshape2                   240432      1970.44     1970.427168 (0.999993)  0.013691 (0.000007)     0.005761    0.758181    0.00819542  0.0208068
thread0::scale                      70126       1559.56     1413.281218 (0.906202)  146.283474 (0.093798)   0.017739    6.40694     0.0222395   0.0164682
thread0::relu                       60108       1533.83     1155.052049 (0.753053)  378.773663 (0.246947)   0.020695    1.91584     0.0255178   0.0161964
thread0::stack                      5009        670.06      620.412429 (0.925906)   49.647450 (0.074094)    0.104312    3.43668     0.133771    0.00707548
thread0::lookup_table               15027       499.303     409.325315 (0.819793)   89.978163 (0.180207)    0.025125    0.087498    0.0332271   0.00527238
thread0::fetch                      5009        260.127     260.068816 (0.999777)   0.057970 (0.000223)     0.044476    1.01272     0.0519319   0.0027468
thread0::TensorCopy:GPU->CPU        5009        210.123     210.076980 (0.999782)   0.045869 (0.000218)     0.036519    0.996315    0.0419491   0.00221879
thread0::GpuMemcpyAsync:CPU->GPU    20036       204.476     173.512282 (0.848571)   30.963505 (0.151429)    0.007       1.77937     0.0102054   0.00215916
thread0::GpuMemcpySync:GPU->CPU     5009        203.349     194.875273 (0.958327)   8.474119 (0.041673)     0.035868    0.330009    0.0405968   0.00214726
thread0::slice                      5009        153.525     136.187490 (0.887070)   17.337539 (0.112930)    0.025069    1.98001     0.0306498   0.00162114
thread0::tanh                       5009        110.729     100.944808 (0.911638)   9.784189 (0.088362)     0.017612    1.92101     0.022106    0.00116924
thread0::feed                       20036       57.2692     57.263221 (0.999896)    0.005970 (0.000104)     0.00156     0.032812    0.00285831  0.000604732
Xreki commented 5 years ago

Timeline analysis

Xreki commented 5 years ago

Optimization plan


Xreki commented 5 years ago

Some related work from Intel

Xreki commented 5 years ago

Determine whether the reshape operations in the model are necessary and whether they can be removed. Many of the reshapes appear to have identical input and output shapes. @Xreki


I0814 10:13:53.083331 11085 operator.cc:170] CPUPlace Op(reshape2), inputs:{Shape[], ShapeTensor[], X[fc_0.tmp_1:float[1, 128, 768]({})]}, outputs:{Out[fc_0.tmp_1:float[1, 128, 768]({})], XShape[reshape2_0.tmp_0:[-1]({{}})]}.
I0814 10:13:53.083364 11085 operator.cc:993] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN]
I0814 10:13:53.083400 11085 tensor_util.cu:28] TensorCopy 1, 128, 12, 64 from CPUPlace to CPUPlace
I0814 10:13:53.083407 11085 tensor_util.cu:43] Skip copy the same data async from CPUPlace to CPUPlace
I0814 10:13:53.083420 11085 operator.cc:190] CPUPlace Op(reshape2), inputs:{Shape[], ShapeTensor[], X[fc_0.tmp_1:float[1, 128, 12, 64]({})]}, outputs:{Out[fc_0.tmp_1:float[1, 128, 12, 64]({})], XShape[reshape2_0.tmp_0:[-1]({{}})]}.

Conclusion: the reshape cannot simply be removed. The log shows that the op actually changes the shape from [1, 128, 768] to [1, 128, 12, 64]; the pre-run log only prints the not-yet-updated output shape.
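A minimal NumPy illustration of why this reshape is not a no-op: the element count is preserved (so no data is copied), but the rank changes, and the transpose2 ops that follow depend on the new 4-D layout.

```python
import numpy as np

# The log above shows reshape2 rewriting fc_0.tmp_1 from [1, 128, 768]
# to [1, 128, 12, 64]: the hidden dimension is split into 12 heads of
# size 64 for the attention computation that follows.
x = np.zeros((1, 128, 768), dtype=np.float32)
y = x.reshape(1, 128, 12, 64)

same_elements = x.size == y.size   # True: no data is moved
same_shape = x.shape == y.shape    # False: the rank changes
```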

Xreki commented 5 years ago

Summary of optimization results

| Version | P40 time, total (ms) | Time excl. first run (ms) | Test date | PR | Description | Speedup |
|---------|----------------------|---------------------------|-----------|----|-------------|---------|
| 0 | 8.36 | - | 2019-08-14 | - | baseline | - |
| 1 (merged) | 8.22 | 8.14 | 2019-08-22 | https://github.com/PaddlePaddle/Paddle/pull/19297 | Remove dropout_op | 1.7% |
| 2 (merged) | 7.37 | 6.96 | 2019-08-22 | https://github.com/PaddlePaddle/Paddle/pull/19346 | Use cuDNN for softmax | 10% |
| 3 (merged) | 7.20 | 6.82 | 2019-09-10 | https://github.com/PaddlePaddle/Paddle/pull/19687 https://github.com/PaddlePaddle/Paddle/pull/19733 | Fuse the FC computation, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-529325915 | 2.1% |
| 4 (merged) | 6.93 | 6.55 | 2019-09-11 | https://github.com/PaddlePaddle/Paddle/pull/19776 | Fuse FC + elementwise_add + layer_norm, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-530245969 | 4% |
| 5 (merged) | 6.13 | 5.73 | 2019-09-19 | https://github.com/PaddlePaddle/Paddle/pull/20167 https://github.com/PaddlePaddle/Paddle/pull/20151 | Fuse multi-head attention, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-555797150 | 14% |
| 3 | 7.26 | 6.88 | 2019-09-03 | https://github.com/PaddlePaddle/Paddle/pull/19610 | Optimize stack_op, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-527316205 | 1.2% |
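Version 2 above routes softmax through cuDNN. Functionally, the op remains the usual numerically stable softmax over the last axis; a minimal NumPy reference (hypothetical helper name) is:

```python
import numpy as np

def softmax_last_axis(x):
    # Subtract the row max before exponentiating for numerical
    # stability; library implementations such as cuDNN's accurate
    # softmax do the same.
    m = x.max(axis=-1, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=-1, keepdims=True)

# Attention-score-shaped input: (batch, heads, seq, seq).
scores = np.random.randn(2, 12, 128, 128).astype(np.float32)
probs = softmax_last_axis(scores)
```

The speedup comes from the kernel, not the math: cuDNN's implementation is simply a much better-tuned version of this computation.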
Xreki commented 5 years ago

On top of version 1, fuse_elewise_add_act_pass was enabled for inference. It fuses the elementwise_add + relu computation and successfully matched 14 subgraphs, but overall performance regressed.

+ /paddle/build_paddle/build_docker_manylinux_cuda90/paddle/fluid/inference/tests/api/samples/ernie_tester --logtostderr --model_dir=/data/ernie/model --data=/data/ernie/seq128_data/test_ds --repeat=1 --warmup_steps=1 --use_gpu=true --use_analysis=true --print_outputs=false --profile=false
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ernie
[ RUN      ] ernie.profile
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [fuse_elewise_add_act_pass]
---  detected 14 subgraphs
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0828 08:49:37.436837  7631 ir_params_sync_among_devices_pass.cc:41] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I0828 08:49:37.726634  7631 analysis_predictor.cc:475] == optimize end ==
I0828 08:49:38.710613  7631 ernie_tester.cc:210] Load 5010 samples from /data/ernie/seq128_data/test_ds
W0828 08:49:39.109542  7631 device_context.cc:261] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 9.0
W0828 08:49:39.115780  7631 device_context.cc:269] device: 0, cuDNN Version: 7.3.
W0828 08:49:39.116070  7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.116091  7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.116104  7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.127769  7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.127822  7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.127832  7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.137923  7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.137972  7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.137981  7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.147920  7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.147967  7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.147977  7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.157951  7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.157996  7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.158005  7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
I0828 08:50:21.810214  7631 ernie_tester.cc:345] Run 5010 samples, average latency: 8.60096 ms per sample.
I0828 08:50:21.810283  7631 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 8.51942 ms per sample.
[       OK ] ernie.profile (46355 ms)
[----------] 1 test from ernie (46355 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (46356 ms total)
[  PASSED  ] 1 test.
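What the pass computes is straightforward; the point of fusion is one kernel launch and no intermediate tensor instead of two launches. A NumPy sketch of the fused semantics (hypothetical function name), which as the run above shows is not automatically a win end to end:

```python
import numpy as np

def add_relu_fused(x, y):
    # Single pass over the data: this is the computation
    # fuse_elewise_add_act_pass produces for elementwise_add + relu.
    return np.maximum(x + y, 0.0)

x = np.random.randn(4, 8).astype(np.float32)
y = np.random.randn(4, 8).astype(np.float32)

t = x + y                        # unfused: elementwise_add ...
reference = np.maximum(t, 0.0)   # ... then relu, via an intermediate
fused = add_relu_fused(x, y)
```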
zhaoyuchen2018 commented 5 years ago

Fused the fc, reshape, transpose, and scale ops of the three paths into a single path. P40 test results: (screenshot)

Graph after fusion: (screenshot). Graph before fusion: (screenshot)

Profiling results

Event                               Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::mul                        250450      30086.2     10700.311709 (0.355655) 19385.885160 (0.644345) 0.027581    88.3573     0.120129    0.39598
thread0::elementwise_add            440792      13893.1     11593.975031 (0.834514) 2299.111468 (0.165486)  0.019193    12.8221     0.0315185   0.182854
thread0::matmul                     125225      7716.83     5090.761228 (0.659696)  2626.072493 (0.340304)  0.02993     7.32385     0.0616237   0.101565
thread0::layer_norm                 125225      7091.46     3754.750949 (0.529475)  3336.704465 (0.470525)  0.048728    7.12371     0.0566297   0.0933344
thread0::softmax                    60108       4259.71     1991.152624 (0.467438)  2268.559610 (0.532562)  0.063841    9.24836     0.0708676   0.0560643
thread0::transpose2                 120216      4126.7      2756.571446 (0.667985)  1370.123630 (0.332015)  0.025956    9.50509     0.0343273   0.0543136
thread0::split                      60108       2998.5      1830.882725 (0.610599)  1167.620505 (0.389401)  0.043009    19.4484     0.0498853   0.0394649
thread0::relu                       60108       1620.75     1233.311364 (0.760953)  387.433865 (0.239047)   0.021955    9.30877     0.0269639   0.0213315
thread0::reshape2                   120216      1279.38     1279.361341 (0.999987)  0.016020 (0.000013)     0.008744    1.83394     0.0106423   0.0168386
thread0::stack                      5009        673.79      610.591735 (0.906205)   63.198156 (0.093795)    0.110177    11.1131     0.134516    0.0088681
thread0::lookup_table               15027       517.377     424.928699 (0.821314)   92.448017 (0.178686)    0.025679    6.76539     0.0344298   0.00680947
thread0::fetch                      5009        342.348     342.270593 (0.999773)   0.077819 (0.000227)     0.057001    1.4229      0.0683467   0.00450583
thread0::TensorCopy:GPU->CPU        5009        261.036     260.979443 (0.999784)   0.056311 (0.000216)     0.04429     0.284401    0.0521133   0.00343563
thread0::scale                      10018       253.319     237.712034 (0.938391)   15.606845 (0.061609)    0.020368    0.161032    0.0252864   0.00333406
thread0::GpuMemcpyAsync:CPU->GPU    20036       248.089     208.310791 (0.839661)   39.778466 (0.160339)    0.008811    1.84374     0.0123822   0.00326523
thread0::GpuMemcpySync:GPU->CPU     5009        241.886     232.381808 (0.960708)   9.504255 (0.039292)     0.040716    0.390521    0.0482903   0.00318359
thread0::slice                      5009        160.703     143.249188 (0.891391)   17.453703 (0.108609)    0.02768     0.07693     0.0320828   0.0021151
thread0::tanh                       5009        117.589     107.550767 (0.914632)   10.038396 (0.085368)    0.01952     0.884425    0.0234756   0.00154765
thread0::feed                       20036       90.2748     90.263520 (0.999875)    0.011311 (0.000125)     0.002949    0.964233    0.00450563  0.00118815
Xreki commented 5 years ago

stack and concat are functionally similar. Two approaches were tried:

Approach 1: replace stack with concat directly. This fails at runtime because stack and concat produce outputs of different dimensionality, so it is not feasible.

Approach 2: re-implement stack, using concat's implementation as a reference.

I0903 09:57:36.458448 29283 ernie_tester.cc:345] Run 5010 samples, average latency: 7.26373 ms per sample.
I0903 09:57:36.458526 29283 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.87974 ms per sample.

The results above are based on version 2.

I0920 09:15:41.505820 25415 ernie_tester.cc:345] Run 5010 samples, average latency: 6.92085 ms per sample.
I0920 09:15:41.505903 25415 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.51521 ms per sample.

The results above are based on version 4.
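The dimensionality mismatch that ruled out approach 1 is easy to demonstrate: stack inserts a new axis, while concat extends an existing one, so a drop-in substitution changes every downstream shape.

```python
import numpy as np

a = np.zeros((1, 128, 768))
b = np.zeros((1, 128, 768))
c = np.zeros((1, 128, 768))

stacked = np.stack([a, b, c], axis=0)             # inserts a new axis
concatenated = np.concatenate([a, b, c], axis=0)  # reuses an existing axis
```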

Xreki commented 5 years ago

Implemented a GPU kernel for fc: https://github.com/PaddlePaddle/Paddle/pull/19687. FC fusion: https://github.com/PaddlePaddle/Paddle/pull/19733

Place: All
Time unit: ms
Sorted by total time in descending order in the same thread

Event                               Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::fc                         370666      37927.8     16796.211849 (0.442847) 21131.552513 (0.557153) 0.038417    81.8839     0.102323    0.511083
thread0::matmul                     125225      7542.57     4853.428136 (0.643472)  2689.138129 (0.356528)  0.028322    6.91776     0.0602321   0.101637
thread0::layer_norm                 125225      6785.64     3458.361240 (0.509659)  3327.277901 (0.490341)  0.046506    6.78461     0.0541876   0.0914376
thread0::transpose2                 240432      6648.82     4993.952105 (0.751103)  1654.869939 (0.248897)  0.022683    6.97236     0.0276536   0.089594
thread0::elementwise_add            190342      4846.03     3903.958820 (0.805599)  942.072546 (0.194401)   0.018709    7.06821     0.0254596   0.0653011
thread0::softmax                    60108       4155.86     1873.948821 (0.450917)  2281.914897 (0.549083)  0.062806    7.22235     0.0691399   0.056001
thread0::reshape2                   240432      2117.24     2117.212819 (0.999986)  0.030675 (0.000014)     0.006925    2.54413     0.008806    0.0285302
thread0::scale                      70126       1639.13     1402.350356 (0.855547)  236.777127 (0.144453)   0.017914    3.11279     0.023374    0.0220875
thread0::stack                      5009        651.86      601.077905 (0.922097)   50.782113 (0.077903)    0.106259    4.12809     0.130138    0.00878393
thread0::lookup_table               15027       479.71      388.084297 (0.808999)   91.625230 (0.191001)    0.024586    0.085041    0.0319232   0.00646417
thread0::fetch                      5009        333.739     333.664085 (0.999776)   0.074696 (0.000224)     0.054763    2.98519     0.0666278   0.00449719
thread0::TensorCopy:GPU->CPU        5009        264.276     264.216676 (0.999777)   0.059003 (0.000223)     0.043246    2.97079     0.0527602   0.00356116
thread0::GpuMemcpySync:GPU->CPU     5009        244.643     235.804134 (0.963870)   8.839005 (0.036130)     0.04009     1.58306     0.0488407   0.00329661
thread0::GpuMemcpyAsync:CPU->GPU    20036       226.793     195.272127 (0.861014)   31.521131 (0.138986)    0.007882    1.17277     0.0113193   0.00305608
thread0::slice                      5009        147.886     130.195979 (0.880383)   17.689569 (0.119617)    0.025414    0.103978    0.029524    0.00199278
thread0::tanh                       5009        122.009     112.002519 (0.917989)   10.006004 (0.082011)    0.019717    0.094528    0.0243579   0.00164409
thread0::feed                       20036       76.5896     76.581383 (0.999892)    0.008248 (0.000108)     0.002877    0.308487    0.0038226   0.00103206
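In the profile above, the mul and elementwise_add rows have been replaced by a single fc row: the fused op folds the bias add into the GEMM. A NumPy sketch of the fused semantics:

```python
import numpy as np

def fc(x, w, b):
    # fc_fuse_pass replaces the mul + elementwise_add pair with one
    # fc op: a single GEMM with the bias applied in the same kernel.
    return x.reshape(-1, w.shape[0]) @ w + b

x = np.random.randn(1, 128, 768).astype(np.float32)
w = np.random.randn(768, 768).astype(np.float32)
b = np.random.randn(768).astype(np.float32)

t = x.reshape(-1, 768) @ w   # unfused: mul ...
unfused = t + b              # ... then elementwise_add
fused = fc(x, w, b)
```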

Xreki commented 5 years ago

fc + elementwise_add + layer_norm fusion

--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
---  detected 24 subgraphs
.
.
.
I0911 06:41:55.494792  7238 ernie_tester.cc:345] Run 5010 samples, average latency: 6.92849 ms per sample.
I0911 06:41:55.494873  7238 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.54723 ms per sample.
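The 24 subgraphs matched by fc_elementwise_layernorm_fuse_pass each collapse three ops into one. A NumPy sketch of the fused computation (hypothetical function names), i.e. the transformer residual-plus-normalization step:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def fused_fc_elementwise_layernorm(x, w, b, residual, gamma, beta):
    # One op in place of three: fc, the residual elementwise_add,
    # and layer_norm.
    return layer_norm(x @ w + b + residual, gamma, beta)

x = np.random.randn(128, 768).astype(np.float32)
w = np.random.randn(768, 768).astype(np.float32)
b = np.random.randn(768).astype(np.float32)
res = np.random.randn(128, 768).astype(np.float32)
gamma = np.ones(768, dtype=np.float32)
beta = np.zeros(768, dtype=np.float32)
out = fused_fc_elementwise_layernorm(x, w, b, res, gamma, beta)
```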
Xreki commented 5 years ago

Multi-head attention fusion

--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [cudnn_placement_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass]
I1120 01:40:10.299883  6510 graph_pattern_detector.cc:97] ---  detected 12 subgraphs
--- Running IR pass [fc_fuse_pass]
I1120 01:40:10.344569  6510 graph_pattern_detector.cc:97] ---  detected 12 subgraphs
I1120 01:40:10.357105  6510 graph_pattern_detector.cc:97] ---  detected 26 subgraphs
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
I1120 01:40:10.397455  6510 graph_pattern_detector.cc:97] ---  detected 24 subgraphs
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I1120 01:40:10.509421  6510 ir_params_sync_among_devices_pass.cc:41] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I1120 01:40:10.758424  6510 analysis_predictor.cc:474] ======= optimize end =======
I1120 01:40:11.764863  6510 ernie_tester.cc:210] Load 5010 samples from /data/ernie/seq128_data/test_ds
W1120 01:40:12.444176  6510 device_context.cc:236] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 9.0
.
.
.
I1120 01:40:43.862216  6510 ernie_tester.cc:345] Run 5010 samples, average latency: 6.40495 ms per sample.
I1120 01:40:43.862303  6510 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 5.74544 ms per sample.
------------------------->     Profiling Report     <-------------------------

Place: All
Time unit: ms
Sorted by total time in descending order in the same thread

Event                                      Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::fused_fc_elementwise_layernorm    115207      17021.4     5933.953144 (0.348617)  11087.450962 (0.651383) 0.09927     8.71612     0.147746    0.259676
thread0::multihead_matmul                  60108       11252.9     7224.504494 (0.642012)  4028.405127 (0.357988)  0.169878    6.57812     0.187212    0.171673
thread0::fc                                75135       10707.1     4357.612778 (0.406985)  6349.445604 (0.593015)  0.038466    85.6766     0.142504    0.163345
thread0::mul                               180324      9774.02     4994.708598 (0.511019)  4779.307233 (0.488981)  0.047804    9.68787     0.0542025   0.149111
thread0::fetch                             5009        4630.95     4629.392276 (0.999664)  1.554398 (0.000336)     0.055675    1.92049     0.924525    0.0706489
thread0::TensorCopy:GPU->CPU               5009        4573.44     4571.905707 (0.999664)  1.538373 (0.000336)     0.04468     1.90973     0.913045    0.0697717
thread0::GpuMemcpySync:GPU->CPU            5009        4553.78     4543.954556 (0.997843)  9.821141 (0.002157)     0.040962    1.9062      0.909119    0.0694716
thread0::stack                             5009        684.909     634.119137 (0.925844)   50.789981 (0.074156)    0.106366    2.25957     0.136736    0.0104489
thread0::layer_norm                        10018       549.522     290.654864 (0.528923)   258.867095 (0.471077)   0.048854    0.142508    0.0548535   0.00838342
thread0::lookup_table                      15027       479.099     392.001286 (0.818205)   87.097997 (0.181795)    0.024349    9.00457     0.0318826   0.00730906
thread0::elementwise_add                   15027       345.643     302.006333 (0.873752)   43.636814 (0.126248)    0.017625    3.1705      0.0230015   0.00527307
thread0::GpuMemcpyAsync:CPU->GPU           20036       229.869     198.181250 (0.862150)   31.687294 (0.137850)    0.008188    2.08116     0.0114728   0.00350684
thread0::scale                             10018       224.898     210.406303 (0.935565)   14.491370 (0.064435)    0.018757    0.143304    0.0224494   0.003431
thread0::slice                             5009        169.833     151.921321 (0.894535)   17.911378 (0.105465)    0.030184    0.072767    0.0339055   0.00259094
thread0::matmul                            5009        163.061     149.402692 (0.916239)   13.658130 (0.083761)    0.028052    0.956259    0.0325536   0.00248763
thread0::tanh                              5009        116.672     106.307159 (0.911166)   10.364377 (0.088834)    0.020194    0.067294    0.0232924   0.00177992
thread0::feed                              20036       71.6392     71.630728 (0.999882)    0.008476 (0.000118)     0.002808    0.034718    0.00357552  0.00109292
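The multihead_matmul rows above replace the long chain of reshape2/transpose2/scale/matmul/softmax ops visible in the earlier profiles. A NumPy sketch of the math the fused op covers (shapes assumed from the ERNIE base config: seq 128, hidden 768, 12 heads):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multihead_attention(q, k, v, num_heads=12):
    # The fused multihead_matmul op performs this whole computation in
    # one kernel; this sketch only shows the reference math.
    s, d = q.shape
    h = d // num_heads
    split = lambda t: t.reshape(s, num_heads, h).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)                 # (heads, s, h)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(h))  # (heads, s, s)
    return (attn @ v).transpose(1, 0, 2).reshape(s, d)     # back to (s, d)

q = np.random.randn(128, 768).astype(np.float32)
k = np.random.randn(128, 768).astype(np.float32)
v = np.random.randn(128, 768).astype(np.float32)
out = multihead_attention(q, k, v)
```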
Xreki commented 4 years ago

QA test results

Xreki commented 4 years ago

Phase-2 optimization work

| Version | P4 time (ms) | Date | PR | Description | Speedup |
|---------|--------------|------|----|-------------|---------|
| 0 | 10 | 2020-02-04 | - | baseline; CUDA 10 times are longer than CUDA 9 times | - |
| 1 | 9.5 | 2020-02-04 | https://github.com/PaddlePaddle/Paddle/pull/22424 https://github.com/PaddlePaddle/Paddle/pull/22486 | Fuse the 3 fc ops into multihead_matmul_op as well, merging 3 small GEMMs into one large GEMM | 5% |
| 2 | 8.4 | 2020-02-04 | https://github.com/PaddlePaddle/Paddle/pull/22424 https://github.com/PaddlePaddle/Paddle/pull/22494 | Fuse 3 lookup_table + elementwise_add + layer_norm into fused_embedding_eltwise_layernorm_op | 11.6% |
| 3 | 7.8 | 2020-02-05 | - | In the layer_norm computation of fused_fc_elementwise_layernorm_op, change the intra-block reduce from double to float | - |
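Version 1's GEMM merge is purely a batching trick: projecting to Q, K, and V with one GEMM over the concatenated weights produces the same values as three separate GEMMs, but with far better GPU utilization. A NumPy sketch:

```python
import numpy as np

x = np.random.randn(128, 768).astype(np.float32)
wq = np.random.randn(768, 768).astype(np.float32)
wk = np.random.randn(768, 768).astype(np.float32)
wv = np.random.randn(768, 768).astype(np.float32)

# Three small GEMMs, one per Q/K/V projection ...
q, k, v = x @ wq, x @ wk, x @ wv

# ... replaced by one large GEMM over the concatenated weights.
qkv = x @ np.concatenate([wq, wk, wv], axis=1)
q2, k2, v2 = np.split(qkv, 3, axis=1)
```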

Subgraph pattern matched by embedding_eltwise_layernorm_fuse_pass

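The matched pattern is ERNIE's input pipeline: three embedding lookups (word, position, sentence) summed and normalized. A NumPy sketch of the fused op's semantics, with toy sizes for illustration:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def fused_embedding_eltwise_layernorm(ids_list, tables, gamma, beta):
    # Sum the three lookup_table results, then layer_norm, all of
    # which the fused op performs in a single kernel.
    summed = sum(t[ids] for ids, t in zip(ids_list, tables))
    return layer_norm(summed, gamma, beta)

vocab, hidden, seq = 100, 8, 16
tables = [np.random.randn(vocab, hidden).astype(np.float32) for _ in range(3)]
ids_list = [np.random.randint(0, vocab, size=seq) for _ in range(3)]
gamma = np.ones(hidden, dtype=np.float32)
beta = np.zeros(hidden, dtype=np.float32)
out = fused_embedding_eltwise_layernorm(ids_list, tables, gamma, beta)
```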

Floating-point throughput of the P4
