Xreki opened this issue 5 years ago
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::mul 370666 32332.5 12646.616698 (0.391142) 19685.930916 (0.608858) 0.026226 95.3523 0.0872283 0.341415
thread0::elementwise_add 561008 16561.7 14064.876774 (0.849241) 2496.820884 (0.150759) 0.017456 9.10075 0.0295213 0.174883
thread0::softmax 60108 12311.9 4285.399399 (0.348069) 8026.520592 (0.651931) 0.18358 6.97239 0.20483 0.130007
thread0::matmul 125225 7621.9 4999.121478 (0.655889) 2622.776986 (0.344111) 0.026823 6.93527 0.0608656 0.0804832
thread0::layer_norm 125225 6890.62 3494.377954 (0.507121) 3396.237224 (0.492879) 0.046834 6.54613 0.0550259 0.0727612
thread0::transpose2 240432 6706.17 4940.736832 (0.736745) 1765.436248 (0.263255) 0.022342 6.87583 0.0278922 0.0708136
thread0::dropout 190342 4844.09 4380.131806 (0.904222) 463.956297 (0.095778) 0.019757 1.91613 0.0254494 0.051151
thread0::reshape2 240432 1970.44 1970.427168 (0.999993) 0.013691 (0.000007) 0.005761 0.758181 0.00819542 0.0208068
thread0::scale 70126 1559.56 1413.281218 (0.906202) 146.283474 (0.093798) 0.017739 6.40694 0.0222395 0.0164682
thread0::relu 60108 1533.83 1155.052049 (0.753053) 378.773663 (0.246947) 0.020695 1.91584 0.0255178 0.0161964
thread0::stack 5009 670.06 620.412429 (0.925906) 49.647450 (0.074094) 0.104312 3.43668 0.133771 0.00707548
thread0::lookup_table 15027 499.303 409.325315 (0.819793) 89.978163 (0.180207) 0.025125 0.087498 0.0332271 0.00527238
thread0::fetch 5009 260.127 260.068816 (0.999777) 0.057970 (0.000223) 0.044476 1.01272 0.0519319 0.0027468
thread0::TensorCopy:GPU->CPU 5009 210.123 210.076980 (0.999782) 0.045869 (0.000218) 0.036519 0.996315 0.0419491 0.00221879
thread0::GpuMemcpyAsync:CPU->GPU 20036 204.476 173.512282 (0.848571) 30.963505 (0.151429) 0.007 1.77937 0.0102054 0.00215916
thread0::GpuMemcpySync:GPU->CPU 5009 203.349 194.875273 (0.958327) 8.474119 (0.041673) 0.035868 0.330009 0.0405968 0.00214726
thread0::slice 5009 153.525 136.187490 (0.887070) 17.337539 (0.112930) 0.025069 1.98001 0.0306498 0.00162114
thread0::tanh 5009 110.729 100.944808 (0.911638) 9.784189 (0.088362) 0.017612 1.92101 0.022106 0.00116924
thread0::feed 20036 57.2692 57.263221 (0.999896) 0.005970 (0.000104) 0.00156 0.032812 0.00285831 0.000604732
- Overall, the GPU is kept fairly busy.
- There is quite a bit of GPU idle time at the start.
- The stack op performs cuMalloc and cuFree, plus 2 CUDA stream synchronizations.
- There are many reshape and transpose ops.
- Does softmax use cuDNN?
Transpose(Reshape(X*W0+b0))
Transpose(Reshape(X*W1+b1))
Transpose(Reshape(X*W2+b2))*scale
-> Transpose(Reshape(X*(W2*scale)+(b2*scale)))
After this transform, the third branch has exactly the same structure as the first and second branches, so all three can be merged into a single branch: Transpose(Reshape(X*[W0, W1, W2*scale] + [b0, b1, b2*scale])), where [...] denotes concatenation along the output dimension (checked in the sketch below).
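A minimal numpy check of the two steps above (names are illustrative, not from the Paddle code): first, the trailing scale folds into W2 and b2; second, once all three branches share the same structure, the three small GEMMs can be replaced by one GEMM over concatenated weights:

```python
import numpy as np

np.random.seed(0)
X = np.random.rand(128, 768).astype(np.float32)
W0, W1, W2 = (np.random.rand(768, 768).astype(np.float32) for _ in range(3))
b0, b1, b2 = (np.random.rand(768).astype(np.float32) for _ in range(3))
scale = 0.125

# Step 1: fold the scale into the weight and bias of the third branch.
ref = (X @ W2 + b2) * scale
folded = X @ (W2 * scale) + b2 * scale
assert np.allclose(ref, folded, rtol=1e-4)

# Step 2: one large GEMM over concatenated weights replaces three small GEMMs.
W_cat = np.concatenate([W0, W1, W2 * scale], axis=1)   # [768, 3*768]
b_cat = np.concatenate([b0, b1, b2 * scale], axis=0)   # [3*768]
Y0, Y1, Y2 = np.split(X @ W_cat + b_cat, 3, axis=1)
assert np.allclose(Y0, X @ W0 + b0, rtol=1e-4)
assert np.allclose(Y1, X @ W1 + b1, rtol=1e-4)
assert np.allclose(Y2, folded, rtol=1e-4)
```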
- [ ] elementwise + softmax fusion, @zhaoyuchen2018
- [x] elementwise + layer_norm fusion, @Xreki
- [ ] LN + FC can be fused; after fusion, LN's gamma and beta can both be removed (a sketch follows this list)
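A minimal numpy sketch of the LN+FC item above (illustrative only, not the pass implementation): layer_norm's gamma and beta can be folded into the weight and bias of the FC that follows, so the standalone gamma/beta parameters can indeed be dropped:

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(128, 768)
gamma, beta = np.random.rand(768), np.random.rand(768)
W, b = np.random.rand(768, 768), np.random.rand(768)
eps = 1e-5

# layer_norm (with gamma/beta) followed by FC.
mu = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True)
norm = (x - mu) / np.sqrt(var + eps)
ref = (norm * gamma + beta) @ W + b

# gamma scales the rows of W, and beta @ W moves into the bias;
# the fused op then only needs the plain normalization.
W_fused = gamma[:, None] * W
b_fused = beta @ W + b
assert np.allclose(ref, norm @ W_fused + b_fused)
```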
I0814 10:13:53.083331 11085 operator.cc:170] CPUPlace Op(reshape2), inputs:{Shape[], ShapeTensor[], X[fc_0.tmp_1:float[1, 128, 768]({})]}, outputs:{Out[fc_0.tmp_1:float[1, 128, 768]({})], XShape[reshape2_0.tmp_0:[-1]({{}})]}.
I0814 10:13:53.083364 11085 operator.cc:993] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN]
I0814 10:13:53.083400 11085 tensor_util.cu:28] TensorCopy 1, 128, 12, 64 from CPUPlace to CPUPlace
I0814 10:13:53.083407 11085 tensor_util.cu:43] Skip copy the same data async from CPUPlace to CPUPlace
I0814 10:13:53.083420 11085 operator.cc:190] CPUPlace Op(reshape2), inputs:{Shape[], ShapeTensor[], X[fc_0.tmp_1:float[1, 128, 12, 64]({})]}, outputs:{Out[fc_0.tmp_1:float[1, 128, 12, 64]({})], XShape[reshape2_0.tmp_0:[-1]({{}})]}.
Version | P40 time (total, ms) | Time excl. first run (ms) | Test date | PR | Description | Speedup |
---|---|---|---|---|---|---|
0 | 8.36 | | 2019-08-14 | - | baseline | - |
1 (merged) | 8.22 | 8.14 | 2019-08-22 | https://github.com/PaddlePaddle/Paddle/pull/19297 | Remove dropout_op | 1.7% |
2 (merged) | 7.37 | 6.96 | 2019-08-22 | https://github.com/PaddlePaddle/Paddle/pull/19346 | Use cuDNN for softmax | 10% |
3 (merged) | 7.20 | 6.82 | 2019-09-10 | https://github.com/PaddlePaddle/Paddle/pull/19687 https://github.com/PaddlePaddle/Paddle/pull/19733 | Fuse the FC computation, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-529325915 | 2.1% |
4 (merged) | 6.93 | 6.55 | 2019-09-11 | https://github.com/PaddlePaddle/Paddle/pull/19776 | Fuse FC + elementwise_add + layer_norm (see the sketch after this table), see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-530245969 | 4% |
5 (merged) | 6.13 | 5.73 | 2019-09-19 | https://github.com/PaddlePaddle/Paddle/pull/20167 https://github.com/PaddlePaddle/Paddle/pull/20151 | Fuse multi-head attention, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-555797150 | 14% |
3 | 7.26 | 6.88 | 2019-09-03 | https://github.com/PaddlePaddle/Paddle/pull/19610 | Optimize stack_op, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-527316205 | 1.2% |
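A minimal numpy sketch of what the version-4 fusion in the table above computes (illustrative only, not the CUDA kernel): the FC output, the residual elementwise_add, and the layer_norm are evaluated as one op, avoiding the intermediate global-memory round trips:

```python
import numpy as np

np.random.seed(0)
x, residual = np.random.rand(128, 768), np.random.rand(128, 768)
W, b = np.random.rand(768, 768), np.random.rand(768)
gamma, beta, eps = np.random.rand(768), np.random.rand(768), 1e-5

def layer_norm(y):
    mu = y.mean(axis=-1, keepdims=True)
    var = y.var(axis=-1, keepdims=True)
    return (y - mu) / np.sqrt(var + eps) * gamma + beta

# Unfused: fc, elementwise_add, layer_norm as three separate ops.
unfused = layer_norm((x @ W + b) + residual)

# Fused: a single op covering GEMM + bias + residual add + layer_norm.
def fused_fc_elementwise_layernorm(x, W, b, residual):
    return layer_norm(x @ W + b + residual)

assert np.allclose(unfused, fused_fc_elementwise_layernorm(x, W, b, residual))
```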
On top of version 1, the inference config additionally enabled fuse_elewise_add_act_pass, which can fuse the elementwise_add + relu computation. It successfully matched 14 subgraphs, but overall performance dropped.
+ /paddle/build_paddle/build_docker_manylinux_cuda90/paddle/fluid/inference/tests/api/samples/ernie_tester --logtostderr --model_dir=/data/ernie/model --data=/data/ernie/seq128_data/test_ds --repeat=1 --warmup_steps=1 --use_gpu=true --use_analysis=true --print_outputs=false --profile=false
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ernie
[ RUN ] ernie.profile
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [fuse_elewise_add_act_pass]
--- detected 14 subgraphs
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0828 08:49:37.436837 7631 ir_params_sync_among_devices_pass.cc:41] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I0828 08:49:37.726634 7631 analysis_predictor.cc:475] == optimize end ==
I0828 08:49:38.710613 7631 ernie_tester.cc:210] Load 5010 samples from /data/ernie/seq128_data/test_ds
W0828 08:49:39.109542 7631 device_context.cc:261] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 9.0
W0828 08:49:39.115780 7631 device_context.cc:269] device: 0, cuDNN Version: 7.3.
W0828 08:49:39.116070 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.116091 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.116104 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.127769 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.127822 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.127832 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.137923 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.137972 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.137981 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.147920 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.147967 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.147977 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.157951 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.157996 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.158005 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
I0828 08:50:21.810214 7631 ernie_tester.cc:345] Run 5010 samples, average latency: 8.60096 ms per sample.
I0828 08:50:21.810283 7631 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 8.51942 ms per sample.
[ OK ] ernie.profile (46355 ms)
[----------] 1 test from ernie (46355 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (46356 ms total)
[ PASSED ] 1 test.
Fuse the fc, reshape, transpose, and scale of the three paths into a single path. Test results on the P40 are as follows:
The graph after fusion is shown below. Before fusion:
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::mul 250450 30086.2 10700.311709 (0.355655) 19385.885160 (0.644345) 0.027581 88.3573 0.120129 0.39598
thread0::elementwise_add 440792 13893.1 11593.975031 (0.834514) 2299.111468 (0.165486) 0.019193 12.8221 0.0315185 0.182854
thread0::matmul 125225 7716.83 5090.761228 (0.659696) 2626.072493 (0.340304) 0.02993 7.32385 0.0616237 0.101565
thread0::layer_norm 125225 7091.46 3754.750949 (0.529475) 3336.704465 (0.470525) 0.048728 7.12371 0.0566297 0.0933344
thread0::softmax 60108 4259.71 1991.152624 (0.467438) 2268.559610 (0.532562) 0.063841 9.24836 0.0708676 0.0560643
thread0::transpose2 120216 4126.7 2756.571446 (0.667985) 1370.123630 (0.332015) 0.025956 9.50509 0.0343273 0.0543136
thread0::split 60108 2998.5 1830.882725 (0.610599) 1167.620505 (0.389401) 0.043009 19.4484 0.0498853 0.0394649
thread0::relu 60108 1620.75 1233.311364 (0.760953) 387.433865 (0.239047) 0.021955 9.30877 0.0269639 0.0213315
thread0::reshape2 120216 1279.38 1279.361341 (0.999987) 0.016020 (0.000013) 0.008744 1.83394 0.0106423 0.0168386
thread0::stack 5009 673.79 610.591735 (0.906205) 63.198156 (0.093795) 0.110177 11.1131 0.134516 0.0088681
thread0::lookup_table 15027 517.377 424.928699 (0.821314) 92.448017 (0.178686) 0.025679 6.76539 0.0344298 0.00680947
thread0::fetch 5009 342.348 342.270593 (0.999773) 0.077819 (0.000227) 0.057001 1.4229 0.0683467 0.00450583
thread0::TensorCopy:GPU->CPU 5009 261.036 260.979443 (0.999784) 0.056311 (0.000216) 0.04429 0.284401 0.0521133 0.00343563
thread0::scale 10018 253.319 237.712034 (0.938391) 15.606845 (0.061609) 0.020368 0.161032 0.0252864 0.00333406
thread0::GpuMemcpyAsync:CPU->GPU 20036 248.089 208.310791 (0.839661) 39.778466 (0.160339) 0.008811 1.84374 0.0123822 0.00326523
thread0::GpuMemcpySync:GPU->CPU 5009 241.886 232.381808 (0.960708) 9.504255 (0.039292) 0.040716 0.390521 0.0482903 0.00318359
thread0::slice 5009 160.703 143.249188 (0.891391) 17.453703 (0.108609) 0.02768 0.07693 0.0320828 0.0021151
thread0::tanh 5009 117.589 107.550767 (0.914632) 10.038396 (0.085368) 0.01952 0.884425 0.0234756 0.00154765
thread0::feed 20036 90.2748 90.263520 (0.999875) 0.011311 (0.000125) 0.002949 0.964233 0.00450563 0.00118815
stack is functionally similar to concat. Attempt: use concat in place of stack. Running it produced an error; it turns out stack and concat produce outputs with different dimensions, so this is not feasible (see the numpy illustration after the logs below). Dimensions of the stack output:
I0820 09:17:21.934635 17390 operator.cc:168] CUDAPlace(0) Op(stack), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}.
I0820 09:17:21.934787 17390 operator.cc:188] CUDAPlace(0) Op(stack), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}.
.
.
.
I0820 09:17:21.935680 17390 operator.cc:168] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}, outputs:{Out[tmp_2:float[1, 12, 128, 128]({})]}.
I0820 09:17:21.935708 17390 operator.cc:188] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}, outputs:{Out[tmp_2:float[1, 12, 128, 128]({})]}.
Dimensions of the concat output:
I0903 05:49:33.774837 17211 operator.cc:165] CUDAPlace(0) Op(concat), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Out[stack_0.tmp_0:[0]({{}})]}.
I0903 05:49:33.774863 17211 operator.cc:988] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0903 05:49:33.774976 17211 operator.cc:185] CUDAPlace(0) Op(concat), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Out[stack_0.tmp_0:float[1, 1536, 128]({})]}.
.
.
.
I0903 05:49:33.775012 17211 operator.cc:165] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 1536, 128]({})]}, outputs:{Out[tmp_2:[0]({{}})]}.
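A minimal numpy illustration of the shape mismatch shown in the logs above: stacking twelve [1, 128, 128] tensors inserts a new axis, while concatenating them grows an existing axis, so the concat output can no longer be added to the [1, 12, 128, 128] operand of the following elementwise_add:

```python
import numpy as np

xs = [np.zeros((1, 128, 128), dtype=np.float32) for _ in range(12)]

print(np.stack(xs, axis=1).shape)        # (1, 12, 128, 128) -- what elementwise_add expects
print(np.concatenate(xs, axis=1).shape)  # (1, 1536, 128)    -- incompatible shape
```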
Reimplement stack, following the approach used in concat.
I0903 09:57:36.458448 29283 ernie_tester.cc:345] Run 5010 samples, average latency: 7.26373 ms per sample.
I0903 09:57:36.458526 29283 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.87974 ms per sample.
The above test results are based on version 2.
I0920 09:15:41.505820 25415 ernie_tester.cc:345] Run 5010 samples, average latency: 6.92085 ms per sample.
I0920 09:15:41.505903 25415 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.51521 ms per sample.
The above test results are based on version 4.
Implement a GPU kernel for fc: https://github.com/PaddlePaddle/Paddle/pull/19687; FC fusion: https://github.com/PaddlePaddle/Paddle/pull/19733
fc = mul + elementwise_add + relu; the main change is to fuse the elementwise_add and relu computations into fc (a sketch follows below).
Test result: 2.1% speedup.
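A minimal numpy sketch of what the fused fc computes (illustrative only; the actual work is the Paddle GPU kernel plus fc_fuse_pass): the separate mul, elementwise_add, and relu ops are equivalent to a single fc that applies the bias and activation in one pass:

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(128, 768).astype(np.float32)
W = np.random.rand(768, 3072).astype(np.float32)
b = np.random.rand(3072).astype(np.float32)

# Unfused: three ops and two intermediate tensors.
tmp = x @ W                    # mul
tmp = tmp + b                  # elementwise_add
unfused = np.maximum(tmp, 0)   # relu

# Fused fc: one op covering GEMM + bias + activation.
def fc(x, W, b, activation=None):
    out = x @ W + b
    return np.maximum(out, 0) if activation == "relu" else out

assert np.allclose(unfused, fc(x, W, b, activation="relu"))
```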
--- Running IR pass [fc_fuse_pass]
--- detected 12 subgraphs
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0909 06:52:40.161475 7826 fc_fuse_pass.cc:122] with_relu: 1, found_fc_count: 12
--- detected 62 subgraphs
I0909 06:52:40.214682 7826 fc_fuse_pass.cc:122] with_relu: 0, found_fc_count: 74
.
.
.
I0909 06:51:05.903086 7821 ernie_tester.cc:345] Run 5010 samples, average latency: 7.19529 ms per sample.
I0909 06:51:05.903168 7821 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.81763 ms per sample.
Profile
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::fc 370666 37927.8 16796.211849 (0.442847) 21131.552513 (0.557153) 0.038417 81.8839 0.102323 0.511083
thread0::matmul 125225 7542.57 4853.428136 (0.643472) 2689.138129 (0.356528) 0.028322 6.91776 0.0602321 0.101637
thread0::layer_norm 125225 6785.64 3458.361240 (0.509659) 3327.277901 (0.490341) 0.046506 6.78461 0.0541876 0.0914376
thread0::transpose2 240432 6648.82 4993.952105 (0.751103) 1654.869939 (0.248897) 0.022683 6.97236 0.0276536 0.089594
thread0::elementwise_add 190342 4846.03 3903.958820 (0.805599) 942.072546 (0.194401) 0.018709 7.06821 0.0254596 0.0653011
thread0::softmax 60108 4155.86 1873.948821 (0.450917) 2281.914897 (0.549083) 0.062806 7.22235 0.0691399 0.056001
thread0::reshape2 240432 2117.24 2117.212819 (0.999986) 0.030675 (0.000014) 0.006925 2.54413 0.008806 0.0285302
thread0::scale 70126 1639.13 1402.350356 (0.855547) 236.777127 (0.144453) 0.017914 3.11279 0.023374 0.0220875
thread0::stack 5009 651.86 601.077905 (0.922097) 50.782113 (0.077903) 0.106259 4.12809 0.130138 0.00878393
thread0::lookup_table 15027 479.71 388.084297 (0.808999) 91.625230 (0.191001) 0.024586 0.085041 0.0319232 0.00646417
thread0::fetch 5009 333.739 333.664085 (0.999776) 0.074696 (0.000224) 0.054763 2.98519 0.0666278 0.00449719
thread0::TensorCopy:GPU->CPU 5009 264.276 264.216676 (0.999777) 0.059003 (0.000223) 0.043246 2.97079 0.0527602 0.00356116
thread0::GpuMemcpySync:GPU->CPU 5009 244.643 235.804134 (0.963870) 8.839005 (0.036130) 0.04009 1.58306 0.0488407 0.00329661
thread0::GpuMemcpyAsync:CPU->GPU 20036 226.793 195.272127 (0.861014) 31.521131 (0.138986) 0.007882 1.17277 0.0113193 0.00305608
thread0::slice 5009 147.886 130.195979 (0.880383) 17.689569 (0.119617) 0.025414 0.103978 0.029524 0.00199278
thread0::tanh 5009 122.009 112.002519 (0.917989) 10.006004 (0.082011) 0.019717 0.094528 0.0243579 0.00164409
thread0::feed 20036 76.5896 76.581383 (0.999892) 0.008248 (0.000108) 0.002877 0.308487 0.0038226 0.00103206
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
--- detected 24 subgraphs
.
.
.
I0911 06:41:55.494792 7238 ernie_tester.cc:345] Run 5010 samples, average latency: 6.92849 ms per sample.
I0911 06:41:55.494873 7238 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.54723 ms per sample.
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::fc 255459 24269.7 12361.669096 (0.509346) 11908.043273 (0.490654) 0.039433 100.758 0.0950043 0.346056
thread0::fused_fc_elementwise_layernorm 115207 17699.9 6439.933305 (0.363841) 11259.921061 (0.636159) 0.103025 6.90034 0.153635 0.252378
thread0::matmul 125225 7670.31 4985.129880 (0.649925) 2685.183125 (0.350075) 0.029412 6.91239 0.0612522 0.109369
thread0::transpose2 240432 6889.05 5254.959850 (0.762799) 1634.087984 (0.237201) 0.023907 6.72677 0.0286528 0.0982293
thread0::softmax 60108 4204.46 1927.686603 (0.458486) 2276.771181 (0.541514) 0.064249 2.83378 0.0699484 0.0599504
thread0::reshape2 240432 2228.86 2228.844417 (0.999994) 0.014348 (0.000006) 0.007575 3.75591 0.00927023 0.0317808
thread0::elementwise_add 75135 2160.14 1641.503732 (0.759904) 518.641058 (0.240096) 0.019577 6.72642 0.0287502 0.030801
thread0::scale 70126 1736.05 1498.515875 (0.863178) 237.529504 (0.136822) 0.018853 6.66338 0.0247561 0.0247539
thread0::stack 5009 664.48 615.254962 (0.925920) 49.224601 (0.074080) 0.10776 3.887 0.132657 0.00947466
thread0::layer_norm 10018 592.675 324.577447 (0.547648) 268.097787 (0.452352) 0.050985 6.49622 0.059161 0.00845082
thread0::lookup_table 15027 499.198 412.769395 (0.826866) 86.428197 (0.173134) 0.025502 6.66372 0.03322 0.00711794
thread0::fetch 5009 362.799 362.676040 (0.999660) 0.123295 (0.000340) 0.05508 0.625043 0.0724295 0.00517307
thread0::TensorCopy:GPU->CPU 5009 297.886 297.782095 (0.999652) 0.103633 (0.000348) 0.043194 0.612426 0.0594701 0.00424748
thread0::GpuMemcpySync:GPU->CPU 5009 278.55 270.104589 (0.969679) 8.445858 (0.030321) 0.039548 0.609273 0.05561 0.00397179
thread0::GpuMemcpyAsync:CPU->GPU 20036 230.007 198.920144 (0.864846) 31.086396 (0.135154) 0.008387 0.708778 0.0114797 0.00327961
thread0::slice 5009 152.385 135.526127 (0.889365) 16.859146 (0.110635) 0.026901 0.06877 0.0304223 0.00217283
thread0::tanh 5009 122.456 112.432469 (0.918148) 10.023237 (0.081852) 0.020671 0.070096 0.0244471 0.00174607
thread0::feed 20036 73.4209 73.411569 (0.999873) 0.009355 (0.000127) 0.002908 0.042732 0.00366445 0.00104689
Before
After
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [cudnn_placement_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass]
I1120 01:40:10.299883 6510 graph_pattern_detector.cc:97] --- detected 12 subgraphs
--- Running IR pass [fc_fuse_pass]
I1120 01:40:10.344569 6510 graph_pattern_detector.cc:97] --- detected 12 subgraphs
I1120 01:40:10.357105 6510 graph_pattern_detector.cc:97] --- detected 26 subgraphs
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
I1120 01:40:10.397455 6510 graph_pattern_detector.cc:97] --- detected 24 subgraphs
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I1120 01:40:10.509421 6510 ir_params_sync_among_devices_pass.cc:41] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I1120 01:40:10.758424 6510 analysis_predictor.cc:474] ======= optimize end =======
I1120 01:40:11.764863 6510 ernie_tester.cc:210] Load 5010 samples from /data/ernie/seq128_data/test_ds
W1120 01:40:12.444176 6510 device_context.cc:236] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 9.0
.
.
.
I1120 01:40:43.862216 6510 ernie_tester.cc:345] Run 5010 samples, average latency: 6.40495 ms per sample.
I1120 01:40:43.862303 6510 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 5.74544 ms per sample.
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::fused_fc_elementwise_layernorm 115207 17021.4 5933.953144 (0.348617) 11087.450962 (0.651383) 0.09927 8.71612 0.147746 0.259676
thread0::multihead_matmul 60108 11252.9 7224.504494 (0.642012) 4028.405127 (0.357988) 0.169878 6.57812 0.187212 0.171673
thread0::fc 75135 10707.1 4357.612778 (0.406985) 6349.445604 (0.593015) 0.038466 85.6766 0.142504 0.163345
thread0::mul 180324 9774.02 4994.708598 (0.511019) 4779.307233 (0.488981) 0.047804 9.68787 0.0542025 0.149111
thread0::fetch 5009 4630.95 4629.392276 (0.999664) 1.554398 (0.000336) 0.055675 1.92049 0.924525 0.0706489
thread0::TensorCopy:GPU->CPU 5009 4573.44 4571.905707 (0.999664) 1.538373 (0.000336) 0.04468 1.90973 0.913045 0.0697717
thread0::GpuMemcpySync:GPU->CPU 5009 4553.78 4543.954556 (0.997843) 9.821141 (0.002157) 0.040962 1.9062 0.909119 0.0694716
thread0::stack 5009 684.909 634.119137 (0.925844) 50.789981 (0.074156) 0.106366 2.25957 0.136736 0.0104489
thread0::layer_norm 10018 549.522 290.654864 (0.528923) 258.867095 (0.471077) 0.048854 0.142508 0.0548535 0.00838342
thread0::lookup_table 15027 479.099 392.001286 (0.818205) 87.097997 (0.181795) 0.024349 9.00457 0.0318826 0.00730906
thread0::elementwise_add 15027 345.643 302.006333 (0.873752) 43.636814 (0.126248) 0.017625 3.1705 0.0230015 0.00527307
thread0::GpuMemcpyAsync:CPU->GPU 20036 229.869 198.181250 (0.862150) 31.687294 (0.137850) 0.008188 2.08116 0.0114728 0.00350684
thread0::scale 10018 224.898 210.406303 (0.935565) 14.491370 (0.064435) 0.018757 0.143304 0.0224494 0.003431
thread0::slice 5009 169.833 151.921321 (0.894535) 17.911378 (0.105465) 0.030184 0.072767 0.0339055 0.00259094
thread0::matmul 5009 163.061 149.402692 (0.916239) 13.658130 (0.083761) 0.028052 0.956259 0.0325536 0.00248763
thread0::tanh 5009 116.672 106.307159 (0.911166) 10.364377 (0.088834) 0.020194 0.067294 0.0232924 0.00177992
thread0::feed 20036 71.6392 71.630728 (0.999882) 0.008476 (0.000118) 0.002808 0.034718 0.00357552 0.00109292
Performance data:
Seqlen128-data
batch_size | average_latency(ms) | batch_latency(ms) | gpu_memory(MB) |
---|---|---|---|
1 | 8.664 | 8.664 | 737 |
10 | 6.388 | 63.878 | 825 |
20 | 6.207 | 124.149 | 919 |
30 | 6.530 | 195.892 | 985 |
40 | 6.456 | 258.260 | 1085 |
50 | 6.445 | 322.275 | 1135 |
60 | 6.430 | 385.805 | 1279 |
70 | 6.457 | 451.965 | 1283 |
80 | 6.415 | 513.178 | 1339 |
90 | 6.481 | 583.301 | 1431 |
100 | 6.504 | 650.428 | 1485 |
Seqlen512-data
batch_size | average_latency(ms) | batch_latency(ms) | gpu_memory(MB) |
---|---|---|---|
1 | 36.123 | 36.123 | 781 |
10 | 31.663 | 316.630 | 1191 |
20 | 31.750 | 635.002 | 1637 |
30 | 32.280 | 968.412 | 2053 |
40 | 32.130 | 1285.212 | 2431 |
50 | 32.917 | 1645.835 | 2847 |
100 | 34.254 | 3425.380 | 4951 |
200 | out of GPU memory | | |
300 | out of GPU memory | | |
Owner: @NHZlX
GPU platform: Tesla P4
Software info:
Optimization work
Version | P4 time (ms) | Date recorded | PR | Description | Speedup |
---|---|---|---|---|---|
0 | 10 | 2020-02-04 | - | baseline; the CUDA 10 time is longer than the CUDA 9 time | - |
1 | 9.5 | 2020-02-04 | https://github.com/PaddlePaddle/Paddle/pull/22486 | Also fuse the 3 fc ops into multihead_matmul_op, merging the 3 small GEMMs into one large GEMM | 5% |
2 | 8.4 | 2020-02-04 | https://github.com/PaddlePaddle/Paddle/pull/22494 | Fuse the 3 lookup_table + elementwise_add + layer_norm into fused_embedding_eltwise_layernorm_op (see the sketch after this table) | 11.6% |
3 | 7.8 | 2020-02-05 | | In fused_fc_elementwise_layernorm_op, the in-block reduce for layer_norm used double; change it to float | |
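A minimal numpy sketch of what version 2's fused_embedding_eltwise_layernorm_op computes (shapes and names are illustrative, not the kernel implementation): the three embedding lookups, the elementwise_adds, and the layer_norm collapse into a single op over the input ids:

```python
import numpy as np

np.random.seed(0)
vocab, max_pos, hidden, seq_len, eps = 1000, 512, 768, 128, 1e-5
word_emb = np.random.rand(vocab, hidden)
pos_emb = np.random.rand(max_pos, hidden)
sent_emb = np.random.rand(2, hidden)
gamma, beta = np.random.rand(hidden), np.random.rand(hidden)

word_ids = np.random.randint(0, vocab, size=(1, seq_len))
pos_ids = np.arange(seq_len)[None, :]
sent_ids = np.zeros((1, seq_len), dtype=np.int64)

def fused_embedding_eltwise_layernorm(word_ids, pos_ids, sent_ids):
    # 3 x lookup_table + elementwise_add + layer_norm in a single pass.
    y = word_emb[word_ids] + pos_emb[pos_ids] + sent_emb[sent_ids]
    mu = y.mean(axis=-1, keepdims=True)
    var = y.var(axis=-1, keepdims=True)
    return (y - mu) / np.sqrt(var + eps) * gamma + beta

print(fused_embedding_eltwise_layernorm(word_ids, pos_ids, sent_ids).shape)  # (1, 128, 768)
```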
Owners
@Xreki @zhaoyuchen2018
Initial performance
Paddle commit:
NVIDIA's BERT inference solution, Faster Transformer, has been open-sourced.