1) Improve (vectorize/parallelize) the threshold adding in softmax fwd for training (see the sketch below)
2) Detailed profiling of NLP training (BERT training would be good)
3) Test the MKL-DNN JIT softmax when it becomes available in a 0.x MKL-DNN release
4) Improve softmax bwd: remove the PD creation that happens every iteration.
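For point 1, a minimal sketch of what the threshold ("ValueClip") adding looks like and where vectorization/parallelization would go. The function name, the -64 clip constant and the OpenMP scheme are illustrative assumptions, not the actual Paddle kernel:

#include <algorithm>
#include <cmath>

// Softmax forward over a row-major [n, d] buffer. For training, the shifted
// logits are clipped to a lower bound ("threshold adding" / ValueClip) before
// exp() so the backward pass never sees exact zeros.
void softmax_fwd_train(const float* x, float* y, int n, int d,
                       float clip = -64.0f) {
#pragma omp parallel for  // parallelize over rows
  for (int i = 0; i < n; ++i) {
    const float* xr = x + i * d;
    float* yr = y + i * d;
    const float mx = *std::max_element(xr, xr + d);
    float sum = 0.0f;
    // The clip + exp loop is the part that benefits from vectorization
    // (e.g. #pragma omp simd or a batched exp such as MKL's vsExp).
    for (int j = 0; j < d; ++j) {
      const float shifted = std::max(xr[j] - mx, clip);  // threshold/ValueClip
      yr[j] = std::exp(shifted);
      sum += yr[j];
    }
    const float inv = 1.0f / sum;
    for (int j = 0; j < d; ++j) yr[j] *= inv;
  }
}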
@jczaja Are all 4 improvements above for softmax_mkldnn_op?
Yes. Do you suggest looking at non-MKL-DNN as well? @tensor-tang implemented a JIT softmax in PaddlePaddle, but only for inference. Extending his code to work for forward training (provided the instruction set it uses matches your requirements) could be useful when softmax without MKL-DNN is used.
Since softmax_mkldnn_grad_op gives a 10x speedup in our application, we prefer to focus on softmax_mkldnn_op first.
@jczaja Could you create a PR when you finish the first improvement, "Improve (vectorize/parallelize) threshold adding in softmax fwd for training"?
@luotao1 Sure. Those points are just things I will try; not all of them will be implemented in one PR. OK, we will look at softmax_mkldnn first.
@luotao1 Just talked with Jacek. The current "extremely slow" performance of softmax FWD looks quite strange to us; there should be something wrong. Jacek will prepare a branch that adds some debug code and will ask for your help to run the model and collect the log. That should happen within today (PL time).
@luotao1 I would like you to execute the following experiments (with MKL-DNN execution) so we gather more data on the poor softmax fwd execution.
1) Please run training (two iterations will be enough) of ContentDNN with MKLDNN_VERBOSE=1, building Paddle from the following branch:
https://github.com/jczaja/Paddle/tree/prv-diagnostic-mkldnn
Reasoning: the branch contains more info from MKL-DNN, and the output will show which MKL-DNN softmax ref implementation is used, e.g. generic or dense.
2) Please run profiling of training (the same way the original profiling we were sent was made) of ContentDNN using the following branch of Paddle:
https://github.com/jczaja/Paddle/tree/prv-softmax-profiling
Reasoning: the branch contains additional profiling code inside the softmax fwd.
3) If possible, please run profiling of inference of ContentDNN from the develop branch.
Reasoning: knowing the inference profiling of ContentDNN, we can compare it to the training profiling and decide whether optimizing ValueClip makes sense.
Please send us the gathered output.
@jczaja Does step 2 depend on step 1? I am handling step 2 independently of step 1.
2019-05-14 06:58:06,911-INFO: 0 train_file -> armm.final.train.1000
2019-05-14 06:58:06,911-INFO: epoch 0 start
2019-05-14 06:58:07,489-INFO: epoch: 0, iter: 0, loss: 0.000998659431934 queue size: 64
2019-05-14 06:58:07,489-INFO: acc: 0.75 , correct: 12, total: 16, speed: 277.038613091
2019-05-14 06:58:11,058-INFO: epoch: 0, iter: 10, loss: 0.00984486818314 queue size: 64
2019-05-14 06:58:11,059-INFO: acc: 0.681818181818 , correct: 120, total: 176, speed: 44.8280110942
2019-05-14 06:58:13,661-INFO: epoch: 0, iter: 20, loss: 0.00980018548667 queue size: 64
2019-05-14 06:58:13,661-INFO: acc: 0.660714285714 , correct: 222, total: 336, speed: 61.4796743528
@jczaja Does MKLDNN_VERBOSE need to be enabled when running step 2? I told Luo Tao it's not necessary, since your code directly uses cout to print the log.
decide whether optimizing ValueClip makes sense
I think it may not help much.
@luotao1 Regarding step 1: please provide the full log, not only the MKLDNN_VERBOSE=1 part. I asked for MKLDNN_VERBOSE just to make sure MKL-DNN is in use, but I need the other part of the logs as well, since I just put printf calls into MKL-DNN. To be more precise, I need lines like: ===> Softmax forward DENSE and ===> Softmax forward GENERIC
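For clarity, a hypothetical illustration of what those markers correspond to. This is not the actual MKL-DNN source; only the printed strings come from the thread. The reference softmax primitive has a dense path and a generic path, and the added printf reports which one ran:

#include <cstdio>

// Hypothetical dispatch with the diagnostic prints added in the
// prv-diagnostic-mkldnn branch; function names are placeholders.
void ref_softmax_forward(bool dense_layout) {
  if (dense_layout) {
    std::printf("===> Softmax forward DENSE\n");    // dense/optimized path
    // execute_forward_dense();
  } else {
    std::printf("===> Softmax forward GENERIC\n");  // generic reference path
    // execute_forward_generic();
  }
}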
===> Softmax forward DENSE and ===> Softmax forward GENERIC
I don't see any log like this.
@luotao1 When running configuration, e.g. cmake ..., the PaddlePaddle version/commit used is printed. Could you please paste/confirm that line, e.g.
commit: df4872968
This is the branch that should be used for step 1. Please confirm.
For step 1, the git log result is:
commit f0ff32523932e1f7ddae0a856a4f357c1ad86608
Merge: b123002 df48729
Author: Tao Luo <luotao02@baidu.com>
Date: Tue May 14 11:30:14 2019 +0800
Merge branch 'prv-diagnostic-mkldnn' of https://github.com/jczaja/Paddle into jczaja_softmax
commit b123002f82a008892921c856ddca555c0b308b2d
@luotao1 Could you please build the step 1 branch with unit tests, run ctest -R test_softmax_mkldnn -VV, and tell me if you can see the "===> Softmax forward ..." messages?
@luotao1 Softmax forward DENSE is the faster kernel unless MKL-DNN is built without MKL. PaddlePaddle is built with MKL, so this should be the faster kernel.
Thanks very much for sending the logs. From step 1 we see that the proper MKL-DNN implementation is called. From step 2 we see that ~99% of softmax fwd time is spent in Execution, i.e. in MKL-DNN. This suggests that optimizing ValueClip does not make much sense for ContentDNN.
So the next two things I want to test are: 1) whether a non-MKL-DNN op is called along with the MKL-DNN op, for example some softmax ops having use_mkldnn=True and others use_mkldnn=False; that would explain poor performance together with the presence of softmax entries in MKLDNN_VERBOSE. I will prepare another diagnostic branch today. 2) The softmax MKL-DNN op does not support the axis param of normalization. The reason is that axis was introduced after softmax_mkldnn was added, so softmax_mkldnn always normalizes over the final dim. So my question is whether ContentDNN is using the axis param with a value different than -1 (the default value)?
So my question is whether ContentDNN is using the axis param with a value different than -1 (the default value)?
It uses the axis param with the default value -1 in this model.
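For reference, a sketch of the axis handling being discussed. The helper names are mine; Paddle's kernel does the equivalent flattening before calling SoftmaxFunctor with an {n, d} view and axis_dim, as visible in the softmax_op.h diff later in this thread:

#include <cstdint>
#include <vector>

// How a softmax over an arbitrary `axis` reduces to a 2D problem.
struct Softmax2DView {
  int64_t n;         // product of dims before `axis`
  int64_t d;         // product of dims from `axis` onward
  int64_t axis_dim;  // size of the dimension being normalized
};

Softmax2DView FlattenForSoftmax(const std::vector<int64_t>& dims, int axis) {
  if (axis < 0) axis += static_cast<int>(dims.size());  // canonicalize: -1 -> last dim
  Softmax2DView v{1, 1, dims[axis]};
  for (int i = 0; i < axis; ++i) v.n *= dims[i];
  for (int i = axis; i < static_cast<int>(dims.size()); ++i) v.d *= dims[i];
  return v;
}

// With axis == -1 (the ContentDNN case) d == axis_dim, so each of the n rows is
// normalized over the last dimension, which is exactly what softmax_mkldnn assumes;
// any other axis value would need handling that the MKL-DNN op currently lacks.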
@luotao1 I have seen the reference code for softmax as used in your internal framework. SoftmaxLayer::ff performs normalization over all input data, i.e. it assumes the input tensor is one-dimensional, but ContentDNN seems to use 2D/3D data for softmax. If SoftmaxLayer::ff is to functionally match what PaddlePaddle's softmax is doing, then either the outer dims must be 1 (for NCW: N=1, C=1), or SoftmaxLayer::ff must be called BS times per each call of PaddlePaddle softmax. If SoftmaxLayer::ff instead computes softmax for the whole tensor regardless of the outer dims (N in 2D; N, C in 3D), then it will be much faster than Paddle, but the result will be functionally different.
Questions: 1) What batch size are you using? 2) If the batch size is bigger than 1 (let's assume 16), does a single SoftmaxLayer::ff call compute softmax for all 16 inputs, or will SoftmaxLayer::ff be called 16 times?
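To make the functional question concrete, a small sketch of the two possible semantics (my own code, not the internal framework's):

#include <algorithm>
#include <cmath>

// Normalize `len` contiguous values from x into y.
static void softmax_1d(const float* x, float* y, int len) {
  const float mx = *std::max_element(x, x + len);
  float sum = 0.0f;
  for (int j = 0; j < len; ++j) { y[j] = std::exp(x[j] - mx); sum += y[j]; }
  for (int j = 0; j < len; ++j) y[j] /= sum;
}

// PaddlePaddle-style semantics: one independent softmax per row of [rows, width].
void softmax_per_row(const float* x, float* y, int rows, int width) {
  for (int r = 0; r < rows; ++r)
    softmax_1d(x + r * width, y + r * width, width);
}

// "Whole tensor" semantics: a single normalization over all rows*width values.
// Cheaper (one reduction), but the result differs whenever rows > 1.
void softmax_whole(const float* x, float* y, int rows, int width) {
  softmax_1d(x, y, rows * width);
}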
It is SeqSoftmaxLayer::ff, and it is called 16 times.
- SeqSoftmaxLayer::ff has to be called for each batch element separately, e.g. 16 times for [N=16, C, W]
No, SeqSoftmaxLayer::ff is called 1 time, and there is for (int i = 0; i < seq_size; ++i) inside SeqSoftmaxLayer::ff, i.e. SeqSoftmaxLayer::ff internally computes the inputs one by one.
- So in the internal framework it would correspond to 4560*16 calls to SeqSoftmaxLayer
No, there are 4560 calls to SeqSoftmaxLayer.
- Are you looking at TotalValue or Average execution of SeqSoftmaxLayer?
We look at the TotalValue.
export MKLDNN_VERBOSE=1
- Before SeqSoftmaxLayer do you have some flatten layer to resize N, C, W --> N*C, W?
The input of SeqSoftmaxLayer has already been flattened; that's why Paddle uses an additional transpose2 op to do this.
Thanks for the answers and logs. They are very helpful.
Meanwhile we inspected the slowness of the MKL-DNN softmax, since from the previous findings we know that 99% of the softmax fwd op time is spent in MKL-DNN. Having the MKLDNN_VERBOSE output from ContentDNN we can see:
mkldnn_verbose,info,Intel(R) MKL-DNN v0.18.0 (Git Hash 863ff6e7042cec7d2e29897fe9f0872e0888b0fc),Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2)
so AVX2 is used for MKL-DNN, and also some timings for softmax execution (the last value in each line, in ms):
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,15.856
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,16.625
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,19.158
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,23.78
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,27.8979
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,20.406
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,15.55
We modified the unit test to execute a softmax of the same dims on MKL-DNN and the results are:
mkldnn_verbose,info,Intel(R) MKL-DNN v0.18.0 (Git Hash 863ff6e7042cec7d2e29897fe9f0872e0888b0fc),Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2)
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,21.1709
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.567139
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.401855
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.408203
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.419922
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.382812
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.441895
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.382812
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.430908
Comparing the times, we can see that the results we gathered from the UT are much lower (apart from the first one) than those taken from ContentDNN training. So I would like you to get us some more data:
1) Please run: MKLDNN_VERBOSE=1 ctest -R test_softmax_mkldnn -VV. It will take a lot of time to finish, so please interrupt it after the first hundred lines and send us the log. For this step please use the following branch: https://github.com/jczaja/Paddle/tree/prv-softmax-ut-experiment
2) Please run: MKL_VERBOSE=1 MKLDNN_VERBOSE=1 ctest -R test_softmax_mkldnn -VV. The first 100 lines will be enough. Use the same branch as in step 1.
3) Please paste the output of lscpu from the machine you are running the training of ContentDNN on.
The local machine I used for the presented data: Model name: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
I tested on three different machines, but we have the same problem:
processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
stepping : 1
microcode : 0xb000021
cpu MHz : 2101.000
cache size : 20480 KB
physical id : 1
siblings : 8
core id : 7
cpu cores : 8
apicid : 30
initial apicid : 30
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc
bogomips : 4195.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 8
CPU socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Stepping: 1
CPU MHz: 2101.000
BogoMIPS: 4195.99
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
processor : 23
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
stepping : 1
microcode : 0xb00001b
cpu MHz : 2200.000
cache size : 30720 KB
physical id : 1
siblings : 12
core id : 13
cpu cores : 12
apicid : 58
initial apicid : 58
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc
bogomips : 4405.17
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 12
CPU socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Stepping: 1
CPU MHz: 2200.000
BogoMIPS: 4405.17
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-23
processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz
stepping : 4
microcode : 0x416
cpu MHz : 2500.000
cache size : 20480 KB
physical id : 1
siblings : 16
core id : 7
cpu cores : 8
apicid : 47
initial apicid : 47
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt
bogomips : 5005.16
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
CPU socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Stepping: 4
CPU MHz: 2500.000
BogoMIPS: 5005.16
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-31
@luotao1 Just a small comment. The logs presented above suggest that the input data for softmax may not be in cache memory. A possible reason is that the model is quite broad/complex, and before softmax gets to execute, its data has already been evicted from the cache. As your internal framework works fine on this model, it may mean that PaddlePaddle as a framework has a bit more overhead than the internal framework, and rewriting softmax into JIT will help a bit, but perhaps not as much as expected. I would suggest checking the performance of the JIT solution on that model as soon as possible (as soon as forward JIT softmax works for training).
PaddlePaddle as a framework has a bit more overhead
The framework overhead is quite small on this model.
thread0::softmax 4560 19986 0.01525 21.8816 4.38289 0.0325586
thread0::softmax_compute 4560 19928.6 0.00504 21.8569 4.37031 0.0324652
softmax_compute means the compute function of softmax_op.
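As a side note on how to read those two rows: the totals (19986 for the whole softmax op vs 19928.6 for its compute function, over 4560 calls) differ by well under 1%, so nearly all of the op's time is inside the compute body. A minimal sketch of the idea of separating the two timings (the real numbers above come from Paddle's profiler, not from this code):

#include <chrono>

struct Timer {
  using clk = std::chrono::steady_clock;
  clk::time_point t0 = clk::now();
  double ms() const {
    return std::chrono::duration<double, std::milli>(clk::now() - t0).count();
  }
};

// Returns the per-call framework overhead: whole-op time minus compute-only time.
double run_softmax_op_once() {
  Timer whole_op;      // corresponds to the "softmax" row above
  // ... framework work: variable lookup, shape inference, kernel dispatch ...
  Timer compute_only;  // corresponds to the "softmax_compute" row above
  // ... the actual softmax math ...
  return whole_op.ms() - compute_only.ms();
}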
step1_ut.log, step2_ut.log. @jczaja please see the UT logs.
Thanks very much for the logs; they suggest that the machines are fine.
I'm sorry I did not write it very clearly. If PaddlePaddle does some reads and writes between the transpose and the corresponding softmax ops (more than the internal framework), then the output from transpose may no longer be in cache, and softmax has to take its input from RAM rather than from cache memory; then even if JIT is used, it may run slower due to the time spent waiting for data from RAM. Anyway, hopefully all is fine with JIT softmax performance for ContentDNN.
@Jacek, so what's the next step on your side to further check MKL-DNN softmax FWD?
@jianhang-liu The next step should be to enable PaddlePaddle's softmax JIT for this model. If this works very fast, then the MKL-DNN softmax implementation is poor; if the JIT softmax does not perform well either, then I expect that execution is waiting for input data to be fetched from memory, and this is some inefficiency in PaddlePaddle. If @tensor-tang is going to look at it very soon, then I do not have anything to do here; I can only check whether the MKL-DNN team has some update on their softmax implementation improvement and test it if possible. Do you have other suggestions?
@jczaja @tensor-tang @luotao1 Let's recap where we are after a few days of investigation.
What's ongoing now:
@luotao1 Could you please,
In both scenarios functional problems will arise, e.g. convergence problems, as those are not production-quality branches. We are only checking performance.
diff --git a/paddle/fluid/operators/softmax_op.h b/paddle/fluid/operators/softmax_op.h
index a964c3b..56b554d 100644
--- a/paddle/fluid/operators/softmax_op.h
+++ b/paddle/fluid/operators/softmax_op.h
@@ -64,15 +64,12 @@ class SoftmaxKernel : public framework::OpKernel<T> {
X_2d.ShareDataWith(*X).Resize({n, d});
Out_2d.ShareDataWith(*Out).Resize({n, d});
-#ifdef PADDLE_ON_INFERENCE
math::SoftmaxFunctor<DeviceContext, T, true>()(
context.template device_context<DeviceContext>(), axis_dim, &X_2d,
&Out_2d);
-#else
- math::SoftmaxFunctor<DeviceContext, T, false>()(
- context.template device_context<DeviceContext>(), axis_dim, &X_2d,
- &Out_2d);
-#endif
@jczaja @jianhang-liu @tensor-tang I directly used the inference JIT softmax fwd (with the diff above, which removes the PADDLE_ON_INFERENCE guard so SoftmaxFunctor<DeviceContext, T, true> is always used), but it is as slow as before.
- Check the performance improvement of MKL-DNN softmax when e^x is replaced with memcpy
The performance is 8x slower than before.
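A hypothetical sketch of what the memcpy experiment means (the real change was made inside MKL-DNN's softmax; this only shows the idea): the exp stage is replaced by a plain byte copy, which makes the output numerically wrong but isolates how much of the kernel cost is the exponential versus memory traffic.

#include <cmath>
#include <cstring>

// exp stage of softmax fwd: either the real e^x or the memcpy stand-in.
void softmax_exp_stage(const float* shifted, float* out, int len, bool use_memcpy) {
  if (use_memcpy) {
    // Experiment: skip the exponential and just move the bytes. The result is
    // numerically wrong, but timing this variant shows how much of the kernel
    // cost is exp() itself versus reading/writing the data.
    std::memcpy(out, shifted, sizeof(float) * len);
  } else {
    for (int j = 0; j < len; ++j) out[j] = std::exp(shifted[j]);
  }
}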
- Check performance improvement when using jitted softmax forward.
see https://github.com/PaddlePaddle/Paddle/issues/17268#issuecomment-493844656
@luotao1 Could you please provide the log (MKLDNN_VERBOSE + paddle profiling) for step 1?
@luotao1 Regarding step 1, 8x slower with memcpy instead of vsexp is very surprising.
I was able to look into the detailed log (VLOG output) of ContentDNN training. I looked at what happens between transpose2 and the corresponding softmax. For the slowest one, e.g. 16x100x100, we have a few other ops executing on different data than what is used in softmax. So we suspected that perhaps the buffers processed after transpose2 and before softmax are overwriting the data cache, and the softmax input then has to be fetched from main memory. But the VLOG logs suggest that around ~3 MB of data is processed in between, which is not enough to evict the L3 cache (assuming no other core is doing computation). Question: 1) The L3 cache is shared among cores. I'm guessing that no other work happens on the platform. Could you please make sure no other VMs/dockers are running on the machine where you test, and then check performance?
On the other hand, I heard from @jianhang-liu that @tensor-tang put the internal framework's softmax into PaddlePaddle and performance was largely improved. If you could share your findings, that would be great.
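For scale (my own back-of-envelope arithmetic, assuming fp32 data): the 16x100x100 softmax input is 16 * 100 * 100 * 4 B ≈ 0.64 MB (roughly 1.3 MB together with its output), and the L3 caches on the machines listed above are 20-30 MB, so the ~3 MB touched between transpose2 and softmax should indeed not be enough to evict it on an otherwise idle socket.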
- Could you please make sure no other VMs/dockers are running on the machine where you test, and then check performance?
Yes, no other VMs/dockers are running.
If you could share your findings, that would be great
I have sent an email to you.
Could you please provide the log (MKLDNN_VERBOSE + paddle profiling) for step 1?
For step 1, it causes an error after iteration 600. Thus, I paste the log (MKLDNN_VERBOSE + paddle profiling) for iteration 400.
memcpy.log
From the log, we find that mul_grad costs a lot more time than before, but softmax speeds up.
thread0::mul_grad 19152 47011 0.010904 26.5146 2.45463 0.18187
thread0::softmax 4560 18687.2 0.025755 30.3115 4.09807 0.0722947
thread0::mul_grad 8400 593414 0.011221 1251.54 70.6445 0.884324
thread0::softmax 2000 313.278 0.030751 0.679793 0.156639 0.000466857
@jczaja @jianhang-liu @tensor-tang I directly used the inference JIT softmax fwd, but it is as slow as before.
@luotao1 Actually JIT helps a little, according to @zhupengyang's test: from 17151 => 13091.
The reason it does not help much is that the sizes on ContentDNN keep changing and are small; JIT creation takes time.
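A rough sketch of why changing, small sizes hurt a JIT approach (my own structure, not Paddle's actual implementation): a jitted kernel is typically generated and cached per shape, so every new size pays the code-generation cost, which can dominate the math for small tensors.

#include <algorithm>
#include <cmath>
#include <functional>
#include <unordered_map>

using SoftmaxKernel = std::function<void(const float*, float*)>;

// Stand-in for real codegen: a true JIT would assemble machine code specialized
// for `len` here, and that generation step is the expensive part.
SoftmaxKernel GenerateJitSoftmax(int len) {
  return [len](const float* x, float* y) {
    const float mx = *std::max_element(x, x + len);
    float sum = 0.0f;
    for (int j = 0; j < len; ++j) { y[j] = std::exp(x[j] - mx); sum += y[j]; }
    for (int j = 0; j < len; ++j) y[j] /= sum;
  };
}

SoftmaxKernel GetSoftmaxKernel(int len) {
  static std::unordered_map<int, SoftmaxKernel> cache;  // one kernel per distinct size
  auto it = cache.find(len);
  if (it == cache.end()) {
    // Cache miss: with many distinct small sizes this path is hit repeatedly,
    // and the codegen time can outweigh the time saved inside the kernel.
    it = cache.emplace(len, GenerateJitSoftmax(len)).first;
  }
  return it->second;
}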
The better solution for this case is proposed in #17522.
This issue is closed by PRs #17522 and #17534 from @tensor-tang. Meanwhile, the Intel MKL-DNN team is also working on a JIT version of softmax, which will be in v1.0 (and possibly backported to v0.1x as well).
Need to optimize Softmax (fwd+bwd) for CPU training