1) Improve (vectorize/parallelize) the threshold adding in softmax fwd for training (see the sketch below)
2) Detailed profiling of NLP training (BERT training would be good)
3) Test the MKL-DNN JIT softmax when it becomes available in a 0.x MKL-DNN release
4) Improve softmax bwd: remove the PD creation that happens every iteration.
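For point 1, a minimal sketch of what the threshold ("ValueClip") adding looks like and where vectorization/parallelization would go. The function name, the -64 clip constant and the OpenMP scheme are illustrative assumptions, not the actual Paddle kernel:

#include <algorithm>
#include <cmath>

// Softmax forward over a row-major [n, d] buffer. For training, the shifted
// logits are clipped to a lower bound ("threshold adding" / ValueClip) before
// exp() so the backward pass never sees exact zeros.
void softmax_fwd_train(const float* x, float* y, int n, int d,
                       float clip = -64.0f) {
#pragma omp parallel for  // parallelize over rows
  for (int i = 0; i < n; ++i) {
    const float* xr = x + i * d;
    float* yr = y + i * d;
    const float mx = *std::max_element(xr, xr + d);
    float sum = 0.0f;
    // The clip + exp loop is the part that benefits from vectorization
    // (e.g. #pragma omp simd or a batched exp such as MKL's vsExp).
    for (int j = 0; j < d; ++j) {
      const float shifted = std::max(xr[j] - mx, clip);  // threshold/ValueClip
      yr[j] = std::exp(shifted);
      sum += yr[j];
    }
    const float inv = 1.0f / sum;
    for (int j = 0; j < d; ++j) yr[j] *= inv;
  }
}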
@jczaja Are all 4 improvements above for softmax_mkldnn_op?
Yes. Do you suggest looking at non-MKL-DNN as well? @tensor-tang implemented a JIT softmax in PaddlePaddle, but only for inference. Extending his code to work for forward training (provided the instruction set it uses matches your requirements) could be useful when softmax without MKL-DNN is used.
Since softmax_mkldnn_grad_op gives a 10x speedup in our application, we prefer to focus on softmax_mkldnn_op first.
@jczaja Could you create a PR when you finish the first improvement, "Improve (vectorize/parallelize) threshold adding in softmax fwd for training"?
@luotao1 Sure. Those points are just things I will try; not all of them will be implemented in one PR. OK, we will look at softmax_mkldnn first.
@luotao1 Just talked with Jacek. The current "extremely slow" performance of softmax FWD looks quite strange to us; there should be something wrong. Jacek will prepare a branch that adds some debug code and will ask for your help to run the model and collect the log. That should happen within today (PL time).
@luotao1 I would like you to execute the following experiments (with MKL-DNN execution) so we gather more data on the poor softmax fwd execution.
1) Please run training (two iterations will be enough) of ContentDNN with MKLDNN_VERBOSE=1, building Paddle from the following branch:
https://github.com/jczaja/Paddle/tree/prv-diagnostic-mkldnn
Reasoning: the branch contains more info from MKL-DNN, and the output will show which MKL-DNN softmax ref implementation is used, e.g. generic or dense.
2) Please run profiling of training (the same way the original profiling we were sent was made) of ContentDNN using the following branch of Paddle:
https://github.com/jczaja/Paddle/tree/prv-softmax-profiling
Reasoning: the branch contains additional profiling code inside the softmax fwd.
3) If possible, please run profiling of inference of ContentDNN from the develop branch.
Reasoning: knowing the inference profiling of ContentDNN, we can compare it to the training profiling and decide whether optimizing ValueClip makes sense.
Please send us the gathered output.
@jczaja Does step 2 depend on step 1? I am handling step 2 independently of step 1.
2019-05-14 06:58:06,911-INFO: 0 train_file -> armm.final.train.1000
2019-05-14 06:58:06,911-INFO: epoch 0 start
2019-05-14 06:58:07,489-INFO: epoch: 0, iter: 0, loss: 0.000998659431934 queue size: 64
2019-05-14 06:58:07,489-INFO: acc: 0.75 , correct: 12, total: 16, speed: 277.038613091
2019-05-14 06:58:11,058-INFO: epoch: 0, iter: 10, loss: 0.00984486818314 queue size: 64
2019-05-14 06:58:11,059-INFO: acc: 0.681818181818 , correct: 120, total: 176, speed: 44.8280110942
2019-05-14 06:58:13,661-INFO: epoch: 0, iter: 20, loss: 0.00980018548667 queue size: 64
2019-05-14 06:58:13,661-INFO: acc: 0.660714285714 , correct: 222, total: 336, speed: 61.4796743528
@jczaja Does MKLDNN_VERBOSE need to be enabled when running step 2? I told Luo Tao it's not necessary, since your code directly uses cout to print the log.
decide whether optimizing ValueClip makes sense
I think it may not help much.
@luotao1 Regarding step 1: please provide the full log, not only the MKLDNN_VERBOSE=1 part. I asked for MKLDNN_VERBOSE just to make sure MKL-DNN is in use, but I need the other part of the logs as well, since I just put printf calls into MKL-DNN. To be more precise, I need lines like: ===> Softmax forward DENSE and ===> Softmax forward GENERIC
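For clarity, a hypothetical illustration of what those markers correspond to. This is not the actual MKL-DNN source; only the printed strings come from the thread. The reference softmax primitive has a dense path and a generic path, and the added printf reports which one ran:

#include <cstdio>

// Hypothetical dispatch with the diagnostic prints added in the
// prv-diagnostic-mkldnn branch; function names are placeholders.
void ref_softmax_forward(bool dense_layout) {
  if (dense_layout) {
    std::printf("===> Softmax forward DENSE\n");    // dense/optimized path
    // execute_forward_dense();
  } else {
    std::printf("===> Softmax forward GENERIC\n");  // generic reference path
    // execute_forward_generic();
  }
}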
===> Softmax forward DENSE and ===> Softmax forward GENERIC
I don't see any log like this.
@luotao1 When running configuration, e.g. cmake ..., the PaddlePaddle version/commit used is printed. Could you please paste/confirm that line, e.g.
commit: df4872968
This is the branch that should be used for step 1. Please confirm.
For step 1, the git log result is:
commit f0ff32523932e1f7ddae0a856a4f357c1ad86608
Merge: b123002 df48729
Author: Tao Luo <luotao02@baidu.com>
Date: Tue May 14 11:30:14 2019 +0800
Merge branch 'prv-diagnostic-mkldnn' of https://github.com/jczaja/Paddle into jczaja_softmax
commit b123002f82a008892921c856ddca555c0b308b2d
@luotao1 Could you please build the step 1 branch with unit tests, run ctest -R test_softmax_mkldnn -VV, and tell me if you can see the "===> Softmax forward ..." messages?
@luotao1 Softmax forward DENSE is the faster kernel unless MKL-DNN is built without MKL. PaddlePaddle is built with MKL, so this should be the faster kernel.
Thanks very much for sending the logs. From step 1 we see that the proper MKL-DNN implementation is called. From step 2 we see that ~99% of softmax fwd time is spent in Execution, i.e. in MKL-DNN. This suggests that optimizing ValueClip does not make much sense for ContentDNN.
So the next two things I want to test are: 1) whether a non-MKL-DNN op is called along with the MKL-DNN op, for example some softmax ops having use_mkldnn=True and others use_mkldnn=False; that would explain poor performance together with the presence of softmax entries in MKLDNN_VERBOSE. I will prepare another diagnostic branch today. 2) The softmax MKL-DNN op does not support the axis param of normalization. The reason is that axis was introduced after softmax_mkldnn was added, so softmax_mkldnn always normalizes over the final dim. So my question is whether ContentDNN is using the axis param with a value different than -1 (the default value)?
So my question is whether ContentDNN is using the axis param with a value different than -1 (the default value)?
It uses the axis param with the default value -1 in this model.
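For reference, a sketch of the axis handling being discussed. The helper names are mine; Paddle's kernel does the equivalent flattening before calling SoftmaxFunctor with an {n, d} view and axis_dim, as visible in the softmax_op.h diff later in this thread:

#include <cstdint>
#include <vector>

// How a softmax over an arbitrary `axis` reduces to a 2D problem.
struct Softmax2DView {
  int64_t n;         // product of dims before `axis`
  int64_t d;         // product of dims from `axis` onward
  int64_t axis_dim;  // size of the dimension being normalized
};

Softmax2DView FlattenForSoftmax(const std::vector<int64_t>& dims, int axis) {
  if (axis < 0) axis += static_cast<int>(dims.size());  // canonicalize: -1 -> last dim
  Softmax2DView v{1, 1, dims[axis]};
  for (int i = 0; i < axis; ++i) v.n *= dims[i];
  for (int i = axis; i < static_cast<int>(dims.size()); ++i) v.d *= dims[i];
  return v;
}

// With axis == -1 (the ContentDNN case) d == axis_dim, so each of the n rows is
// normalized over the last dimension, which is exactly what softmax_mkldnn assumes;
// any other axis value would need handling that the MKL-DNN op currently lacks.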
@luotao1 I have seen the reference code for softmax as used in your internal framework. SoftmaxLayer::ff performs normalization over all input data, i.e. it assumes the input tensor is one-dimensional, but ContentDNN seems to use 2D/3D data for softmax. If SoftmaxLayer::ff is to functionally match what PaddlePaddle's softmax is doing, then either the outer dims must be 1 (for NCW: N=1, C=1), or SoftmaxLayer::ff must be called BS times per each call of PaddlePaddle softmax. If SoftmaxLayer::ff instead computes softmax for the whole tensor regardless of the outer dims (N in 2D; N, C in 3D), then it will be much faster than Paddle, but the result will be functionally different.
Questions: 1) What batch size are you using? 2) If the batch size is bigger than 1 (let's assume 16), does a single SoftmaxLayer::ff call compute softmax for all 16 inputs, or will SoftmaxLayer::ff be called 16 times?
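To make the functional question concrete, a small sketch of the two possible semantics (my own code, not the internal framework's):

#include <algorithm>
#include <cmath>

// Normalize `len` contiguous values from x into y.
static void softmax_1d(const float* x, float* y, int len) {
  const float mx = *std::max_element(x, x + len);
  float sum = 0.0f;
  for (int j = 0; j < len; ++j) { y[j] = std::exp(x[j] - mx); sum += y[j]; }
  for (int j = 0; j < len; ++j) y[j] /= sum;
}

// PaddlePaddle-style semantics: one independent softmax per row of [rows, width].
void softmax_per_row(const float* x, float* y, int rows, int width) {
  for (int r = 0; r < rows; ++r)
    softmax_1d(x + r * width, y + r * width, width);
}

// "Whole tensor" semantics: a single normalization over all rows*width values.
// Cheaper (one reduction), but the result differs whenever rows > 1.
void softmax_whole(const float* x, float* y, int rows, int width) {
  softmax_1d(x, y, rows * width);
}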
It is SeqSoftmaxLayer::ff, and it is called 16 times.
- SeqSoftmaxLayer::ff has to be called for each batch element separately, e.g. 16 times for [N=16, C, W]
No, SeqSoftmaxLayer::ff is called 1 time, and there is for (int i = 0; i < seq_size; ++i) inside SeqSoftmaxLayer::ff, i.e. SeqSoftmaxLayer::ff internally computes the inputs one by one.
- So in the internal framework it would correspond to 4560*16 calls to SeqSoftmaxLayer
No, there are 4560 calls to SeqSoftmaxLayer.
- Are you looking at TotalValue or Average execution of SeqSoftmaxLayer?
We look at the TotalValue.
export MKLDNN_VERBOSE=1
- Before SeqSoftmaxLayer do you have some flatten layer to resize N, C, W --> N*C, W?
The input of SeqSoftmaxLayer has already been flattened; that's why Paddle uses an additional transpose2 op to do this.
Thanks for the answers and logs. They are very helpful.
Meanwhile we inspected the slowness of the MKL-DNN softmax, since from the previous findings we know that 99% of the softmax fwd op time is spent in MKL-DNN. Having the MKLDNN_VERBOSE output from ContentDNN we can see:
mkldnn_verbose,info,Intel(R) MKL-DNN v0.18.0 (Git Hash 863ff6e7042cec7d2e29897fe9f0872e0888b0fc),Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2)
so AVX2 is used for MKL-DNN, and also some timings for softmax execution (the last value in each line, in ms):
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,15.856
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,16.625
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,19.158
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,23.78
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,27.8979
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,20.406
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,15.55
We modified the unit test to execute a softmax of the same dims on MKL-DNN and the results are:
mkldnn_verbose,info,Intel(R) MKL-DNN v0.18.0 (Git Hash 863ff6e7042cec7d2e29897fe9f0872e0888b0fc),Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2)
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,21.1709
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.567139
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.401855
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.408203
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.419922
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.382812
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.441895
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.382812
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,0.430908
Comparing the times, we can see that the results we gathered from the UT are much lower (apart from the first one) than those taken from ContentDNN training. So I would like you to get us some more data:
1) Please run: MKLDNN_VERBOSE=1 ctest -R test_softmax_mkldnn -VV. It will take a lot of time to finish, so please interrupt it after the first hundred lines and send us the log. For this step please use the following branch: https://github.com/jczaja/Paddle/tree/prv-softmax-ut-experiment
2) Please run: MKL_VERBOSE=1 MKLDNN_VERBOSE=1 ctest -R test_softmax_mkldnn -VV. The first 100 lines will be enough. Use the same branch as in step 1.
3) Please paste the output of lscpu from the machine you are running the training of ContentDNN on.
The local machine I used for the presented data: Model name: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
I tested on three different machines, but we have the same problem:
processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
stepping : 1
microcode : 0xb000021
cpu MHz : 2101.000
cache size : 20480 KB
physical id : 1
siblings : 8
core id : 7
cpu cores : 8
apicid : 30
initial apicid : 30
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc
bogomips : 4195.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 8
CPU socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Stepping: 1
CPU MHz: 2101.000
BogoMIPS: 4195.99
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
processor : 23
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
stepping : 1
microcode : 0xb00001b
cpu MHz : 2200.000
cache size : 30720 KB
physical id : 1
siblings : 12
core id : 13
cpu cores : 12
apicid : 58
initial apicid : 58
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc
bogomips : 4405.17
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 12
CPU socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Stepping: 1
CPU MHz: 2200.000
BogoMIPS: 4405.17
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-23
processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz
stepping : 4
microcode : 0x416
cpu MHz : 2500.000
cache size : 20480 KB
physical id : 1
siblings : 16
core id : 7
cpu cores : 8
apicid : 47
initial apicid : 47
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt
bogomips : 5005.16
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
CPU socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Stepping: 4
CPU MHz: 2500.000
BogoMIPS: 5005.16
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-31
@luotao1 Just a small comment. The logs presented above suggest that the input data for softmax may not be in cache memory. A possible reason is that the model is quite broad/complex, and before softmax gets to execute, its data has already been evicted from the cache. As your internal framework works fine on this model, it may mean that PaddlePaddle as a framework has a bit more overhead than the internal framework, and rewriting softmax into JIT will help a bit, but perhaps not as much as expected. I would suggest checking the performance of the JIT solution on that model as soon as possible (as soon as forward JIT softmax works for training).
PaddlePaddle as a framework has a bit more overhead
The framework overhead is quite small on this model.
thread0::softmax 4560 19986 0.01525 21.8816 4.38289 0.0325586
thread0::softmax_compute 4560 19928.6 0.00504 21.8569 4.37031 0.0324652
softmax_compute means the compute function of softmax_op.
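As a side note on how to read those two rows: the totals (19986 for the whole softmax op vs 19928.6 for its compute function, over 4560 calls) differ by well under 1%, so nearly all of the op's time is inside the compute body. A minimal sketch of the idea of separating the two timings (the real numbers above come from Paddle's profiler, not from this code):

#include <chrono>

struct Timer {
  using clk = std::chrono::steady_clock;
  clk::time_point t0 = clk::now();
  double ms() const {
    return std::chrono::duration<double, std::milli>(clk::now() - t0).count();
  }
};

// Returns the per-call framework overhead: whole-op time minus compute-only time.
double run_softmax_op_once() {
  Timer whole_op;      // corresponds to the "softmax" row above
  // ... framework work: variable lookup, shape inference, kernel dispatch ...
  Timer compute_only;  // corresponds to the "softmax_compute" row above
  // ... the actual softmax math ...
  return whole_op.ms() - compute_only.ms();
}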
step1_ut.log, step2_ut.log. @jczaja please see the UT logs.
Thanks very much for the logs; they suggest that the machines are fine.
I'm sorry I did not write it very clearly. If PaddlePaddle does some reads and writes between the transpose and the corresponding softmax ops (more than the internal framework), then the output from transpose may no longer be in cache, and softmax has to take its input from RAM rather than from cache memory; then even if JIT is used, it may run slower due to the time spent waiting for data from RAM. Anyway, hopefully all is fine with JIT softmax performance for ContentDNN.
@Jacek, so what's the next step on your side to further check MKL-DNN softmax FWD?
@jianhang-liu The next step should be to enable PaddlePaddle's softmax JIT for this model. If this works very fast, then the MKL-DNN softmax implementation is poor; if the JIT softmax does not perform well either, then I expect that execution is waiting for input data to be fetched from memory, and this is some inefficiency in PaddlePaddle. If @tensor-tang is going to look at it very soon, then I do not have anything to do here; I can only check whether the MKL-DNN team has some update on their softmax implementation improvement and test it if possible. Do you have other suggestions?
@jczaja @tensor-tang @luotao1 Let's recap where we are after a few days of investigation.
What's ongoing now:
@luotao1 Could you please,
In both scenarios functional problems will arise, e.g. convergence problems, as those are not production-quality branches. We are only checking performance.
diff --git a/paddle/fluid/operators/softmax_op.h b/paddle/fluid/operators/softmax_op.h
index a964c3b..56b554d 100644
--- a/paddle/fluid/operators/softmax_op.h
+++ b/paddle/fluid/operators/softmax_op.h
@@ -64,15 +64,12 @@ class SoftmaxKernel : public framework::OpKernel<T> {
X_2d.ShareDataWith(*X).Resize({n, d});
Out_2d.ShareDataWith(*Out).Resize({n, d});
-#ifdef PADDLE_ON_INFERENCE
math::SoftmaxFunctor<DeviceContext, T, true>()(
context.template device_context<DeviceContext>(), axis_dim, &X_2d,
&Out_2d);
-#else
- math::SoftmaxFunctor<DeviceContext, T, false>()(
- context.template device_context<DeviceContext>(), axis_dim, &X_2d,
- &Out_2d);
-#endif
@jczaja @jianhang-liu @tensor-tang I directly used the inference JIT softmax fwd (with the diff above, which removes the PADDLE_ON_INFERENCE guard so SoftmaxFunctor<DeviceContext, T, true> is always used), but it is as slow as before.
- Check the performance improvement of MKL-DNN softmax when e^x is replaced with memcpy
The performance is 8x slower than before.
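A hypothetical sketch of what the memcpy experiment means (the real change was made inside MKL-DNN's softmax; this only shows the idea): the exp stage is replaced by a plain byte copy, which makes the output numerically wrong but isolates how much of the kernel cost is the exponential versus memory traffic.

#include <cmath>
#include <cstring>

// exp stage of softmax fwd: either the real e^x or the memcpy stand-in.
void softmax_exp_stage(const float* shifted, float* out, int len, bool use_memcpy) {
  if (use_memcpy) {
    // Experiment: skip the exponential and just move the bytes. The result is
    // numerically wrong, but timing this variant shows how much of the kernel
    // cost is exp() itself versus reading/writing the data.
    std::memcpy(out, shifted, sizeof(float) * len);
  } else {
    for (int j = 0; j < len; ++j) out[j] = std::exp(shifted[j]);
  }
}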
- Check performance improvement when using jitted softmax forward.
see https://github.com/PaddlePaddle/Paddle/issues/17268#issuecomment-493844656
@luotao1 Could you please provide the log (MKLDNN_VERBOSE + paddle profiling) for step 1?
@luotao1 Regarding step 1, 8x slower with memcpy instead of vsexp is very surprising.
I was able to look into the detailed log (VLOG output) of ContentDNN training. I looked at what happens between transpose2 and the corresponding softmax. For the slowest one, e.g. 16x100x100, we have a few other ops executing on different data than what is used in softmax. So we suspected that perhaps the buffers processed after transpose2 and before softmax are overwriting the data cache, and the softmax input then has to be fetched from main memory. But the VLOG logs suggest that around ~3 MB of data is processed in between, which is not enough to evict the L3 cache (assuming no other core is doing computation). Question: 1) The L3 cache is shared among cores. I'm guessing that no other work happens on the platform. Could you please make sure no other VMs/dockers are running on the machine where you test, and then check performance?
On the other hand, I heard from @jianhang-liu that @tensor-tang put the internal framework's softmax into PaddlePaddle and performance was largely improved. If you could share your findings, that would be great.
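For scale (my own back-of-envelope arithmetic, assuming fp32 data): the 16x100x100 softmax input is 16 * 100 * 100 * 4 B ≈ 0.64 MB (roughly 1.3 MB together with its output), and the L3 caches on the machines listed above are 20-30 MB, so the ~3 MB touched between transpose2 and softmax should indeed not be enough to evict it on an otherwise idle socket.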
- Could you please make sure no other VMs/dockers are running on the machine where you test, and then check performance?
Yes, no other VMs/dockers are running.
If you could share your findings, that would be great
I have sent an email to you.
Could you please provide the log (MKLDNN_VERBOSE + paddle profiling) for step 1?
For step 1, it causes an error after iteration 600. Thus, I paste the log (MKLDNN_VERBOSE + paddle profiling) for iteration 400.
memcpy.log
From the log, we find that mul_grad costs a lot more time than before, but softmax speeds up.
thread0::mul_grad 19152 47011 0.010904 26.5146 2.45463 0.18187
thread0::softmax 4560 18687.2 0.025755 30.3115 4.09807 0.0722947
thread0::mul_grad 8400 593414 0.011221 1251.54 70.6445 0.884324
thread0::softmax 2000 313.278 0.030751 0.679793 0.156639 0.000466857
@jczaja @jianhang-liu @tensor-tang I directly used the inference JIT softmax fwd, but it is as slow as before.
@luotao1 Actually JIT helps a little, according to @zhupengyang's test: from 17151 => 13091.
The reason it does not help much is that the sizes on ContentDNN keep changing and are small; JIT creation takes time.
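A rough sketch of why changing, small sizes hurt a JIT approach (my own structure, not Paddle's actual implementation): a jitted kernel is typically generated and cached per shape, so every new size pays the code-generation cost, which can dominate the math for small tensors.

#include <algorithm>
#include <cmath>
#include <functional>
#include <unordered_map>

using SoftmaxKernel = std::function<void(const float*, float*)>;

// Stand-in for real codegen: a true JIT would assemble machine code specialized
// for `len` here, and that generation step is the expensive part.
SoftmaxKernel GenerateJitSoftmax(int len) {
  return [len](const float* x, float* y) {
    const float mx = *std::max_element(x, x + len);
    float sum = 0.0f;
    for (int j = 0; j < len; ++j) { y[j] = std::exp(x[j] - mx); sum += y[j]; }
    for (int j = 0; j < len; ++j) y[j] /= sum;
  };
}

SoftmaxKernel GetSoftmaxKernel(int len) {
  static std::unordered_map<int, SoftmaxKernel> cache;  // one kernel per distinct size
  auto it = cache.find(len);
  if (it == cache.end()) {
    // Cache miss: with many distinct small sizes this path is hit repeatedly,
    // and the codegen time can outweigh the time saved inside the kernel.
    it = cache.emplace(len, GenerateJitSoftmax(len)).first;
  }
  return it->second;
}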
The better solution for this case is proposed in #17522.
This issue is closed by PRs #17522 and #17534 from @tensor-tang. Meanwhile, the Intel MKL-DNN team is also working on a JIT version of softmax, which will be in v1.0 (and possibly backported to v0.1x as well).
Need to optimize Softmax (fwd+bwd) for CPU training