PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of PaddlePaddle ("飞桨"): high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

[MKL-DNN] Fully Connected #15226

Closed Sand3r- closed 4 years ago

Sand3r- commented 5 years ago

Introduces the MKL-DNN Fully Connected operator. Its performance is comparable to the reference version, but it needs to be merged to establish the base for the int8 implementation and its performance gains.

Update 05.04.2019: Compared to a reference FC, the MKL-DNN version provides:

- 51% speedup on SKX 6148

tensor-tang commented 5 years ago

Could you tell us the difference or improvement over the current fc implementation?

Is it just for inference?

If it is a performance improvement, could you give some brief before/after data?

hshen14 commented 5 years ago

@Sand3r- any update on this?

Sand3r- commented 5 years ago

@Sand3r- any update on this?

@hshen14 As mentioned in an internal mail, Paddle's CI VM yields flawed output for some of the convolutions and fc. This results in insufficient accuracy for test_analyzer_resnet50 to pass. This does not happen on any of our local machines, where everything works as intended.


kbinias commented 5 years ago

The problem with the FC operator is that on CI the MKL library uses the AVX2 instruction set, while it should use AVX512. We reproduced the problem locally by forcing AVX2 in MKL via the MKL_CBWR=AVX2 environment setting. @luotao1 We need more information about the CI running environment. Could you send us info about the current CI configuration (the output of the commands set and env)?
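
For reference, a minimal sketch of that local reproduction (assuming a configured Paddle build directory where the analyzer tests are registered with ctest):

```bash
# Force MKL's conditional numerical reproducibility (CNR) mode down to AVX2
# kernels, mimicking what we observe on CI:
export MKL_CBWR=AVX2
ctest -R test_analyzer_resnet50 -V   # fails: accuracy drops below the threshold

# Clear the override so MKL auto-dispatches (AVX512 on a properly exposed SKX):
unset MKL_CBWR
ctest -R test_analyzer_resnet50 -V   # passes
```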

luotao1 commented 5 years ago

We need more information about CI running environment.

All the CI machines have AVX512, see https://github.com/PaddlePaddle/Paddle/issues/15032#issuecomment-451834541 and https://github.com/PaddlePaddle/Paddle/issues/15032#issuecomment-453375883

Could you send us info about current CI configuration (results of commands: set and env)

Do you mean MKL_CBWR? We don't set this environment variable, and you can echo $MKL_CBWR yourself. Maybe you can add the echo in cmake_gen: https://github.com/PaddlePaddle/Paddle/blob/41b8cf0baee3f3d5c4364a858e77c4665ede64ea/paddle/scripts/paddle_build.sh#L55
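
A hypothetical one-liner for that check (the exact placement inside paddle_build.sh is illustrative):

```bash
# Print the CI value of MKL_CBWR, showing "<unset>" when it is not defined:
echo "MKL_CBWR='${MKL_CBWR:-<unset>}'"
```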

kbinias commented 5 years ago

@luotao1 I noticed differences between the downloaded versions of the MKLML libraries:

- CI download: Glibc225_vsErf_mklml_lnx_2019.0.1.20181227
- Public PaddlePaddle: mklml_lnx_2019.0.1.20181227

These libraries do not match at the binary level. I am not sure whether the MKLML from CI supports AVX512.
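
One way such a binary-level comparison can be done (an illustrative sketch; it assumes the two archives are unpacked into separate directories named after their tarballs):

```bash
# Differing checksums confirm the two libmklml_intel.so files are not identical builds:
md5sum Glibc225_vsErf_mklml_lnx_2019.0.1.20181227/lib/libmklml_intel.so \
       mklml_lnx_2019.0.1.20181227/lib/libmklml_intel.so
```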

luotao1 commented 5 years ago

Public PaddlePaddle: mklml_lnx_2019.0.1.20181227

https://github.com/PaddlePaddle/Paddle/blob/decdbed054dc649abbc78e0a3405d67321a10315/cmake/external/mklml.cmake#L45 Public Paddle uses Glibc225_vsErf_mklml_lnx_2019.0.1.20181227 as well.

These libraries do not match at the binary level. I am not sure if MKLML from CI support AVX512.

@yinghu5 could you help see it?

yinghu5 commented 5 years ago

The problem with the FC operator is that on CI the MKL library uses the AVX2 instruction set, while it should use AVX512. We reproduced the problem locally by forcing AVX2 in MKL via the MKL_CBWR=AVX2 environment setting. @luotao1 We need more information about the CI running environment. Could you send us info about the current CI configuration (the output of the commands set and env)?

I built the Glibc225_vsErf_mklml_lnx_2019.0.1.20181227 version based on MKL 2019.0.1. It only adds the vsErf function and keeps everything else exactly the same as mklml_lnx_2019.0.1.20181227. FC is supposed to use the SGEMM function, right? So they should behave the same.

@kbinias, your test shows (or implies) that the CI FC is calling AVX2 code. Could you please try Glibc225_vsErf_mklml_lnx_2019.0.1.20181227 and see whether it is an MKLML issue or an MKL CBWR issue?

@luotao1 You mentioned the machine is AVX512-capable. Would it be possible to export MKL_VERBOSE=1 and MKLDNN_VERBOSE=1 and check the log on the CI machine in question?
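
A sketch of the requested verbose run (assuming the test binaries are already built):

```bash
# Both libraries print which ISA they dispatch to when these are set:
export MKL_VERBOSE=1      # MKL logs e.g. "Intel(R) AVX-512 ... enabled processors"
export MKLDNN_VERBOSE=1   # MKL-DNN logs the JIT ISA per executed primitive
ctest -R test_analyzer_resnet50 -V 2>&1 | grep -iE 'mkl_verbose|mkldnn_verbose'
```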

luotao1 commented 5 years ago

could it possible to export MKL_VERBOSE=1 and MKLDNN_VERBOSE=1 to check the log on questioned CI machine

@yihuaxu You can add it yourself.

kbinias commented 5 years ago

@yihuaxu You can add it yourself.

@luotao1 You probably meant @yinghu5

Sand3r- commented 5 years ago

@yinghu5 @luotao1 http://ci.paddlepaddle.org/viewLog.html?tab=buildLog&logTab=tree&filter=debug&expand=all&buildId=71204&_focus=17055 indicates that, at least for this particular test, MKLML uses AVX2 kernels despite the machine supporting AVX512.

yinghu5 commented 5 years ago

@Sand3r- Thanks a lot for finding that MKL on Paddle's CI VM is running AVX2 code. If possible, could you please also show the MKLDNN_VERBOSE log, to see whether the JIT uses AVX2 or AVX512?

Or is it a CI virtual machine issue? And how is mklml linked into the test code? (-L${MKLROOT}/lib -lmklml_intel -liomp5 -ldl -lpthread by default)

I tested on a real AVX512 machine (Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz); the mklml_intel SGEMM in Glibc225_vsErf_mklml_lnx_2019.0.1.20181227 runs with AVX512 code correctly. Please see my log below:

```
cp Glibc225_vsErf_mklml_lnx_2019.0.1.20181227.tgz ../SGEMM/.
cd ../SGEMM/
tar -xzvf Glibc225_vsErf_mklml_lnx_2019.0.1.20181227.tgz

yhu5@dell-r640:~/baidu_mklml/SGEMM$ make
gcc sgemm_cblas.cpp -o sgemm_cblas -I/home/yhu5/baidu_mklml/SGEMM/mklml_lnx_2019.0.1.20181227/include -L/home/yhu5/baidu_mklml/SGEMM/mklml_lnx_2019.0.1.20181227/lib/ -lmklml_intel -liomp5 -ldl -lpthread

yhu5@dell-r640:~/baidu_mklml/SGEMM$ ./sgemm_cblas 128 128 128 1
Size of Matrix A(mxk): 128 x 128 lda = 128
Size of Matrix B(kxn): 128 x 128 ldb = 128
Size of Matrix C(mxn): 128 x 128 ldc = 128
LOOP COUNT : 1
MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20180928 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.10GHz lp64 intel_thread
MKL_VERBOSE SGEMM(N,N,128,128,128,0x7ffefbd40d28,0x5560cce6e270,128,0x5560cce5e260,128,0x7ffefbd40d30,0x5560cce7e280,128) 34.27ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
MKL_VERBOSE SGEMM(N,N,128,128,128,0x7ffefbd40d28,0x5560cce6e270,128,0x5560cce5e260,128,0x7ffefbd40d30,0x5560cce7e280,128) 33.88us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
m=128,n=128,k=128 cores=16 gflops=71.08990 peak=2150.39990 efficiency=3.31%
Average time:0.059000 msecs
GFlop      :0.00419
GFlop/sec  :71.08990 GFlops

yhu5@dell-r640:~/baidu_mklml/SGEMM$ ldd ./sgemm_cblas
        linux-vdso.so.1 (0x00007ffc90920000)
        libmklml_intel.so => ./mklml_lnx_2019.0.1.20181227/lib/libmklml_intel.so (0x00007fd116270000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fd115e7f000)
        libiomp5.so => ./mklml_lnx_2019.0.1.20181227/lib/libiomp5.so (0x00007fd115aa3000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fd115884000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fd1154e6000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fd1152e2000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fd11e183000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fd1150ca000)

yhu5@dell-r640:~/baidu_mklml/SGEMM$ nm ./mklml_lnx_2019.0.1.20181227/lib/libmklml_intel.so | grep vsErf
00000000001b7300 T vsErf

yhu5@dell-r640:~/baidu_mklml/SGEMM$ nm ./mklml_lnx_2019.0.1.20181227/lib/libmklml_intel.so | grep memcpy
                 U memcpy@GLIBC_2.2.5
00000000004374e0 T mkl_serv_memcpy_s
0000000001b29e80 T mkl_serv_memcpy_unbounded_s
00000000001a56e5 T __wrap_memcpy
```

jczaja commented 5 years ago

@yinghu5

Please find the log with MKL_VERBOSE=1 and MKLDNN_VERBOSE=1: http://ci.paddlepaddle.org/viewLog.html?buildId=67574&buildTypeId=Paddle_PrCi&tab=buildLog

I looked into that problem, so here is some additional information:

Next steps: 1) Currently this PR conflicts with develop, so we cannot easily trigger new experiments without resolving the conflict (hopefully we will get to that in a couple of days). Once everything is resolved, it may be useful to make a change to use the old MKLML, e.g. the one without vsErf support.

yinghu5 commented 5 years ago

@jczaja @Sand3r- We (Yihua and I) ran an experiment: we downloaded the Erf MKLML from CI and forced a local Paddle build to use it on an SKX machine with AVX512 support. Everything was fine; the problem did not reproduce.
@luotao1 So could the issue be in the CI VM environment, which only enables AVX2?

luotao1 commented 5 years ago

@yihuaxu @yinghu5 How about using the official mklml lib and printing the verbose log on CI?

Sand3r- commented 5 years ago

@yihuaxu @yinghu5 How about using the official mklml lib and printing the verbose log on CI?

As far as I remember, I tried it and found it wouldn't link, due to missing functions.
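
That missing-symbol theory is easy to check against the official package (an illustrative sketch; nm comes from binutils):

```bash
# The vsErf symbol exists only in the patched Glibc225_vsErf build, which is
# why linking Paddle's Erf-dependent code against the official mklml fails:
nm -D libmklml_intel.so | grep vsErf || echo "vsErf not exported"
```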

luotao1 commented 5 years ago

You can test by reverting #15770.

jczaja commented 5 years ago

I agree with @luotao1. The changes should be tested against the official MKLML, even if that means reverting the vsErf-dependent code.

@yinghu5 As you can see in the log I referenced (two messages of mine up), MKLDNN_VERBOSE reports AVX512 primitive formats being used, so the platform does seem to be capable of AVX512.

yihuaxu commented 5 years ago

@yihuaxu @yinghu5 How about using the official mklml lib and printing the verbose log on CI?

@luotao1 I just tried enabling MKL_VERBOSE with the official version of mklml; it shows that only AVX2 is enabled in your CI environment.

URL: http://ci.paddlepaddle.org/viewLog.html?buildId=73374&buildTypeId=Paddle_PrCi&tab=buildLog
Log: http://ci.paddlepaddle.org/downloadBuildLog.html?buildId=73374

Diff:

```diff
diff --git a/cmake/external/mklml.cmake b/cmake/external/mklml.cmake
index ae2679d..472eb7d 100644
--- a/cmake/external/mklml.cmake
+++ b/cmake/external/mklml.cmake
@@ -42,7 +42,7 @@ IF(WIN32)
 ELSE()
     #TODO(intel-huying):
     #  Now enable Erf function in mklml library temporarily, it will be updated as offical version later.
-    SET(MKLML_VER "Glibc225_vsErf_mklml_lnx_${TIME_VERSION}" CACHE STRING "" FORCE)
+    SET(MKLML_VER "mklml_lnx_${TIME_VERSION}" CACHE STRING "" FORCE)
     SET(MKLML_URL "http://paddlepaddledeps.cdn.bcebos.com/${MKLML_VER}.tgz" CACHE STRING "" FORCE)
     SET(MKLML_LIB ${MKLML_LIB_DIR}/libmklml_intel.so)
     SET(MKLML_IOMP_LIB ${MKLML_LIB_DIR}/libiomp5.so)
```

Log:

```
[06:10:34] : [Step 1/1] -- Performing Test AVX512F_FOUND
[06:10:35] : [Step 1/1] -- Performing Test AVX512F_FOUND - Success
<~snip~>
[06:32:06] : [Step 1/1] 569: MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20181227 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.00GHz lp64 intel_thread
[06:32:06] : [Step 1/1] 569: MKL_VERBOSE SGEMM(N,N,32,256,32,0x7ffcd7c174d8,0x7f09bd640040,32,0x7f09bd709040,32,0x7ffcd7c174e0,0x7f09bd712040,32) 27.24ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
```

yinghu5 commented 5 years ago

We got a detection tool from the MKL-DNN engineering team to inspect the CPUID flags. Yihua ran it on the CI machine, and it disclosed the root cause:
There is a misconfiguration of CPUID leaf 7. MKL-DNN dispatches to avx512_common if it detects the AVX512F bit set and does not check any other bits; MKL checks for the full set corresponding to either Xeon or Xeon Phi.

For example, on an SKX-capable machine, even an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, all the AVX512 flags should be 1. You can see them via lscpu or the detection tool; sample output:

```
$ ./leaf7_test
leaf 07H: eax=0x0 ebx=0xd39ffffb ecx=0x8, edx=0x0
AVX2     1
AVX512F  1
AVX512DQ 1
AVX512CD 1
AVX512BW 1
AVX512VL 1
```
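
A quick cross-check of the same bits as the kernel reports them (assuming util-linux's lscpu is available):

```bash
# On a fully configured SKX server this should print:
#   avx512bw avx512cd avx512dq avx512f avx512vl
lscpu | grep -o 'avx512[a-z]*' | sort -u | xargs
```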

But on the CI machine, Yihua got the result below; note that AVX512 flags such as AVX512DQ and AVX512VL are not set. The details can be downloaded from http://ci.paddlepaddle.org/downloadBuildLog.html?buildId=74563&archived=true.

Log:

```
[04:39:51] checking leaf7
[04:39:51] leaf 07H: eax=0x0 ebx=0x101d4fbb ecx=0x0, edx=0x0
[04:39:51] AVX2     1
[04:39:51] AVX512F  1
[04:39:51] AVX512DQ 0
[04:39:51] AVX512CD 1
[04:39:51] AVX512BW 0
[04:39:51] AVX512VL 0
```

Difference: https://github.com/PaddlePaddle/Paddle/pull/16077/commits/39f628ac0224fb191db61fa217bf04a6f1134ef7

So @luotao1, may we contact the machine owner and Jason to see why those CPU flags were switched off? Once they are switched on, the issue should be gone.

yihuaxu commented 5 years ago

We got a detection tool from the MKL-DNN engineering team to inspect the CPUID flags. Yihua ran it on the CI machine, and it disclosed the root cause: [...]

@luotao1 CPU information via the lscpu command:

```
[04:39:51] + lscpu
[04:39:51] Architecture:          x86_64
[04:39:51] CPU op-mode(s):        32-bit, 64-bit
[04:39:51] Byte Order:            Little Endian
[04:39:51] + echo checking leaf7
[04:39:51] CPU(s):                26
[04:39:51] + paddle/fluid/operators/jit/leaf7_test/leaf7_test
[04:39:51] On-line CPU(s) list:   0-25
[04:39:51] Thread(s) per core:    1
[04:39:51] Core(s) per socket:    1
[04:39:51] Socket(s):             26
[04:39:51] NUMA node(s):          1
[04:39:51] Vendor ID:             GenuineIntel
[04:39:51] CPU family:            6
[04:39:51] Model:                 85
[04:39:51] Model name:            Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz
[04:39:51] Stepping:              4
[04:39:51] + echo test_recognize_digits testing
[04:39:51] CPU MHz:               2000.004
[04:39:51] BogoMIPS:              4000.00
[04:39:51] + MKL_VERBOSE=1
[04:39:51] Hypervisor vendor:     KVM
[04:39:51] + MKLDNN_VERBOSE=1
[04:39:51] Virtualization type:   full
[04:39:51] L1d cache:             32K
[04:39:51] + GLOG_v=1
[04:39:51] + GLOG_logtostderr=1
[04:39:51] + ctest -R test_recognize_digits -V
[04:39:51] L1i cache:             32K
[04:39:51] L2 cache:              4096K
[04:39:51] NUMA node0 CPU(s):     0-25
[04:39:51] Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch arat fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap avx512cd xsaveopt xsavec xgetbv1
[04:39:51] checking leaf7
[04:39:51] leaf 07H: eax=0x0 ebx=0x101d4fbb ecx=0x0, edx=0x0
[04:39:51] AVX2     1
[04:39:51] AVX512F  1
[04:39:51] AVX512DQ 0
[04:39:51] AVX512CD 1
[04:39:51] AVX512BW 0
[04:39:51] AVX512VL 0
```

For details, please refer to http://ci.paddlepaddle.org/viewLog.html?buildId=74563&buildTypeId=Paddle_PrCi&tab=buildLog&_focus=23466#_state=94.

tensor-tang commented 5 years ago

Thanks for your great help @yinghu5 @yihuaxu.

@Sand3r- @jczaja As we can guarantee that the CI machine (Gold 5117) can run AVX512, the reason MKL can only use AVX2 is that the CPU flags are somehow incomplete (we guess maybe BIOS? kernel version?), as shown in the image below. [image]

So the suggestion is to add export MKL_DEBUG_CPU_TYPE=7 to force MKL to run with AVX512. The experiment shows it works. Before: [image]

After: [image]

So you could add this export in the cmake file of this unit test to pass CI, like this:

```cmake
foreach(TEST_OP ${TEST_OPS})
    py_test_modules(${TEST_OP} MODULES ${TEST_OP} ENVS FLAGS_use_ngraph=true)
endforeach(TEST_OP)
```
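
For a quick manual run outside CMake, a shell-level sketch of the same workaround (MKL_DEBUG_CPU_TYPE is an undocumented MKL dispatch override; per the experiment above, the value 7 selects the AVX512 code path on this MKL version):

```bash
# Override MKL's CPU detection so it dispatches AVX512 kernels even though
# the VM exposes an incomplete CPUID leaf 7:
export MKL_DEBUG_CPU_TYPE=7
ctest -R test_analyzer_resnet50 -V
```
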
Sand3r- commented 5 years ago

@tensor-tang I have added the flag to CI (to paddle_build.sh), but now test_analyzer_seq_pool1 starts failing, because its accuracy check only passed when MKL's AVX2 implementations were run. I have confirmed that by running the following commands on a local machine (properly configured, AVX512 available):

```
ctest -R test_analyzer_seq_pool1                 # test fails
MKL_CBWR=AVX2 ctest -R test_analyzer_seq_pool1   # test passes
```

Sand3r- commented 5 years ago

@luotao1 I need your approval for the modification of the CMakeLists so that Paddle_CI can pass: http://ci.paddlepaddle.org/viewLog.html?buildId=87555&buildTypeId=Paddle_PrCi&tab=buildLog&_focus=26550

Apart from that, all tests seem to be passing on CI and the framework builds just fine. Could you please look into it?

luotao1 commented 5 years ago

I need your approval for the modification of the CMakeLists so that Paddle_CI can pass

Got it. We can review it first. @tensor-tang

Sand3r- commented 5 years ago

Got it. We can review it first. @tensor-tang

Sure thing. If you need any additional explanation, feel free to ask me :).

luotao1 commented 5 years ago

51% speedup on SKX 6148

  1. Do you have a benchmark on an AVX2 machine, such as the 2620 v3? @Sand3r-
  2. For the bert_emb768 model, did you test this PR? @jianhang-liu @yihuaxu
Sand3r- commented 5 years ago
  1. Do you have a benchmark on an AVX2 machine, such as the 2620 v3? @Sand3r-

I have requested our Validation & Benchmarking team to prepare measurements on a 2699 v4. They should be posted here soon.

Sand3r- commented 5 years ago
  1. For the bert_emb768 model, did you test this PR? @jianhang-liu @yihuaxu

@jczaja has been running the test on test_analyzer_bert.

The version of MKL-DNN currently used in Paddle supports only 2- and 4-dimensional inputs for FC. Hence, the pass that enables MKL-DNN's FC checks whether the number of dimensions is either 2 or 4.

In the BERT model used by the analyzer test, 2 out of 74 fully connected layers were converted to MKL-DNN kernels. This indicates that the other 72 fcs had dimension counts other than 2 or 4. Hence the performance boost wasn't as significant as it was in, for example, ResNeXt-50.

However, as of MKL-DNN 1.0, 3- and 5-dimensional inputs for FC are supported, so once Paddle is updated to 1.0, I can introduce a PR that modifies the FC kernel accordingly (so that it can perform computations on 3-dim inputs as well). This will allow all FC layers in BERT to be converted to MKL-DNN FC layers.

luotao1 commented 5 years ago

Thanks very much for your detailed analysis! @Sand3r-

as of MKL-DNN 1.0, 3- and 5-dimensional inputs for FC are supported

When will MKL-DNN 1.0 be released?

This will allow all FC layers in BERT to be converted to MKL-DNN FC layers.

Could you estimate how much speedup converting all FC layers to MKL-DNN FC layers would give?

ghost commented 5 years ago

Intel AI Benchmark Team confirmed the results.

I measured the speed-up by comparing profiling reports from resnext50 runs on HSW-e5-2669 and got the following results for the fc layer:

Sand3r- commented 5 years ago

When will MKL-DNN 1.0 be released?

The end of Q2 was mentioned as an estimate, but I haven't heard about any hard deadlines.

Could you estimate how much speedup converting all FC layers to MKL-DNN FC layers would give?

According to my estimates, assuming BERT runs in an MKL-DNN environment where fc takes up 28.5% of total execution time, the model should gain an overall inference speedup of about 10%.
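
A rough Amdahl-style check of that number, assuming the fc portion itself speeds up by about the 51% measured earlier on SKX 6148:

$$
S_{\text{overall}} = \frac{1}{(1 - 0.285) + \frac{0.285}{1.51}} \approx \frac{1}{0.715 + 0.189} \approx 1.11,
$$

i.e. roughly a 10% end-to-end inference speedup.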

luotao1 commented 5 years ago

so that it can perform computations on 3-dim inputs as well

@Sand3r- @jczaja Does #16880 solve this problem?

Sand3r- commented 5 years ago

so that it can perform computations on 3-dim inputs as well

@Sand3r- @jczaja Does #16880 solve this problem?

Nope, that's unrelated. In the current version of Paddle's MKL-DNN integration there is a check when creating an FC primitive that ensures the input is not 3-dimensional. If it is, the primitive is not created and an mkldnn_unimplemented exception is thrown.

jczaja commented 5 years ago

@luotao1 As @Sand3r- mentioned, #16880 is unrelated to FC. But you are very much welcome to review #16880 :)

Sand3r- commented 5 years ago

@luotao1 @tensor-tang Is there anything else I could explain or clear up?

yihuaxu commented 5 years ago

51% speedup on SKX 6148

  1. Do you have a benchmark on an AVX2 machine, such as the 2620 v3? @Sand3r-
  2. For the bert_emb768 model, did you test this PR? @jianhang-liu @yihuaxu

Since the input dimensions in the bert model are not 2 or 4, it cannot get more gain from this change.

Sand3r- commented 5 years ago

The weights transposition has been moved into the kernel now. All tests pass, and the PR just awaits approval so that PR_CI_CPU_Py2 (Compile) can pass.

Sand3r- commented 5 years ago

Could you answer two questions?

  1. https://github.com/PaddlePaddle/Paddle/pull/15226/files#r278007464 Since we have fc_fuse_pass and mkldnn_placement_pass, why do we need fc_mkldnn_pass? After fc_fuse_pass and mkldnn_placement_pass, we could call the mkldnn kernel of fc_op.
  2. https://github.com/PaddlePaddle/Paddle/pull/15226/files#r278005465 Since we have PDNode *patterns::FC::operator(), why do we need PDNode *patterns::FCMKLDNN::operator() again? We don't have PDNode *patterns::xxxMKLDNN::operator() elsewhere in this file.

Done.

luotao1 commented 5 years ago

I am still confused: after fc_fuse_pass and mkldnn_placement_pass, couldn't we call the mkldnn kernel of fc_op directly? Do we still need fc_mkldnn_pass.cc?

Sand3r- commented 5 years ago

What happens is essentially this:

  1. The MKL-DNN placement pass runs and applies use_mkldnn to all existing operators. At this stage no FC operator exists yet, because the muls and elementwise_adds haven't been merged.
  2. (some other passes run)
  3. fc_fuse_pass runs, merging muls and elementwise_adds into an FC operator. Even though the muls and elementwise_adds had their use_mkldnn attributes set to true, fc_fuse_pass ignores that fact and leaves the new operator's use_mkldnn at its default value, which is false.

Thanks to that behaviour, when fc_mkldnn_pass runs after fc_fuse_pass, it can turn on MKL-DNN's FC. That way we can turn MKL-DNN's FC on and off depending on the topology's needs.

luotao1 commented 5 years ago

Thanks for your explanation! I see that mkldnn_placement_pass is at the beginning of all the passes. Thus, if we called mkldnn_placement_pass again after fc_fuse_pass (that is, again before depthwise_conv_mkldnn_pass), could we remove fc_mkldnn_pass? https://github.com/PaddlePaddle/Paddle/blob/490e746269b11fefac4e23f7fad2a096f94b7c31/paddle/fluid/inference/api/paddle_pass_builder.cc#L148-L152

Sand3r- commented 5 years ago

Thanks for your explanation! I see that mkldnn_placement_pass is at the beginning of all the passes. Thus, if we called mkldnn_placement_pass again after fc_fuse_pass (that is, again before depthwise_conv_mkldnn_pass), could we remove fc_mkldnn_pass? https://github.com/PaddlePaddle/Paddle/blob/490e746269b11fefac4e23f7fad2a096f94b7c31/paddle/fluid/inference/api/paddle_pass_builder.cc#L148-L152

Well, yes, but that would take away the possibility of turning this operator on or off on demand. As far as I recall, that was one of the desired features.

luotao1 commented 5 years ago

Well, yes, but that would take away the possibility of turning this operator on or off on demand. As far as I recall, that was one of the desired features.

Got it. If fc_mkldnn_pass is off by default, how about using cfg->pass_builder()->AppendPass("mkldnn_placement_pass") to enable the fc mkldnn op?

luotao1 commented 5 years ago

[image] We would save 160 lines if we could use mkldnn_placement_pass to enable fc_mkldnn_op.

Sand3r- commented 5 years ago

We would save 160 lines if we could use mkldnn_placement_pass to enable fc_mkldnn_op.

With just mkldnn_placement, which runs as the first pass, that would not be possible as far as I know. One easy way to do it would be to modify fc_fuse_pass to take the use_mkldnn flag into consideration when Paddle is built with MKL-DNN.

However, as I've mentioned, you would then no longer be able to control whether MKL-DNN's fc is enabled. Depending on matrix sizes and on whether FC is chained with other MKL-DNN operators, it might sometimes be more beneficial to use Paddle's (non-MKL-DNN) fc.

luotao1 commented 5 years ago

With just mkldnn_placement, which runs as the first pass

AppendPass("mkldnn_placement_pass") means adding mkldnn_placement as the last pass.

luotao1 commented 5 years ago

@tensor-tang @jianhang-liu What do you think about https://github.com/PaddlePaddle/Paddle/pull/15226#issuecomment-492987622?

Sand3r- commented 5 years ago

I just wanted to mention that the pattern detector for the FC operator will be needed anyway when implementing the int8 version, because it is necessary for quantization using the INT8 v2 mechanism.