Could you tell us the difference or improvement over the current fc implementation? Is it just for inference?
If it is a performance improvement, could you give some brief before-and-after data?
@Sand3r- any update on this?
@hshen14 As mentioned in an internal mail, Paddle's CI VM yields flawed output for some of the convolutions and fc ops. This results in insufficient accuracy to pass test_analyzer_resnet50. This is not the case on our local machines, where everything works as intended.
The problem with the FC operator is that on CI the MKL library uses the AVX2 instruction set, while it should use AVX512. We reproduced the problem locally by forcing AVX2 in MKL with the `MKL_CBWR=AVX2` environment variable.
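For reference, reproducing this locally presumably boils down to forcing the AVX2 dispatch before running the failing test (the test name here is the one mentioned above):

```bash
# Force MKL onto its AVX2 code paths to mimic the CI behaviour.
MKL_CBWR=AVX2 ctest -R test_analyzer_resnet50
```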
@luotao1 We need more information about the CI running environment. Could you send us the current CI configuration (output of the `set` and `env` commands)?
> We need more information about the CI running environment.

All the CI machines have AVX512, see https://github.com/PaddlePaddle/Paddle/issues/15032#issuecomment-451834541 and https://github.com/PaddlePaddle/Paddle/issues/15032#issuecomment-453375883.

> Could you send us the current CI configuration (output of the `set` and `env` commands)?

Do you mean `MKL_CBWR`? We don't set this environment variable, and you can `echo $MKL_CBWR` yourself. Maybe you can add the echo in `cmake_gen`: https://github.com/PaddlePaddle/Paddle/blob/41b8cf0baee3f3d5c4364a858e77c4665ede64ea/paddle/scripts/paddle_build.sh#L55
@luotao1 I noticed differences between the downloaded versions of the MKLML libraries.
CI download: Glibc225_vsErf_mklml_lnx_2019.0.1.20181227
Public PaddlePaddle: mklml_lnx_2019.0.1.20181227
These libraries do not match at the binary level. I am not sure whether the MKLML from CI supports AVX512.
> Public PaddlePaddle: mklml_lnx_2019.0.1.20181227

Public Paddle uses Glibc225_vsErf_mklml_lnx_2019.0.1.20181227 as well:
https://github.com/PaddlePaddle/Paddle/blob/decdbed054dc649abbc78e0a3405d67321a10315/cmake/external/mklml.cmake#L45

> These libraries do not match at the binary level. I am not sure whether the MKLML from CI supports AVX512.

@yinghu5 could you help take a look?
> The problem with the FC operator is that on CI the MKL library uses the AVX2 instruction set, while it should use AVX512. We reproduced the problem locally by forcing AVX2 in MKL with the `MKL_CBWR=AVX2` environment variable. @luotao1 We need more information about the CI running environment. Could you send us the current CI configuration (output of the `set` and `env` commands)?
I built the Glibc225_vsErf_mklml_lnx_2019.0.1.20181227 version based on MKL 2019.0.1. It only adds one vsErf function and keeps everything else exactly the same as the public PaddlePaddle mklml_lnx_2019.0.1.20181227. FC is supposed to use the SGEMM function, right? So they should behave the same.
@kbinias, your test indicates that FC on CI calls AVX2 code. Could you please try Glibc225_vsErf_mklml_lnx_2019.0.1.20181227 to see whether it is an MKLML issue or an MKL_CBWR issue?
@luotao1 You mentioned the machine is AVX512. Would it be possible to export MKL_VERBOSE=1 and MKLDNN_VERBOSE=1 and check the log on the CI machine in question?
> Would it be possible to export MKL_VERBOSE=1 and MKLDNN_VERBOSE=1 and check the log on the CI machine in question?

@yihuaxu You can add it yourself.
@luotao1 You probably meant @yinghu5
@yinghu5 @luotao1 http://ci.paddlepaddle.org/viewLog.html?tab=buildLog&logTab=tree&filter=debug&expand=all&buildId=71204&_focus=17055 indicates that, at least for this particular test, MKLML uses AVX2 kernels despite the machine supporting AVX512.
@Sand3r- thanks a lot for finding that Paddle's CI VM MKL is running AVX2 code. If possible, could you please also show the MKLDNN_VERBOSE log to see whether the JIT uses AVX2 or AVX512? Or is it a CI virtual machine issue? And how is mklml linked into the test code (`-L${MKLROOT}/lib -lmklml_intel -liomp5 -ldl -lpthread` by default)?
I did test on one real AVX512 machine (Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz): the mklml_intel SGEMM in Glibc225_vsErf_xxx runs with AVX512 code correctly. Please see my log below:

```
$ cp Glibc225_vsErf_mklml_lnx_2019.0.1.20181227.tgz ../SGEMM/.
$ cd ../SGEMM/
$ tar -xzvf Glibc225_vsErf_mklml_lnx_2019.0.1.20181227.tgz

yhu5@dell-r640:~/baidu_mklml/SGEMM$ make
gcc sgemm_cblas.cpp -o sgemm_cblas -I/home/yhu5/baidu_mklml/SGEMM/mklml_lnx_2019.0.1.20181227/include -L/home/yhu5/baidu_mklml/SGEMM/mklml_lnx_2019.0.1.20181227/lib/ -lmklml_intel -liomp5 -ldl -lpthread

yhu5@dell-r640:~/baidu_mklml/SGEMM$ ./sgemm_cblas 128 128 128 1
Size of Matrix A(mxk): 128 x 128 lda = 128
Size of Matrix B(kxn): 128 x 128 ldb = 128
Size of Matrix C(mxn): 128 x 128 ldc = 128
LOOP COUNT : 1
MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20180928 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.10GHz lp64 intel_thread
MKL_VERBOSE SGEMM(N,N,128,128,128,0x7ffefbd40d28,0x5560cce6e270,128,0x5560cce5e260,128,0x7ffefbd40d30,0x5560cce7e280,128) 34.27ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
MKL_VERBOSE SGEMM(N,N,128,128,128,0x7ffefbd40d28,0x5560cce6e270,128,0x5560cce5e260,128,0x7ffefbd40d30,0x5560cce7e280,128) 33.88us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
m=128,n=128,k=128 cores=16 gflops=71.08990 peak=2150.39990 efficiency=3.31%
Average time:0.059000 msecs
GFlop      :0.00419
GFlop/sec  :71.08990 GFlops

yhu5@dell-r640:~/baidu_mklml/SGEMM$ ldd ./sgemm_cblas
    linux-vdso.so.1 (0x00007ffc90920000)
    libmklml_intel.so => ./mklml_lnx_2019.0.1.20181227/lib/libmklml_intel.so (0x00007fd116270000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fd115e7f000)
    libiomp5.so => ./mklml_lnx_2019.0.1.20181227/lib/libiomp5.so (0x00007fd115aa3000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fd115884000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fd1154e6000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fd1152e2000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fd11e183000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fd1150ca000)

yhu5@dell-r640:~/baidu_mklml/SGEMM$ nm ./mklml_lnx_2019.0.1.20181227/lib/libmklml_intel.so | grep vsErf
00000000001b7300 T vsErf

yhu5@dell-r640:~/baidu_mklml/SGEMM$ nm ./mklml_lnx_2019.0.1.20181227/lib/libmklml_intel.so | grep memcpy
                 U memcpy@GLIBC_2.2.5
00000000004374e0 T mkl_serv_memcpy_s
0000000001b29e80 T mkl_serv_memcpy_unbounded_s
00000000001a56e5 T __wrap_memcpy
```
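For anyone who wants to re-run this check, a probe equivalent to the sgemm_cblas.cpp above can be just a few lines (a sketch, not the original source, which was not posted; sizes fixed to 128 as in the log, built with g++ here):

```cpp
// sgemm_cblas sketch: run one SGEMM so that MKL_VERBOSE=1 prints which
// instruction set (AVX2 vs AVX-512) the kernel dispatched to.
// Build: g++ sgemm_cblas.cpp -o sgemm_cblas -I<mklml>/include \
//        -L<mklml>/lib -lmklml_intel -liomp5 -ldl -lpthread
#include <mkl_cblas.h>

#include <cstdio>
#include <vector>

int main() {
  const int m = 128, n = 128, k = 128;
  std::vector<float> a(m * k, 1.0f), b(k * n, 1.0f), c(m * n, 0.0f);
  // C = 1.0 * A * B + 0.0 * C, row-major, no transposition.
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, 1.0f,
              a.data(), k, b.data(), n, 0.0f, c.data(), n);
  std::printf("c[0] = %.1f\n", c[0]);  // expect 128.0 for all-ones inputs
  return 0;
}
```

Running it as `MKL_VERBOSE=1 ./sgemm_cblas` should print the same kind of banner as in the log above.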
@yinghu5 Please find the log with MKL_VERBOSE=1 and MKLDNN_VERBOSE=1 here: http://ci.paddlepaddle.org/viewLog.html?buildId=67574&buildTypeId=Paddle_PrCi&tab=buildLog
I looked into the problem; here is some additional information:
Next steps: 1) Currently this PR conflicts with develop, so we cannot easily trigger new experiments without resolving the conflict (hopefully we will get to that in a couple of days). Once everything is resolved, it may be useful to make a change to use the old MKLML, e.g. the one without vsErf support.
@jczaja @Sand3r- we (Yihua and me) ran an experiment: we downloaded the Erf MKLML from CI and forced local Paddle to use it on an SKX machine with AVX512 support. All fine, no reproduction of the problem.
@luotao1 So the issue may be in the CI VM environment, which enables only AVX2?
@yihuaxu @yinghu5 How about using the official mklml lib and printing the verbose log on CI?

> How about using the official mklml lib and printing the verbose log on CI?

As far as I remember, I tried it and found it wouldn't link due to missing functions.
You can test it by reverting #15770.
I agree with @luotao1. Changes should be tested against the official MKLML even if it means reverting the vsErf-dependent code.
@yinghu5 As you can see in the log I referenced (two messages of mine up), MKLDNN_VERBOSE reports AVX512 primitive formats being used. So the platform seems to be capable of AVX512.
> How about using the official mklml lib and printing the verbose log on CI?

@luotao1 I just tried to enable MKL_VERBOSE with the official version of mklml; it shows that only AVX2 is enabled in your CI environment.
URL: http://ci.paddlepaddle.org/viewLog.html?buildId=73374&buildTypeId=Paddle_PrCi&tab=buildLog
Log: http://ci.paddlepaddle.org/downloadBuildLog.html?buildId=73374

Diff:

```diff
diff --git a/cmake/external/mklml.cmake b/cmake/external/mklml.cmake
index ae2679d..472eb7d 100644
--- a/cmake/external/mklml.cmake
+++ b/cmake/external/mklml.cmake
@@ -42,7 +42,7 @@ IF(WIN32)
 ELSE()
     # Now enable Erf function in mklml library temporarily, it will be updated as offical version later.
-    SET(MKLML_VER "Glibc225_vsErf_mklml_lnx_${TIME_VERSION}" CACHE STRING "" FORCE)
+    SET(MKLML_VER "mklml_lnx_${TIME_VERSION}" CACHE STRING "" FORCE)
     SET(MKLML_URL "http://paddlepaddledeps.cdn.bcebos.com/${MKLML_VER}.tgz" CACHE STRING "" FORCE)
     SET(MKLML_LIB ${MKLML_LIB_DIR}/libmklml_intel.so)
     SET(MKLML_IOMP_LIB ${MKLML_LIB_DIR}/libiomp5.so)
```

Log:

```
[06:10:34] : [Step 1/1] -- Performing Test AVX512F_FOUND
[06:10:35] : [Step 1/1] -- Performing Test AVX512F_FOUND - Success
<~snip~>
[06:32:06] : [Step 1/1] 569: MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20181227 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.00GHz lp64 intel_thread
[06:32:06] : [Step 1/1] 569: MKL_VERBOSE SGEMM(N,N,32,256,32,0x7ffcd7c174d8,0x7f09bd640040,32,0x7f09bd709040,32,0x7ffcd7c174e0,0x7f09bd712040,32) 27.24ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
```
We got a detection tool from the MKL-DNN engineering team to check the CPUID flags. Yihua ran the test on the CI machine, and it disclosed the root cause:

> There is a misconfiguration of CPUID leaf 7. MKL-DNN dispatches to avx512_common if it detects the AVX512F bit set and does not check for any other bits. MKL checks for the full set corresponding to either Xeon or Xeon Phi.

For example, on an SKX-capable machine, even an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, all AVX512 flags should be 1; you can see them via `lscpu` or the detection tool. Sample output:

```
$ ./leaf7_test
leaf 07H: eax=0x0 ebx=0xd39ffffb ecx=0x8, edx=0x0
AVX2     1
AVX512F  1
AVX512DQ 1
AVX512CD 1
AVX512BW 1
AVX512VL 1
```

But on the CI machine, Yihua got the result below; note that AVX512 flags such as AVX512DQ and AVX512VL are not on. The details can be downloaded from http://ci.paddlepaddle.org/downloadBuildLog.html?buildId=74563&archived=true.

```
[04:39:51] checking leaf7
[04:39:51] leaf 07H: eax=0x0 ebx=0x101d4fbb ecx=0x0, edx=0x0
[04:39:51] AVX2     1
[04:39:51] AVX512F  1
[04:39:51] AVX512DQ 0
[04:39:51] AVX512CD 1
[04:39:51] AVX512BW 0
[04:39:51] AVX512VL 0
```

Difference: https://github.com/PaddlePaddle/Paddle/pull/16077/commits/39f628ac0224fb191db61fa217bf04a6f1134ef7

So @luotao, may we contact the machine owner and Jason to see why those CPU flags were switched off? Once they are switched on, the issue should be gone.
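For completeness, the leaf7_test output above can be reproduced with a few lines of C++ (a sketch of such a probe; the actual tool from the MKL-DNN team was not published):

```cpp
// Read CPUID leaf 7, subleaf 0 and decode the EBX feature bits that MKL
// checks before enabling its full AVX512 kernels (GCC/Clang, x86 only).
#include <cpuid.h>
#include <cstdio>

int main() {
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
  std::printf("leaf 07H: eax=0x%x ebx=0x%x ecx=0x%x, edx=0x%x\n",
              eax, ebx, ecx, edx);
  std::printf("AVX2     %u\n", (ebx >> 5) & 1u);
  std::printf("AVX512F  %u\n", (ebx >> 16) & 1u);
  std::printf("AVX512DQ %u\n", (ebx >> 17) & 1u);
  std::printf("AVX512CD %u\n", (ebx >> 28) & 1u);
  std::printf("AVX512BW %u\n", (ebx >> 30) & 1u);
  std::printf("AVX512VL %u\n", (ebx >> 31) & 1u);
  return 0;
}
```

Decoding ebx=0x101d4fbb this way yields exactly the flag pattern from the CI log: AVX512F and AVX512CD set, DQ/BW/VL clear.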
@luotao1 CPU information via lscpu command.
```
[04:39:51] + lscpu
[04:39:51] Architecture:          x86_64
[04:39:51] CPU op-mode(s):        32-bit, 64-bit
[04:39:51] Byte Order:            Little Endian
[04:39:51] + echo checking leaf7
[04:39:51] CPU(s):                26
[04:39:51] + paddle/fluid/operators/jit/leaf7_test/leaf7_test
[04:39:51] On-line CPU(s) list:   0-25
[04:39:51] Thread(s) per core:    1
[04:39:51] Core(s) per socket:    1
[04:39:51] Socket(s):             26
[04:39:51] NUMA node(s):          1
[04:39:51] Vendor ID:             GenuineIntel
[04:39:51] CPU family:            6
[04:39:51] Model:                 85
[04:39:51] Model name:            Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz
[04:39:51] Stepping:              4
[04:39:51] + echo test_recognize_digits testing
[04:39:51] CPU MHz:               2000.004
[04:39:51] BogoMIPS:              4000.00
[04:39:51] + MKL_VERBOSE=1
[04:39:51] Hypervisor vendor:     KVM
[04:39:51] + MKLDNN_VERBOSE=1
[04:39:51] Virtualization type:   full
[04:39:51] L1d cache:             32K
[04:39:51] + GLOG_v=1
[04:39:51] + GLOG_logtostderr=1
[04:39:51] + ctest -R test_recognize_digits -V
[04:39:51] L1i cache:             32K
[04:39:51] L2 cache:              4096K
[04:39:51] NUMA node0 CPU(s):     0-25
[04:39:51] Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch arat fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap avx512cd xsaveopt xsavec xgetbv1
[04:39:51] checking leaf7
[04:39:51] leaf 07H: eax=0x0 ebx=0x101d4fbb ecx=0x0, edx=0x0
[04:39:51] AVX2     1
[04:39:51] AVX512F  1
[04:39:51] AVX512DQ 0
[04:39:51] AVX512CD 1
[04:39:51] AVX512BW 0
[04:39:51] AVX512VL 0
```
For details, please refer to http://ci.paddlepaddle.org/viewLog.html?buildId=74563&buildTypeId=Paddle_PrCi&tab=buildLog&_focus=23466#_state=94.
Thanks for your great help @yinghu5 @yihuaxu
@Sand3r- @jczaja We can guarantee that the CI machine (Gold 5117) can run AVX512; the reason MKL uses only AVX2 is that the CPU flags are somehow incomplete (we guess maybe BIOS? kernel version?), as shown below.
So the suggestion is to add `export MKL_DEBUG_CPU_TYPE=7` to force MKL to run with AVX512. The experiment shows it works:
Before: [screenshot of MKL_VERBOSE log]
After: [screenshot of MKL_VERBOSE log]
So you could add this export in the cmake file of this unit test to pass CI, like this:

```cmake
foreach(TEST_OP ${TEST_OPS})
  py_test_modules(${TEST_OP} MODULES ${TEST_OP} ENVS FLAGS_use_ngraph=true)
endforeach(TEST_OP)
```
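Presumably the same ENVS mechanism can carry the MKL override; a sketch (the test name here is just an example, not a prescription):

```cmake
# Sketch: force AVX512 dispatch in MKL for an affected test by passing
# the override through ENVS, mirroring the ngraph example above.
py_test_modules(test_analyzer_resnet50 MODULES test_analyzer_resnet50
                ENVS MKL_DEBUG_CPU_TYPE=7)
```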
@tensor-tang I have added the flag to CI (to paddle_build.sh), but now test_analyzer_seq_pool1 starts failing, because its accuracy check was only passing when the MKL AVX2 implementations were run. I have confirmed that by running the following commands on a local machine (properly configured, AVX512 available):

- `ctest -R test_analyzer_seq_pool1` - test fails
- `MKL_CBWR=AVX2 ctest -R test_analyzer_seq_pool1` - test passes
@luotao1 I need your approval for the CMakeLists modification so that the Paddle_CI can pass: http://ci.paddlepaddle.org/viewLog.html?buildId=87555&buildTypeId=Paddle_PrCi&tab=buildLog&_focus=26550
Apart from that, all tests seem to be passing on CI and the framework builds just fine. Could you please look into it?
> I need your approval for the CMakeLists modification so that the Paddle_CI can pass

Got it. We can review it first. @tensor-tang
Sure thing. If you need any additional explanation, feel free to ask me :).
> 51% speedup on SKX 6148

- Do you have a benchmark on an AVX2 machine, such as 2620 v3? @Sand3r-
I have requested our Validation&Benchmarking team to prepare the measurements on 2699 v4. They should be posted here soon.
- For the bert_emb768 model, did you test this PR? @jianhang-liu @yihuaxu

@jczaja has been running the test on test_analyzer_bert.
The version of MKL-DNN currently used in Paddle only supports 2- and 4-dimensional inputs. Hence, the pass that enables MKL-DNN's FC checks whether the number of dimensions is either 2 or 4.
In the BERT model used in the analyzer test, only 2 out of 74 fully connected layers were converted to MKL-DNN kernels. This indicates that the other 72 fcs had dimension counts different from 2 or 4. Hence the performance boost wasn't as significant as it was in, for example, ResNeXt-50.
However, as of MKL-DNN 1.0, 3- and 5-dimensional inputs for FC are supported, so once Paddle is updated to 1.0, I can introduce a PR that modifies the FC kernel accordingly (so that it can perform computations on 3-dim inputs as well). This will allow all FC layers from BERT to be converted to MKL-DNN FC layers.
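A minimal illustration of the dimensionality guard described above (an assumption about its shape, not the actual pass source):

```cpp
#include <cstdint>
#include <vector>

// Route an fc op to the MKL-DNN kernel only for 2- or 4-D input;
// MKL-DNN 1.0 would additionally allow 3- and 5-D.
bool FcInputSupportedByMkldnn(const std::vector<int64_t>& input_dims) {
  const auto rank = input_dims.size();
  return rank == 2 || rank == 4;
}
```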
Thanks very much for your detailed analysis! @Sand3r-

> as of MKL-DNN 1.0, 3- and 5-dimensional inputs for FC are supported

When will MKL-DNN 1.0 be released?

> This will allow all FC layers from BERT to be converted to MKL-DNN FC layers.

Could you estimate how much speedup converting all FC layers to MKL-DNN FC layers would bring?
Intel AI Benchmark Team confirmed the results.
I measured the speed-up by comparing profiling reports from resnext50 runs on HSW e5-2669; regarding the fc layer I got the following results: [profiling table]
> When will MKL-DNN 1.0 be released?

The end of Q2 was mentioned as an estimate, but I haven't heard about any hard deadlines.

> Could you estimate how much speedup converting all FC layers to MKL-DNN FC layers would bring?

According to my estimates, assuming BERT is run in an MKL-DNN environment where fc takes up 28.5% of total execution time, the model should gain an overall inference speedup of about 10%.
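As a back-of-the-envelope cross-check (my arithmetic, not from the thread), Amdahl's law with an fc share of p = 0.285 and a per-layer fc speedup of roughly 1.5x is consistent with that figure:

```latex
S_{\text{overall}} = \frac{1}{(1 - p) + p/s}
                   = \frac{1}{0.715 + 0.285/1.5}
                   \approx \frac{1}{0.905} \approx 1.10
```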
> so that it can perform computations on 3-dim inputs as well

@Sand3r- @jczaja Does #16880 solve this problem?
> so that it can perform computations on 3-dim inputs as well
>
> @Sand3r- @jczaja Does #16880 solve this problem?

Nope, that's irrelevant. In the current version of Paddle's MKL-DNN there is a check when creating an FC primitive that ensures the input is not 3-dimensional. If it is, the primitive is not created and an mkldnn_unimplemented exception is thrown.
@luotao1 As @Sand3r- mentioned, #16880 is unrelated to FC. But you are very welcome to review #16880 :)
@luotao1 @tensor-tang Is there anything else I could explain or clear up?
> 51% speedup on SKX 6148
>
> - Do you have a benchmark on an AVX2 machine, such as 2620 v3? @Sand3r-
> - For the bert_emb768 model, did you test this PR? @jianhang-liu @yihuaxu

Since in the BERT model the input dimension is not 2 or 4, it cannot get more gain from this change.
Weights transposition has been moved to the kernel now. All tests pass, and the PR is just waiting for approval so that PR_CI_CPU_Py2 (Compile) can pass.
Could you update two questions?

- https://github.com/PaddlePaddle/Paddle/pull/15226/files#r278007464 Since we have `fc_fuse_pass` and `mkldnn_placement_pass`, why do we need `fc_mkldnn_pass`? After `fc_fuse_pass` and `mkldnn_placement_pass`, we could call the mkldnn kernel of `fc_op`.
- https://github.com/PaddlePaddle/Paddle/pull/15226/files#r278005465 Since we have `PDNode *patterns::FC::operator()`, why do we need `PDNode *patterns::FCMKLDNN::operator()` again? We don't have `PDNode *patterns::xxxMKLDNN::operator()` in this file.
Done.
I am still confused: after `fc_fuse_pass` and `mkldnn_placement_pass`, we could call the mkldnn kernel of `fc_op`? Do we still need `fc_mkldnn_pass.cc`?
What happens is essentially this:

1. `mkldnn_placement_pass` runs, setting `use_mkldnn` on all existing operators. During this stage, no FC operator exists yet, because the muls and elementwise_adds haven't been merged.
2. `fc_fuse_pass` runs, merging muls and elementwise_adds into an FC operator. Even though the muls and elementwise_adds had their `use_mkldnn` attributes set to true, `fc_fuse_pass` ignores that fact and leaves the FC's `use_mkldnn` at its default value, which is `false`.
3. Thanks to that behaviour, when `fc_mkldnn_pass` is run after `fc_fuse_pass`, it can turn on MKL-DNN's FC. That way we can turn MKL-DNN's FC on and off depending on what the topology needs (see the sketch below).
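A minimal sketch of that last step (simplified for illustration; the real pass lives under paddle/fluid/framework/ir and does more bookkeeping):

```cpp
#include "paddle/fluid/framework/ir/graph.h"
#include "paddle/fluid/framework/ir/node.h"

// Essence of fc_mkldnn_pass: after fc_fuse_pass has created fused fc ops
// with use_mkldnn left at its default (false), flip the attribute on.
void ApplyFcMkldnnPass(paddle::framework::ir::Graph* graph) {
  for (auto* node : graph->Nodes()) {
    if (node->IsOp() && node->Op() && node->Op()->Type() == "fc") {
      node->Op()->SetAttr("use_mkldnn", true);
    }
  }
}
```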
Thanks for your explanation!
I see `mkldnn_placement_pass` is at the beginning of all the passes. Thus, if we call `mkldnn_placement_pass` again after `fc_fuse_pass`, that is to say, call `mkldnn_placement_pass` again before `depthwise_conv_mkldnn_pass`, could we remove `fc_mkldnn_pass`?
https://github.com/PaddlePaddle/Paddle/blob/490e746269b11fefac4e23f7fad2a096f94b7c31/paddle/fluid/inference/api/paddle_pass_builder.cc#L148-L152
> Thanks for your explanation! I see `mkldnn_placement_pass` is at the beginning of all the passes. Thus, if we call `mkldnn_placement_pass` again after `fc_fuse_pass`, that is to say, call `mkldnn_placement_pass` again before `depthwise_conv_mkldnn_pass`, could we remove `fc_mkldnn_pass`? https://github.com/PaddlePaddle/Paddle/blob/490e746269b11fefac4e23f7fad2a096f94b7c31/paddle/fluid/inference/api/paddle_pass_builder.cc#L148-L152

Well, yes, but that takes away the possibility of having this operator turned on or off on demand. As far as I recall, that was one of the desired features.
> Well, yes, but that takes away the possibility of having this operator turned on or off on demand. As far as I recall, that was one of the desired features.

Got it. If `fc_mkldnn_pass` is off by default, how about using `cfg->pass_builder()->AppendPass("mkldnn_placement_pass")` to enable `fc_mkldnn_op`?
We would save 160 lines if we could use `mkldnn_placement_pass` to enable `fc_mkldnn_op`.
> We would save 160 lines if we could use `mkldnn_placement_pass` to enable `fc_mkldnn_op`.

With just mkldnn_placement, which runs as the first pass, that would not be possible as far as I know. One easy way to do it would be to modify `fc_fuse_pass` to take the use_mkldnn flag into consideration when Paddle is built with MKL-DNN.
However, as I've mentioned, you would then no longer be able to control whether MKL-DNN's fc is enabled. Depending on the matrix sizes and on whether FC is chained with other MKL-DNN operators, it can sometimes be more beneficial to use Paddle's (non-MKL-DNN) fc.
> With just mkldnn_placement which runs as the first pass

`AppendPass("mkldnn_placement_pass")` means adding `mkldnn_placement` as the last pass.
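For reference, the suggested call from the user side would look roughly like this (a sketch, assuming the AnalysisConfig API of that Paddle version):

```cpp
#include "paddle/fluid/inference/api/paddle_analysis_config.h"

// Append mkldnn_placement_pass so it runs after fc_fuse_pass, letting the
// fused fc op pick up use_mkldnn without a dedicated fc_mkldnn_pass.
paddle::AnalysisConfig cfg;
cfg.EnableMKLDNN();
cfg.pass_builder()->AppendPass("mkldnn_placement_pass");
```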
@tensor-tang @jianhang-liu What do you think about https://github.com/PaddlePaddle/Paddle/pull/15226#issuecomment-492987622?
I just wanted to mention that the pattern detector for the FC operator will need to appear anyway when implementing the int8 version, because it is necessary for quantization using the INT8 V2 mechanism.
Introduces the Fully Connected operator. Its performance is comparable to the reference version, but it needs to be merged so that the base for the int8 implementation and its performance gains is established.
Update 05.04.2019: Compared to the reference FC, the MKL-DNN version provides:

- 51% speedup on SKX 6148