Closed kpuatamazon closed 4 years ago
Thanks for raising the issue, @kpuatamazon. We plan to optimize the dot and batch_dot operator with DNNL MatMul primitive. But please note that, the performance of MatMul primitive was not optimized until v1.3 which was released last week. That's why we didn't integrate the primitive at the first time when it's introduced in v1.2.
When compiled with MKL, MXNet should call MKL directly from FullyConnected like it already does for dot and batch_dot.
As mentioned above, dot and batch_dot will also be optimized with DNNL. And as DNNL is more dedicated on deep learning, we will consider DNNL with high priority when both DNNL and MKL are enabled in compilation.
Please provide a simple reproducer if you find FullyConnected based on DNNL is slower than it based on MKL BLAS.
Here's some benchmarks on 2fff11d4233814aa4ad07858779338090ec2132d (current tip of master) with the same Skylake c5.9xlarge. DNNL is substantially slower than MKL.
(Which DNNL is in master?)
#!/usr/bin/env python3
import mxnet as mx
import time
def time_procedure(shape, count, proc):
rows, inner, cols = shape
a = mx.nd.random_uniform(shape=(rows, inner), low=-1.0, high=1.0)
b = mx.nd.random_uniform(shape=(cols, inner), low=-1.0, high=1.0)
# Burn in
proc(a, b, cols)
mx.nd.waitall()
begin = time.time()
for i in range(0, count):
proc(a, b, cols)
mx.nd.waitall()
return (time.time() - begin) / count
shapes = [(5, 512, 512), (5,512,1536), (5,512,2048), (5,2048,512), (4,512,512)]
count = 1000
procedures = {
"fullyconnected (DNNL)" : (lambda a, b, cols : mx.nd.FullyConnected(a, b, no_bias=True, num_hidden=cols)),
"dot (MKL)" : (lambda a, b, cols : mx.nd.dot(a, b, transpose_b = True))
}
for s in shapes:
print("Shape " + str(s))
stats = {}
for name, l in procedures.items():
stats[name] = time_procedure(s, count, l)
print("{:.7f} seconds for {}".format(stats[name], name))
Run as OMP_NUM_THREADS=4 ./mult_bench.py
:
Shape (5, 512, 512)
0.0000961 seconds for fullyconnected (DNNL)
0.0000509 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0002011 seconds for fullyconnected (DNNL)
0.0000735 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0002521 seconds for fullyconnected (DNNL)
0.0001027 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0003569 seconds for fullyconnected (DNNL)
0.0001018 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000946 seconds for fullyconnected (DNNL)
0.0000496 seconds for dot (MKL)
I don't really mind what the default BLAS implementation is. But choosing to use MKL should be 1) atomic and 2) not require undocumented compilation options.
I also tried the latest DNNL:
cd 3rdparty
rm -rf mkldnn
git clone https://github.com/oneapi-src/oneDNN mkldnn
cd ../build && ninja -j 32
and DNNL is still slower:
Shape (5, 512, 512)
0.0000875 seconds for fullyconnected (DNNL)
0.0000507 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0002005 seconds for fullyconnected (DNNL)
0.0000729 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0002516 seconds for fullyconnected (DNNL)
0.0000974 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0003564 seconds for fullyconnected (DNNL)
0.0000981 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000928 seconds for fullyconnected (DNNL)
0.0000497 seconds for dot (MKL)
Thank you @kpuatamazon. I will try to reproduce the performance issue and get back to you later.
(Which DNNL is in master?)
It should be v1.2.2 but was down graded to v1.1.2 by https://github.com/apache/incubator-mxnet/pull/17084 (maybe by accident).
But choosing to use MKL should be 1) atomic and 2) not require undocumented compilation options.
Not sure I understand what do you mean. But using MKL BLAS is documented in:
This still uses MKLDNN for FullyConnected
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DBLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON ..
I was able to get MKL to run by disabling MKLDNN entirely:
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DBLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON -DUSE_MKLDNN=OFF ..
But then I lose other kernels like DNNL softmax and actually Sockeye randomly crashes:
mxnet.base.MXNetError: MXNetError: Out of range value for value, value='inf', in operator _full(name="", dtype="float32", value="inf", ctx="cpu(0)", shape="(5, 1)")
(bizarrely this happens after it translated many sentences with that same code.)
How do I achieve the fastest combination of DNNL softmax and MKL matrix multiply for FullyConnected
using only documented options?
Yes, USE_MKLDNN
is ON by default on Linux. See https://github.com/apache/incubator-mxnet/blob/master/CMakeLists.txt#L44. If want to disable DNNL optimizations, you need set USE_MKLDNN
to OFF explicitly in the cmake line.
How do I achieve the fastest combination of DNNL softmax and MKL matrix multiply for FullyConnected using only documented options?
I don't think it's possible at this moment. I think the real question is how to improve the performance of FullyConnected when DNNL is used. In fact, I see the performance of DNNL primitive kernel is good, compared with the output of the script. Take the last shape as an example (on my machine, not ec2):
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb4ic512oc512,0.0490723
0.0001380 seconds for fullyconnected (DNNL)
0.0000556 seconds for dot (MKL)
You can set DNNL_VERBOSE=1 when running the script to get the verbose output.
I don't think it's possible at this moment.
Can we please have a build option to force MKL usage for all GEMM calls? This is a pretty big performance regression.
I think the real question is how to improve the performance of FullyConnected when DNNL is used.
Hmmm, when DNNL delegates to MKL, that happens internally to DNNL, right? If the MXNet wrapper is the problem, then why is delegating to MKL much faster?
Applying my workaround:
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
ninja -j 30
export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
export OMP_NUM_THREADS=4
./mult_bench.py #NB I edited the script to label things more clearly.
Output:
Shape (5, 512, 512)
0.0000628 seconds for fullyconnected (MKL workaround)
0.0000488 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0000816 seconds for fullyconnected (MKL workaround)
0.0000661 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0001109 seconds for fullyconnected (MKL workaround)
0.0000957 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0001078 seconds for fullyconnected (MKL workaround)
0.0000954 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000501 seconds for fullyconnected (MKL workaround)
0.0000434 seconds for dot (MKL)
There is something less efficient about FullyConnected
than dot
but it's nowhere near explaining DNNL's slowdown.
FYI I usually wear this hat (as opposed to @kpu) on Mondays so expect slower responses on other days.
@emfomenk What's -D_DNNL_USE_MKL=FULL
for? I'm not aware of this option and guessing it's not a user facing option.
@TaoLv, it is a developer's option to enable Intel MKL gemm back (instead of the jitted one). I would suggest to avoid using it, as it internal one (marked with leading underscore). I think it was even broken somewhere in v1.1 to v1.3...
@emfomenk Thanks for the explanation. Do you think that enabling this option can improve the performance of inner product primitive for the above cases?
@bartekkuncer please help take a look for this issue as well
Do you think that enabling this option can improve the performance of inner product primitive for the above cases?
@TaoLv, yes, this could be. Two reasons for that:
libiomp5.so
will be used instead of libgomp.so
. So, if you see the performance difference, I would suggest to test Intel MKL-DNN v1.0 / DNNL v1.1+ with LD_PRELOAD=libiomp5.so
, just to confirm that the issue due to or not due to OpenMP run-time.I would also suggest to try upgrading the library to the most recent version and check the performance. I strongly suggest not using this developer's option. It is unsupported, not tested, and as I said, it is most likely broken in the subsequent releases.
If Intel MKL is linked
MKL was linked in both cases and in fact called in both cases from dot
just not from FullyConnected
libiomp5.so will be used instead of libgomp.so
I tried single-threaded (OMP_NUM_THREADS=1
) and still saw a performance drop. There shouldn't be much difference between OMP libraries then right?
I would also suggest to try upgrading the library to the most recent version
See my post above "I also tried the latest DNNL". Performance was unchanged.
I tried single-threaded (OMP_NUM_THREADS=1) and still saw a performance drop. There shouldn't be much difference between OMP libraries then right?
Yeah, there should be zero difference in this case :)
I would also suggest to try upgrading the library to the most recent version
See my post above "I also tried the latest DNNL". Performance was unchanged.
Didn't see that, thanks! Yeah, it looks like a performance bug indeed.
Thanks @emfomenk & @TaoLv to identify the issue. Our team will follow up on the issue and switch more OP to DNNL soon.
Single-threaded benchmarks (OMP_NUM_THREADS=1 ) to confirm that it's not a difference in OMP library: |
Shape | dot (MKL) | FullyConnected (MKL workaround) | FullyConnected (DNNL) | DNNL slowdown |
---|---|---|---|---|---|
5, 512, 512 | 0.0000914 | 0.0001016 | 0.0002448 | 2.40x | |
5, 512, 1536 | 0.0002939 | 0.0003051 | 0.0006576 | 2.15x | |
5, 512, 2048 | 0.0003828 | 0.0003919 | 0.0008582 | 2.18x | |
5, 2048, 512 | 0.0003681 | 0.0003785 | 0.0014638 | 3.86x | |
4, 512, 512 | 0.0000917 | 0.0001038 | 0.0002364 | 2.27x |
Summoning @aaraujom
@TaoLv Regarding your comment
how to improve the performance of FullyConnected when DNNL is used. In fact, I see the performance of DNNL primitive kernel is good, compared with the output of the script.
I ran with both MKL and DNNL verbose options: OMP_NUM_THREADS=1 MKL_VERBOSE=1 DNNL_VERBOSE=1
To summarize the verbose output in microseconds: | Shape | MKL raw | DNNL raw | DNNL raw slowdown | MKL Python overhead | DNNL Python overhead |
---|---|---|---|---|---|---|
5, 512, 512 | 55.71 | 213.19 | 3.83x | 53.19 | 61.61 | |
5, 512, 1536 | 260.18 | 636.10 | 2.44x | 55.82 | 62.90 | |
5, 512, 2048 | 345.66 | 844.64 | 2.44x | 56.74 | 65.66 | |
5, 2048, 512 | 328.68 | 1346.32 | 4.10x | 56.62 | 187.78 | |
4, 512, 512 | 53.75 | 202.27 | 3.76x | 51.45 | 60.03 |
Raw is defined by the time reported from verbose output; note I converted DNNL ms to microseconds.
Python overhead is defined as time reported by the above Python script (converted to microseconds) minus raw time. Note that MKL used dot
whilst DNNL used FullyConnected
.
In both cases, I excluded the first call for each shape to allow for JIT/burn-in.
So we really have four issues:
-DBLAS
set to something, it should use that BLAS. You can add an auto
option that picks according to the priorities on the website. dot
calls MKL but FullyConnected
calls DNNL given that they are both GEMMs. I understand @pengzhao-intel is working on this. Though unless 1 and 2 above are fixed, it will only make performance worse. FullyConnected
than dot
. This is small compared to the DNNL vs MKL difference though. We can also question why there is so much overhead relative to the cost of the multiply. Intel should fix DNNL but that will take time.
MXNet should make -DBLAS
actually choose the BLAS routine both to support a short-term bypass and because it better matches user expectations of what a compilation flag does.
In case somebody finds this issue and wants their optimized build, here is a different workaround that removes the need for LD_PRELOAD
. Just do this before running cmake the first time:
export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
Then cmake
can be run normally:
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
and the compiled MXNet can be run normally without any special environment variables.
To be clear, the above kludge is an undocumented abomination. The problem with cmake -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include
is that it links against MKL twice. DNNL is hard-coded to link dynamically against libmkl_rt.so
:
https://github.com/oneapi-src/oneDNN/blob/1b05a28eb9666efef83b281e4cc1936db5e6cf6c/cmake/MKL.cmake#L64
Then MXNet also links statically against MKL. And it still wants the shared library in dlopen
. So we should just remove the shared library. How do we do that? Implement everything else directly without using the build file. Specifically it implements in CXXFLAGS
what the rest of build system does https://github.com/oneapi-src/oneDNN/blob/1b05a28eb9666efef83b281e4cc1936db5e6cf6c/cmake/MKL.cmake#L65-L67
Does DNNL always link to MKL by default if MKL is available on the system, just as MXNet does link to MKL in such situation?
Further, regarding the double-linking issue, https://github.com/apache/incubator-mxnet/pull/17794 should fix it.
DNNL has only a hidden unsupported option to link to MKL. It will not link to MKL by default.
In case somebody finds this issue and wants their optimized build, here is a different workaround that removes the need for
LD_PRELOAD
. Just do this before running cmake the first time:export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
Then
cmake
can be run normally:cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
and the compiled MXNet can be run normally without any special environment variables.
@kpuatamazon Hi I was trying to benchmark using opperf for mkl [default] vs workaround And despite ensuring mkl is installed & using export CXXFlags followed by usual cmake command, build failed with
gemm.cpp:(.text+0xb45): undefined reference to `cblas_gemm_s8u8s32'
I tried the undocumented abominable kludge option you mentioned and that worked smoothly.
export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
rm -rf build/
mkdir -p build && cd build
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
cmake --build . --parallel 1024
Script for OpPerf : https://gist.github.com/ChaiBapchya/5f2342f75ddeb1e21f14acac665c76ad
Results | Operator | LHS | RHS | MKL Default | MKL Workaround |
---|---|---|---|---|---|
Dot | (4, 512, 512) | (4, 512, 512) | 15.1122 | 4.1254 | |
(5, 512, 512) | (5, 512, 512) | 38.1678 | 7.5323 | ||
(5, 512, 1536 | (5, 512, 1536) | 21.6601 | 19.2503 | ||
(5, 512, 2048) | (5, 512, 2048) | 29.0369 | 23.7432 | ||
(5, 2048, 512) | (5, 2048, 512) | 167.5528 | 129.9957 | ||
Batch_dot | (4, 512, 512) | (4, 512, 512) | 1.7898 | 1.5445 | |
(5, 512, 512) | (5, 512, 512) | 2.2457 | 1.9361 | ||
(5, 512, 1536) | (5, 512, 1536) | 6.1453 | 5.4034 | ||
(5, 512, 2048) | (5, 512, 2048) | 8.246 | 8.0442 | ||
(5, 2048, 512) | (5, 2048, 512) | 160.6243 | 29.0772 | ||
Data | Weight | ||||
FC | (4, 512) | (512, 512) | 0.0609 | 0.068 | |
(5, 512) | (512, 512) | 0.0633 | 0.0731 | ||
(5, 512) | (1536, 512) | 0.0916 | 0.0996 | ||
(5, 512) | (2048, 512) | 0.1081 | 0.1084 | ||
However @kpuatamazon when I try to test out with default [i.e. default -> workaround -> default] by unsetting the environment variable LD_PRELOAD, it failed to build default with gemm.cpp:(.text+0xe6b): undefined reference to
cblas_gemm_s8u8s32'`
@ChaiBapchya to be clear, here's how I am building the second option:
export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
unset LD_PRELOAD #Technically this should be what exists in your environment by default
rm -rf build
mkdir build
cd build
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
ninja -j 30
Note that cmake
does not appear to pick up on changes to the build so it needs a fresh build directory (deleting the cache might work).
Tested with MXNet cfb474b. Compiled with mostly-default cmake settings:
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
Then when I run
export MKL_VERBOSE=1 export MKLDNN_VERBOSE=1 python3 Python 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import mxnet as mx Numpy + Intel(R) MKL: THREADING LAYER: (null) Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime Numpy + Intel(R) MKL: preloading libiomp5.so runtime
@kpuatamazon @kpu Running on Ubuntu 18.04 [which doesn't have MKL installed by default] with default cmake config doesn't use MKL as blas. Hence we can't get the above exports.
Thus for Ubuntu 18.04 base AMI, one has to install MKL in /opt/intel & update the cmake command to
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..
This I found uses mkl as blas & export MKL_VERBOSE=1 confirms it.
With this addition to both [default & workaround] I reran the opperf & I didn't see much perf differences.
Default
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..
Workaround
export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..
Batch_dot
MKL_VERBOSE SGEMM_BATCH(T,N,0x7fdafe19ecec,0x7fdafe19ecf0,0x7fdafe19ecf4,0x7fdafe19ed04,0x7fda6001dfd0,0x7fdafe19ecf8,0x7fda6001e490,0x7fdafe19ecfc,0x7fdafe19ed08,0x3fd7ec0,0x7fdafe19ed00,0x7fdafe19ec28,0x7fdafe 28.71ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
MKL_VERBOSE SGEMM_BATCH(T,N,0x7fdafe19ecec,0x7fdafe19ecf0,0x7fdafe19ecf4,0x7fdafe19ed04,0x7fda6001dfd0,0x7fdafe19ecf8,0x7fda6001e490,0x7fdafe19ecfc,0x7fdafe19ed08,0x3fd7ec0,0x7fdafe19ed00,0x7fdafe19ec28,0x7fdafe 28.53ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
FC
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0551758
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0559082
Batch_dot [same as default]
MKL_VERBOSE SGEMM_BATCH(T,N,0x7f985b78acec,0x7f985b78acf0,0x7f985b78acf4,0x7f985b78ad04,0x7f97b4016cd0,0x7f985b78acf8,0x7f97b401e550,0x7f985b78acfc,0x7f985b78ad08,0x26f2890,0x7f985b78ad00,0x7f985b78ac28,0x7f985b 28.72ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
MKL_VERBOSE SGEMM_BATCH(T,N,0x7f985b78acec,0x7f985b78acf0,0x7f985b78acf4,0x7f985b78ad04,0x7f97b4016cd0,0x7f985b78acf8,0x7f97b401e550,0x7f985b78acfc,0x7f985b78ad08,0x26f2890,0x7f985b78ad00,0x7f985b78ac28,0x7f985b 28.77ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
FC [additional MKL_VERBOSE before DNNL_VERBOSE]
MKL_VERBOSE SGEMM(T,N,512,4,512,0x7f985b789c28,0x7f97b5e52e80,512,0x7f97b401e600,512,0x7f985b789c30,0x7f976e89dd80,512) 39.68us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb4ic512oc512,0.0769043
MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f985b789c28,0x7f976c400100,2048,0x7f976e887100,2048,0x7f985b789c30,0x7f976e8da000,512) 79.41us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.11377
Operator | LHS | RHS | MKL Default | MKL Workaround |
---|---|---|---|---|
Dot | (4, 512, 512) | (4, 512, 512) | 4.1112 | 4.8241 |
(5, 512, 512) | (5, 512, 512) | 6.4421 | 7.607 | |
(5, 512, 1536 | (5, 512, 1536) | 20.3648 | 19.2217 | |
(5, 512, 2048) | (5, 512, 2048) | 23.3236 | 23.2849 | |
(5, 2048, 512) | (5, 2048, 512) | 123.1235 | 123.9806 | |
Batch_dot | (4, 512, 512) | (4, 512, 512) | 1.4105 | 1.407 |
(5, 512, 512) | (5, 512, 512) | 1.7558 | 1.7511 | |
(5, 512, 1536) | (5, 512, 1536) | 6.5931 | 6.5585 | |
(5, 512, 2048) | (5, 512, 2048) | 9.1452 | 9.1031 | |
(5, 2048, 512) | (5, 2048, 512) | 29.0192 | 28.9236 |
Operator | Data | Weight | MKL Default | MKL Workaround |
---|---|---|---|---|
FC | (4, 512) | (512, 512) | 0.057 | 0.0685 |
(5, 512) | (512, 512) | 0.0591 | 0.0698 | |
(5, 512) | (1536, 512) | 0.0823 | 0.0939 | |
(5, 512) | (2048, 512) | 0.0916 | 0.1026 | |
(5, 2048) | (512, 2048) | 0.1146 | 0.1267 |
dot
hasn't been taken over by MKLDNN = DNNL = oneAPI yet, so you shouldn't expect to see a difference there because it never calls MKLDNN's GEMM anyway. FullyConnected has been taken over by MKLDNN = DNNL = oneAPI. This inconsistency is part of the bug report. Your logs agree with these statements.
Do you really mean LHS (4, 512, 512)? I'm talking about LHS (4,512) RHS (512, 512).
What CPU is this on? Note that mine were on a Skylake Xeon c9.9xlarge
. I'd expect the AVX512 kernels to be pretty different from the AVX2 kernels.
What OMP threading were you using? I'd expect for matrices this small that thread scaling is terrible for anything but a tiny number of threads. (I mentioned using 4).
Yes logs agree with these statements. But the perf difference isn't visible via opperf. My bad, lhs, rhs wrongly interpreted. But that's for dot,batch_dot. FC values for data, weight are same as yours.
CPU : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz Instance p3.8xl
This instance doesn't have avx512
$ lscpu | grep Flags | grep avx
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
OMP_Threading i used was default.
However trying FC with OMP_NUM_THREADS=4, still shows that workaround makes FC slower [despite it using MKL] than default[which doesn't use MKL]
MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f0b79ff8c28,0x7f0b62cfb040,2048,0x7f0b6800b740,2048,0x7f0b79ff8c30,0x7f0b68074e80,512) 119.93us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.142822
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0925},
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0921},
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_forward_FullyConnected': 0.1441},
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_forward_FullyConnected': 0.1688},
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.1773}]}]
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.11499
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.115967
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0691},
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0669},
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_forward_FullyConnected': 0.1166},
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_forward_FullyConnected': 0.1438},
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.1509}]}]
Also, instead of using MXNet's native [built-in] profiler, if i use python time-it function still
MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f86067f9c28,0x7f85e34fd040,2048,0x7f85f000b740,2048,0x7f86067f9c30,0x7f85f005fc80,512) 120.56us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.143066
[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.18504947423934937, 'p50_time_FullyConnected': 0.1817569718696177, 'p90_time_FullyConnected': 0.19327133195474744, 'p99_time_FullyConnected': 0.21545797004364425},
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.1922117848880589, 'p50_time_FullyConnected': 0.18283800454810262, 'p90_time_FullyConnected': 0.19954713061451912, 'p99_time_FullyConnected': 0.36291924072429527},
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.24743830785155296, 'p50_time_FullyConnected': 0.23927190341055393, 'p90_time_FullyConnected': 0.25965459644794464, 'p99_time_FullyConnected': 0.3613424429204313},
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.2728361077606678, 'p50_time_FullyConnected': 0.2635499695315957, 'p90_time_FullyConnected': 0.2853184472769499, 'p99_time_FullyConnected': 0.39780672872439016},
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.2834270987659693, 'p50_time_FullyConnected': 0.27543300529941916, 'p90_time_FullyConnected': 0.29397032922133803, 'p99_time_FullyConnected': 0.40460130083374674}]}]
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.116943
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.115967
[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.17584662651643157, 'p50_time_FullyConnected': 0.17386197578161955, 'p90_time_FullyConnected': 0.18297710921615362, 'p99_time_FullyConnected': 0.2026369632221758},
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.177996635902673, 'p50_time_FullyConnected': 0.17414247849956155, 'p90_time_FullyConnected': 0.18130375538021326, 'p99_time_FullyConnected': 0.23936283076181997},
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.22948215948417783, 'p50_time_FullyConnected': 0.22402300965040922, 'p90_time_FullyConnected': 0.2324321656487882, 'p99_time_FullyConnected': 0.3020400775130837},
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.2527528256177902, 'p50_time_FullyConnected': 0.24718954227864742, 'p90_time_FullyConnected': 0.263673288282007, 'p99_time_FullyConnected': 0.3303098364267497},
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.26115449611097574, 'p50_time_FullyConnected': 0.25852309772744775, 'p90_time_FullyConnected': 0.26592351496219635, 'p99_time_FullyConnected': 0.304172352189198}]}]
Since Skylake has AVX512. I'm guessing this FC slowdown might be in AVX512 kernel. Let me reproduce that.
Can confirm that this issue is specific to AVX512 kernels. Tried this on c5.xl $ lscpu
Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.133789
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.132812
[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.10202302001744101, 'p50_time_FullyConnected': 0.10086749989568489, 'p90_time_FullyConnected': 0.10658760029400582, 'p99_time_FullyConnected': 0.13521948004836298},
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.10642346004715364, 'p50_time_FullyConnected': 0.09991750016524747, 'p90_time_FullyConnected': 0.10565369971118344, 'p99_time_FullyConnected': 0.2586996700802042},
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.16890607999812346, 'p50_time_FullyConnected': 0.16431500012004108, 'p90_time_FullyConnected': 0.1781331999154645, 'p99_time_FullyConnected': 0.2831235897247094},
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.20140223995440465, 'p50_time_FullyConnected': 0.19778950013460417, 'p90_time_FullyConnected': 0.20401089991537447, 'p99_time_FullyConnected': 0.3063294199228036},
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.21596427998701984, 'p50_time_FullyConnected': 0.2096700000038254, 'p90_time_FullyConnected': 0.21819640001012885, 'p99_time_FullyConnected': 0.3412436299549877}]}]
MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f9bcf6fac28,0x7f9bc22f4040,2048,0x7f9b1400ce80,2048,0x7f9bcf6fac30,0x7f9b1405e840,512) 21.25us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:18
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0378418
MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f9bcf6fac28,0x7f9bc22f4040,2048,0x7f9b1400ce80,2048,0x7f9bcf6fac30,0x7f9b14061c00,512) 20.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:18
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0371094
[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.11772135999308375, 'p50_time_FullyConnected': 0.1149684999290912, 'p90_time_FullyConnected': 0.1244978000613628, 'p99_time_FullyConnected': 0.14825501980340045},
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.120828840035756, 'p50_time_FullyConnected': 0.11370450010872446, 'p90_time_FullyConnected': 0.12752780021401122, 'p99_time_FullyConnected': 0.2412066401620902},
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.13385597998421872, 'p50_time_FullyConnected': 0.12600750005731243, 'p90_time_FullyConnected': 0.14806160011175962, 'p99_time_FullyConnected': 0.2509373301927551},
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.14175208003507578, 'p50_time_FullyConnected': 0.1372545000322134, 'p90_time_FullyConnected': 0.14401020002878798, 'p99_time_FullyConnected': 0.2423993399725075},
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.143890859962994, 'p50_time_FullyConnected': 0.1397979999637755, 'p90_time_FullyConnected': 0.14637689982919258, 'p99_time_FullyConnected': 0.22678783964693117}]}]
To reproduce https://gist.github.com/ChaiBapchya/a849cfd566b8114e695454850b48077b https://gist.github.com/ChaiBapchya/5f2342f75ddeb1e21f14acac665c76ad#file-benchmark_intel_mkl-py
Hi thanks for checking. Let me try to reproduce this in my end and fix it.
@aaraujom any update?
Hi @ChaiBapchya - I was able to reproduce issue on my end. DNNL is missing some kernels for very specific problem sizes affecting the gemm size in this thread. We will address this issue in DNNL.
Mostly fixed by the v1.6.x release of oneDNN. There is still a speed discrepancy but it's not as bad.
git checkout 6b568fd2c10e8cc14c21d6150ba71933a2af2a41
(current v1.x branch of MXNet including oneDNN 1.6.1)
Shape (5, 512, 512)
0.0000471 seconds for fullyconnected (DNNL)
0.0000413 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0000691 seconds for fullyconnected (DNNL)
0.0000639 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0000936 seconds for fullyconnected (DNNL)
0.0000868 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0000949 seconds for fullyconnected (DNNL)
0.0000870 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000446 seconds for fullyconnected (DNNL)
0.0000392 seconds for dot (MKL)
Revert to oneDNN v1.3.
git checkout 6b568fd2c10e8cc14c21d6150ba71933a2af2a41 && git revert 17111037176294d86b7dc6f719347aa24c96b317 && git revert eae6171cb56190cf06f5b11597380c082745991b
Shape (5, 512, 512)
0.0000842 seconds for fullyconnected (DNNL)
0.0000407 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0001908 seconds for fullyconnected (DNNL)
0.0000606 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0002421 seconds for fullyconnected (DNNL)
0.0000880 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0003462 seconds for fullyconnected (DNNL)
0.0000912 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000841 seconds for fullyconnected (DNNL)
0.0000395 seconds for dot (MKL)
@ChaiBapchya - Will those improvements work for you?
Is there a way to bring down the deficit and have parity? or do we know the reason why DNNL is slower? @TaoLv @aaraujom
Hi, we benchmarked gemm performance for those sizes internally using oneDNN vs oneMKL directly and here is what we got:
Max (20K trials) GFlops | oneMKL (1Th) | oneDNN (1Th) | oneMKL (4Th) | oneDNN (4Th) | oneMKL (8Th) | oneDNN (8Th) | oneMKL (28Th) | oneDNN (28Th) |
---|---|---|---|---|---|---|---|---|
5, 512, 512 | 34.49 | 40.96 | 174.76 | 262.14 | 291.27 | 436.91 | 374.49 | 524.29 |
5, 512, 1536 | 26.57 | 27.40 | 178.73 | 231.30 | 357.47 | 491.52 | 561.74 | 873.81 |
5, 512, 2048 | 25.83 | 26.55 | 143.64 | 174.76 | 338.25 | 436.91 | 616.81 | 1048.58 |
5, 2048, 512 | 25.51 | 26.82 | 139.81 | 183.96 | 388.36 | 524.29 | 873.81 | 1310.72 |
4, 512, 512 | 31.78 | 42.80 | 161.32 | 233.02 | 262.14 | 419.43 | 299.59 | 419.43 |
Avg (20K trials) GFlops | oneMKL (1Th) | oneDNN (1Th) | oneMKL (4Th) | oneDNN (4Th) | oneMKL (8Th) | oneDNN (8Th) | oneMKL (28Th) | oneDNN (28Th) |
---|---|---|---|---|---|---|---|---|
5, 512, 512 | 33.96 | 39.45 | 169.05 | 233.88 | 258.32 | 364.93 | 328.68 | 399.95 |
5, 512, 1536 | 26.30 | 27.13 | 174.04 | 226.75 | 342.51 | 463.81 | 492.14 | 765.09 |
5, 512, 2048 | 25.58 | 26.35 | 140.57 | 171.42 | 328.70 | 429.25 | 589.81 | 873.36 |
5, 2048, 512 | 25.27 | 26.57 | 137.35 | 179.91 | 368.89 | 508.38 | 792.29 | 1101.67 |
4, 512, 512 | 31.02 | 41.91 | 146.59 | 216.20 | 226.68 | 335.54 | 271.98 | 334.98 |
@ChaiBapchya Is it possible to compare the performance of FC in two versions? Comparing FC with dot is not so fair to me. As you may know, they are using different execution paths, FCompute vs. FComputeEx.
closing since the fix is updated with the new MKLDNN.
@pengzhao-intel v1.7.0 public mxnet uses MKLDNN v1.3 and hence won't have this optimization right? https://github.com/apache/incubator-mxnet/tree/1.7.0/3rdparty This means public mxnet v1.7 won't have this fix. v1.8, on the contrary, would have the optimized mkldnn [by virtue of v1.x branch having mkldnn v1.6+ prior to 1.8.rc0] right?
@pengzhao-intel v1.7.0 public mxnet uses MKLDNN v1.3 and hence won't have this optimization right? https://github.com/apache/incubator-mxnet/tree/1.7.0/3rdparty This means public mxnet v1.7 won't have this fix. v1.8, on the contrary, would have the optimized mkldnn [by virtue of v1.x branch having mkldnn v1.6+ prior to 1.8.rc0] right?
Thanks to raise the question. China team is on a public holiday this week. @anko-intel could you help take a look at the issue?
@ChaiBapchya, You are right, we also have to include 1.6.4 (https://github.com/apache/incubator-mxnet/pull/19251) to make it happened.
Problem
Not sure how much we care about MKL support but to the extent it still appears in the buld system, operator support should be consistent.
When compiled with MKL present (MKL is found in
/opt/intel
), MXNet calls MKL fordot
andbatch_dot
and DNNL forfully_connected
. These are all GEMM operators; why is it inconsistent? This is making Sockeye decoding 22% slower (see below).This inconsistency did not matter much in MXNet 1.5.0 because MKLDNN would delegate to MKL. However, aa1074dc1704d3732ab205c43d48083ef8c69680 upgraded to MKLDNN 1.0, which hid the ability of MKLDNN to delegate to MKL: https://github.com/oneapi-src/oneDNN/commit/304915096d1def19999b963a60569ec46a882c16 . (MKLDNN has since been renamed DNNL.)
Since MKLDNN only hid support for delegating to MKL, it's possible to restore delegatation (see workaround).
Testing
Tested with MXNet cfb474ba743d5ea85161bf19875488f4cb409d3c. Compiled with mostly-default cmake settings:
Then when I run
You can see DNNL is called for
FullyConnected
while MKL is called fordot
andbatch_dot
.Performance impact
I timed Sockeye decoding. Commit https://github.com/apache/incubator-mxnet/commit/aa1074dc1704d3732ab205c43d48083ef8c69680 made decoding 22% slower (416.878s up from 342.037s for b5d07e30321da47d604b99048c1b57c03ec819b0) even with MKL installed in
/opt/intel/
.(Default compilation is
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
; workaround compilation is below.)Tested on a Skylake Xeon, c5.9xlarge
Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
withOMP_NUM_THREADS=4
.Workaround
Since DNNL hid support for delegating to MKL, it's still possible to turn delegation back on.
which compiles but then triggers a link error at runtime
OSError: /home/ubuntu/mxnet/build/3rdparty/mkldnn/src/libmkldnn.so.1: undefined symbol: cblas_gemm_s8u8s32_pack
So I kludged it withexport LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
and was then able to use MXNet at runtime. There's probably a cleaner way of fixing the linkage.Recommended fix
When compiled with MKL, MXNet should call MKL directly from
FullyConnected
like it already does fordot
andbatch_dot
.