kpuatamazon commented 4 years ago

Problem

Not sure how much we care about MKL support but to the extent it still appears in the buld system, operator support should be consistent.

When compiled with MKL present (MKL is found in /opt/intel), MXNet calls MKL for dot and batch_dot and DNNL for fully_connected. These are all GEMM operators; why is it inconsistent? This is making Sockeye decoding 22% slower (see below).

This inconsistency did not matter much in MXNet 1.5.0 because MKLDNN would delegate to MKL. However, aa1074dc1704d3732ab205c43d48083ef8c69680 upgraded to MKLDNN 1.0, which hid the ability of MKLDNN to delegate to MKL: https://github.com/oneapi-src/oneDNN/commit/304915096d1def19999b963a60569ec46a882c16 . (MKLDNN has since been renamed DNNL.)

Since MKLDNN only hid support for delegating to MKL, it's possible to restore delegatation (see workaround).

Testing

Tested with MXNet cfb474ba743d5ea85161bf19875488f4cb409d3c. Compiled with mostly-default cmake settings:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..

Then when I run

export MKL_VERBOSE=1
export MKLDNN_VERBOSE=1
python3
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 3.00GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x1a0fdc0,1,0x1a0fdc0,1) 1.47ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24
>>> a = mx.nd.ones(shape=(2,2))
>>> mx.nd.FullyConnected(a,a,num_hidden=2,no_bias=True)
dnnl_verbose,info,DNNL v1.1.2 (commit cb2cc7ac17ff4e2ef50805c7048d33256d82be4d)
dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb2ic2oc2,74.9971

[[2. 2.]
 [2. 2.]]
<NDArray 2x2 @cpu(0)>
>>> a = mx.nd.ones(shape=(2,2,2))
>>> mx.nd.batch_dot(a,a)
MKL_VERBOSE SGEMM_BATCH(N,N,0x7fc3238b809c,0x7fc3238b80a0,0x7fc3238b80a4,0x7fc3238b80b4,0x7fc228010b90,0x7fc3238b80a8,0x7fc22800f770,0x7fc3238b80ac,0x7fc3238b80b8,0x7fc2280190e0,0x7fc3238b80b0,0x7fc3238b7fc8,0x7 363.79us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24

[[[2. 2.]
  [2. 2.]]

 [[2. 2.]
  [2. 2.]]]
>>> mx.nd.dot(a,a)
MKL_VERBOSE SGEMM(N,N,4,4,2,0x7fc3238b8198,0x7fc2280043c0,4,0x7fc2280043c0,2,0x7fc3238b81a0,0x7fc228004580,4) 8.52us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24

[[[[2. 2.]
   [2. 2.]]

  [[2. 2.]
   [2. 2.]]]

 [[[2. 2.]
   [2. 2.]]

  [[2. 2.]
   [2. 2.]]]]
<NDArray 2x2x2x2 @cpu(0)>

You can see DNNL is called for FullyConnected while MKL is called for dot and batch_dot.

Performance impact

I timed Sockeye decoding. Commit https://github.com/apache/incubator-mxnet/commit/aa1074dc1704d3732ab205c43d48083ef8c69680 made decoding 22% slower (416.878s up from 342.037s for b5d07e30321da47d604b99048c1b57c03ec819b0) even with MKL installed in /opt/intel/.

Commit	Compilation	Time(s)
b5d07e3 (before MKLDNN 1.0 change)	Default	342.037
aa1074d (MKLDNN 1.0 change)	Default	416.878
aa1074d (MKLDNN 1.0 change)	Workaround	343.706
cfb474ba Recent	Default	385.587
cfb474ba Recent	Workaround	312.509

(Default compilation is cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..; workaround compilation is below.)

Tested on a Skylake Xeon, c5.9xlarge Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz with OMP_NUM_THREADS=4.

Workaround

Since DNNL hid support for delegating to MKL, it's still possible to turn delegation back on.

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..

which compiles but then triggers a link error at runtime OSError: /home/ubuntu/mxnet/build/3rdparty/mkldnn/src/libmkldnn.so.1: undefined symbol: cblas_gemm_s8u8s32_pack So I kludged it with export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so and was then able to use MXNet at runtime. There's probably a cleaner way of fixing the linkage.

Recommended fix

When compiled with MKL, MXNet should call MKL directly from FullyConnected like it already does for dot and batch_dot.

TaoLv commented 4 years ago

Thanks for raising the issue, @kpuatamazon. We plan to optimize the dot and batch_dot operator with DNNL MatMul primitive. But please note that, the performance of MatMul primitive was not optimized until v1.3 which was released last week. That's why we didn't integrate the primitive at the first time when it's introduced in v1.2.

When compiled with MKL, MXNet should call MKL directly from FullyConnected like it already does for dot and batch_dot.

As mentioned above, dot and batch_dot will also be optimized with DNNL. And as DNNL is more dedicated on deep learning, we will consider DNNL with high priority when both DNNL and MKL are enabled in compilation.

Please provide a simple reproducer if you find FullyConnected based on DNNL is slower than it based on MKL BLAS.

kpuatamazon commented 4 years ago

Here's some benchmarks on 2fff11d4233814aa4ad07858779338090ec2132d (current tip of master) with the same Skylake c5.9xlarge. DNNL is substantially slower than MKL.

(Which DNNL is in master?)

#!/usr/bin/env python3
import mxnet as mx
import time

def time_procedure(shape, count, proc):
  rows, inner, cols = shape
  a = mx.nd.random_uniform(shape=(rows, inner), low=-1.0, high=1.0)
  b = mx.nd.random_uniform(shape=(cols, inner), low=-1.0, high=1.0)
  # Burn in
  proc(a, b, cols)
  mx.nd.waitall()
  begin = time.time()
  for i in range(0, count):
    proc(a, b, cols)
    mx.nd.waitall()
  return (time.time() - begin) / count

shapes = [(5, 512, 512), (5,512,1536), (5,512,2048), (5,2048,512), (4,512,512)]
count = 1000

procedures = {
  "fullyconnected (DNNL)" : (lambda a, b, cols : mx.nd.FullyConnected(a, b, no_bias=True, num_hidden=cols)),
  "dot (MKL)" : (lambda a, b, cols : mx.nd.dot(a, b, transpose_b = True))
}
for s in shapes:
  print("Shape " + str(s))
  stats = {}
  for name, l in procedures.items():
    stats[name] = time_procedure(s, count, l)
    print("{:.7f} seconds for {}".format(stats[name], name))

Run as OMP_NUM_THREADS=4 ./mult_bench.py:

Shape (5, 512, 512)
0.0000961 seconds for fullyconnected (DNNL)
0.0000509 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0002011 seconds for fullyconnected (DNNL)
0.0000735 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0002521 seconds for fullyconnected (DNNL)
0.0001027 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0003569 seconds for fullyconnected (DNNL)
0.0001018 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000946 seconds for fullyconnected (DNNL)
0.0000496 seconds for dot (MKL)

I don't really mind what the default BLAS implementation is. But choosing to use MKL should be 1) atomic and 2) not require undocumented compilation options.

kpuatamazon commented 4 years ago

I also tried the latest DNNL:

cd 3rdparty
rm -rf mkldnn
git clone https://github.com/oneapi-src/oneDNN mkldnn
cd ../build && ninja -j 32

and DNNL is still slower:

Shape (5, 512, 512)
0.0000875 seconds for fullyconnected (DNNL)
0.0000507 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0002005 seconds for fullyconnected (DNNL)
0.0000729 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0002516 seconds for fullyconnected (DNNL)
0.0000974 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0003564 seconds for fullyconnected (DNNL)
0.0000981 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000928 seconds for fullyconnected (DNNL)
0.0000497 seconds for dot (MKL)

TaoLv commented 4 years ago

Thank you @kpuatamazon. I will try to reproduce the performance issue and get back to you later.

(Which DNNL is in master?)

It should be v1.2.2 but was down graded to v1.1.2 by https://github.com/apache/incubator-mxnet/pull/17084 (maybe by accident).

But choosing to use MKL should be 1) atomic and 2) not require undocumented compilation options.

Not sure I understand what do you mean. But using MKL BLAS is documented in:

kpuatamazon commented 4 years ago

This still uses MKLDNN for FullyConnected

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DBLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON ..

I was able to get MKL to run by disabling MKLDNN entirely:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DBLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON -DUSE_MKLDNN=OFF ..

But then I lose other kernels like DNNL softmax and actually Sockeye randomly crashes:

mxnet.base.MXNetError: MXNetError: Out of range value for value, value='inf', in operator _full(name="", dtype="float32", value="inf", ctx="cpu(0)", shape="(5, 1)")

(bizarrely this happens after it translated many sentences with that same code.)

How do I achieve the fastest combination of DNNL softmax and MKL matrix multiply for FullyConnected using only documented options?

TaoLv commented 4 years ago

Yes, USE_MKLDNN is ON by default on Linux. See https://github.com/apache/incubator-mxnet/blob/master/CMakeLists.txt#L44. If want to disable DNNL optimizations, you need set USE_MKLDNN to OFF explicitly in the cmake line.

How do I achieve the fastest combination of DNNL softmax and MKL matrix multiply for FullyConnected using only documented options?

I don't think it's possible at this moment. I think the real question is how to improve the performance of FullyConnected when DNNL is used. In fact, I see the performance of DNNL primitive kernel is good, compared with the output of the script. Take the last shape as an example (on my machine, not ec2):

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb4ic512oc512,0.0490723
0.0001380 seconds for fullyconnected (DNNL)
0.0000556 seconds for dot (MKL)

You can set DNNL_VERBOSE=1 when running the script to get the verbose output.

kpuatamazon commented 4 years ago

I don't think it's possible at this moment.

Can we please have a build option to force MKL usage for all GEMM calls? This is a pretty big performance regression.

I think the real question is how to improve the performance of FullyConnected when DNNL is used.

Hmmm, when DNNL delegates to MKL, that happens internally to DNNL, right? If the MXNet wrapper is the problem, then why is delegating to MKL much faster?

Applying my workaround:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
ninja -j 30
export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
export OMP_NUM_THREADS=4
./mult_bench.py #NB I edited the script to label things more clearly.

Output:

Shape (5, 512, 512)
0.0000628 seconds for fullyconnected (MKL workaround)
0.0000488 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0000816 seconds for fullyconnected (MKL workaround)
0.0000661 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0001109 seconds for fullyconnected (MKL workaround)
0.0000957 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0001078 seconds for fullyconnected (MKL workaround)
0.0000954 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000501 seconds for fullyconnected (MKL workaround)
0.0000434 seconds for dot (MKL)

There is something less efficient about FullyConnected than dot but it's nowhere near explaining DNNL's slowdown.

FYI I usually wear this hat (as opposed to @kpu) on Mondays so expect slower responses on other days.

TaoLv commented 4 years ago

@emfomenk What's -D_DNNL_USE_MKL=FULL for? I'm not aware of this option and guessing it's not a user facing option.

emfomenk commented 4 years ago

@TaoLv, it is a developer's option to enable Intel MKL gemm back (instead of the jitted one). I would suggest to avoid using it, as it internal one (marked with leading underscore). I think it was even broken somewhere in v1.1 to v1.3...

TaoLv commented 4 years ago

@emfomenk Thanks for the explanation. Do you think that enabling this option can improve the performance of inner product primitive for the above cases?

pengzhao-intel commented 4 years ago

@bartekkuncer please help take a look for this issue as well

emfomenk commented 4 years ago

Do you think that enabling this option can improve the performance of inner product primitive for the above cases?

@TaoLv, yes, this could be. Two reasons for that:

If Intel MKL is linked, then most likely libiomp5.so will be used instead of libgomp.so. So, if you see the performance difference, I would suggest to test Intel MKL-DNN v1.0 / DNNL v1.1+ with LD_PRELOAD=libiomp5.so, just to confirm that the issue due to or not due to OpenMP run-time.
Performance bug in Intel MKL-DNN v1.0 / DNNL v1.1+. We definitely fixed few performance issues in the library since v1.0.

I would also suggest to try upgrading the library to the most recent version and check the performance. I strongly suggest not using this developer's option. It is unsupported, not tested, and as I said, it is most likely broken in the subsequent releases.

kpu commented 4 years ago

If Intel MKL is linked

MKL was linked in both cases and in fact called in both cases from dot just not from FullyConnected

libiomp5.so will be used instead of libgomp.so

I tried single-threaded (OMP_NUM_THREADS=1) and still saw a performance drop. There shouldn't be much difference between OMP libraries then right?

I would also suggest to try upgrading the library to the most recent version

See my post above "I also tried the latest DNNL". Performance was unchanged.

emfomenk commented 4 years ago

I tried single-threaded (OMP_NUM_THREADS=1) and still saw a performance drop. There shouldn't be much difference between OMP libraries then right?

Yeah, there should be zero difference in this case :)

I would also suggest to try upgrading the library to the most recent version

See my post above "I also tried the latest DNNL". Performance was unchanged.

Didn't see that, thanks! Yeah, it looks like a performance bug indeed.

pengzhao-intel commented 4 years ago

Thanks @emfomenk & @TaoLv to identify the issue. Our team will follow up on the issue and switch more OP to DNNL soon.

kpuatamazon commented 4 years ago

Single-threaded benchmarks (`OMP_NUM_THREADS=1`) to confirm that it's not a difference in OMP library:	Shape	dot (MKL)	FullyConnected (MKL workaround)	FullyConnected (DNNL)
5, 512, 512	0.0000914	0.0001016	0.0002448	2.40x
5, 512, 1536	0.0002939	0.0003051	0.0006576	2.15x
5, 512, 2048	0.0003828	0.0003919	0.0008582	2.18x
5, 2048, 512	0.0003681	0.0003785	0.0014638	3.86x
4, 512, 512	0.0000917	0.0001038	0.0002364	2.27x

emfomenk commented 4 years ago

Summoning @aaraujom

kpuatamazon commented 4 years ago

@TaoLv Regarding your comment

how to improve the performance of FullyConnected when DNNL is used. In fact, I see the performance of DNNL primitive kernel is good, compared with the output of the script.

I ran with both MKL and DNNL verbose options: OMP_NUM_THREADS=1 MKL_VERBOSE=1 DNNL_VERBOSE=1

verbose.txt

To summarize the verbose output in microseconds:	Shape	MKL raw	DNNL raw	DNNL raw slowdown	MKL Python overhead
5, 512, 512	55.71	213.19	3.83x	53.19	61.61
5, 512, 1536	260.18	636.10	2.44x	55.82	62.90
5, 512, 2048	345.66	844.64	2.44x	56.74	65.66
5, 2048, 512	328.68	1346.32	4.10x	56.62	187.78
4, 512, 512	53.75	202.27	3.76x	51.45	60.03

Raw is defined by the time reported from verbose output; note I converted DNNL ms to microseconds.
Python overhead is defined as time reported by the above Python script (converted to microseconds) minus raw time. Note that MKL used dot whilst DNNL used FullyConnected.

In both cases, I excluded the first call for each shape to allow for JIT/burn-in.

So we really have four issues:

Raw speed as reported by MKL's verbose output is much faster than raw speed reported by DNNL's verbose output. This is definitely a speed problem with DNNL.
No documented option to compile MKLDNN's kernels with MKL's GEMM. In my opinion, when I compile with -DBLAS set to something, it should use that BLAS. You can add an auto option that picks according to the priorities on the website.
It's weird that dot calls MKL but FullyConnected calls DNNL given that they are both GEMMs. I understand @pengzhao-intel is working on this. Though unless 1 and 2 above are fixed, it will only make performance worse.
MXNet / Python overhead is larger for FullyConnected than dot. This is small compared to the DNNL vs MKL difference though. We can also question why there is so much overhead relative to the cost of the multiply.

Intel should fix DNNL but that will take time.

MXNet should make -DBLAS actually choose the BLAS routine both to support a short-term bypass and because it better matches user expectations of what a compilation flag does.

kpuatamazon commented 4 years ago

In case somebody finds this issue and wants their optimized build, here is a different workaround that removes the need for LD_PRELOAD. Just do this before running cmake the first time:

export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"

Then cmake can be run normally:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..

and the compiled MXNet can be run normally without any special environment variables.

To be clear, the above kludge is an undocumented abomination. The problem with cmake -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include is that it links against MKL twice. DNNL is hard-coded to link dynamically against libmkl_rt.so: https://github.com/oneapi-src/oneDNN/blob/1b05a28eb9666efef83b281e4cc1936db5e6cf6c/cmake/MKL.cmake#L64 Then MXNet also links statically against MKL. And it still wants the shared library in dlopen. So we should just remove the shared library. How do we do that? Implement everything else directly without using the build file. Specifically it implements in CXXFLAGS what the rest of build system does https://github.com/oneapi-src/oneDNN/blob/1b05a28eb9666efef83b281e4cc1936db5e6cf6c/cmake/MKL.cmake#L65-L67

leezu commented 4 years ago

Does DNNL always link to MKL by default if MKL is available on the system, just as MXNet does link to MKL in such situation?

Further, regarding the double-linking issue, https://github.com/apache/incubator-mxnet/pull/17794 should fix it.

kpuatamazon commented 4 years ago

DNNL has only a hidden unsupported option to link to MKL. It will not link to MKL by default.

ChaiBapchya commented 4 years ago

In case somebody finds this issue and wants their optimized build, here is a different workaround that removes the need for LD_PRELOAD. Just do this before running cmake the first time:
export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
Then cmake can be run normally:
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
and the compiled MXNet can be run normally without any special environment variables.

@kpuatamazon Hi I was trying to benchmark using opperf for mkl [default] vs workaround And despite ensuring mkl is installed & using export CXXFlags followed by usual cmake command, build failed with

gemm.cpp:(.text+0xb45): undefined reference to `cblas_gemm_s8u8s32'

I tried the undocumented abominable kludge option you mentioned and that worked smoothly.

export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
rm -rf build/
mkdir -p build && cd build
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
cmake --build . --parallel 1024

Script for OpPerf : https://gist.github.com/ChaiBapchya/5f2342f75ddeb1e21f14acac665c76ad

Results	Operator	LHS	RHS	MKL Default
Dot	(4, 512, 512)	(4, 512, 512)	15.1122	4.1254
	(5, 512, 512)	(5, 512, 512)	38.1678	7.5323
	(5, 512, 1536	(5, 512, 1536)	21.6601	19.2503
	(5, 512, 2048)	(5, 512, 2048)	29.0369	23.7432
	(5, 2048, 512)	(5, 2048, 512)	167.5528	129.9957

Batch_dot	(4, 512, 512)	(4, 512, 512)	1.7898	1.5445
	(5, 512, 512)	(5, 512, 512)	2.2457	1.9361
	(5, 512, 1536)	(5, 512, 1536)	6.1453	5.4034
	(5, 512, 2048)	(5, 512, 2048)	8.246	8.0442
	(5, 2048, 512)	(5, 2048, 512)	160.6243	29.0772

	Data	Weight
FC	(4, 512)	(512, 512)	0.0609	0.068
	(5, 512)	(512, 512)	0.0633	0.0731
	(5, 512)	(1536, 512)	0.0916	0.0996
	(5, 512)	(2048, 512)	0.1081	0.1084

However @kpuatamazon when I try to test out with default [i.e. default -> workaround -> default] by unsetting the environment variable LD_PRELOAD, it failed to build default with gemm.cpp:(.text+0xe6b): undefined reference tocblas_gemm_s8u8s32'`

kpuatamazon commented 4 years ago

@ChaiBapchya to be clear, here's how I am building the second option:

export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
unset LD_PRELOAD  #Technically this should be what exists in your environment by default
rm -rf build
mkdir build
cd build
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
ninja -j 30

Note that cmake does not appear to pick up on changes to the build so it needs a fresh build directory (deleting the cache might work).

ChaiBapchya commented 4 years ago

Tested with MXNet cfb474b. Compiled with mostly-default cmake settings:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..

Then when I run

export MKL_VERBOSE=1
export MKLDNN_VERBOSE=1
python3
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime

@kpuatamazon @kpu Running on Ubuntu 18.04 [which doesn't have MKL installed by default] with default cmake config doesn't use MKL as blas. Hence we can't get the above exports.

Thus for Ubuntu 18.04 base AMI, one has to install MKL in /opt/intel & update the cmake command to

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..

This I found uses mkl as blas & export MKL_VERBOSE=1 confirms it.

With this addition to both [default & workaround] I reran the opperf & I didn't see much perf differences.

Commands

Default

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..

Workaround

export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..

Logs

Default

Batch_dot

MKL_VERBOSE SGEMM_BATCH(T,N,0x7fdafe19ecec,0x7fdafe19ecf0,0x7fdafe19ecf4,0x7fdafe19ed04,0x7fda6001dfd0,0x7fdafe19ecf8,0x7fda6001e490,0x7fdafe19ecfc,0x7fdafe19ed08,0x3fd7ec0,0x7fdafe19ed00,0x7fdafe19ec28,0x7fdafe 28.71ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
MKL_VERBOSE SGEMM_BATCH(T,N,0x7fdafe19ecec,0x7fdafe19ecf0,0x7fdafe19ecf4,0x7fdafe19ed04,0x7fda6001dfd0,0x7fdafe19ecf8,0x7fda6001e490,0x7fdafe19ecfc,0x7fdafe19ed08,0x3fd7ec0,0x7fdafe19ed00,0x7fdafe19ec28,0x7fdafe 28.53ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16

FC

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0551758
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0559082

Workaround

Batch_dot [same as default]

MKL_VERBOSE SGEMM_BATCH(T,N,0x7f985b78acec,0x7f985b78acf0,0x7f985b78acf4,0x7f985b78ad04,0x7f97b4016cd0,0x7f985b78acf8,0x7f97b401e550,0x7f985b78acfc,0x7f985b78ad08,0x26f2890,0x7f985b78ad00,0x7f985b78ac28,0x7f985b 28.72ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
MKL_VERBOSE SGEMM_BATCH(T,N,0x7f985b78acec,0x7f985b78acf0,0x7f985b78acf4,0x7f985b78ad04,0x7f97b4016cd0,0x7f985b78acf8,0x7f97b401e550,0x7f985b78acfc,0x7f985b78ad08,0x26f2890,0x7f985b78ad00,0x7f985b78ac28,0x7f985b 28.77ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16

FC [additional MKL_VERBOSE before DNNL_VERBOSE]

MKL_VERBOSE SGEMM(T,N,512,4,512,0x7f985b789c28,0x7f97b5e52e80,512,0x7f97b401e600,512,0x7f985b789c30,0x7f976e89dd80,512) 39.68us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb4ic512oc512,0.0769043
MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f985b789c28,0x7f976c400100,2048,0x7f976e887100,2048,0x7f985b789c30,0x7f976e8da000,512) 79.41us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.11377

Results

Operator	LHS	RHS	MKL Default	MKL Workaround
Dot	(4, 512, 512)	(4, 512, 512)	4.1112	4.8241
	(5, 512, 512)	(5, 512, 512)	6.4421	7.607
	(5, 512, 1536	(5, 512, 1536)	20.3648	19.2217
	(5, 512, 2048)	(5, 512, 2048)	23.3236	23.2849
	(5, 2048, 512)	(5, 2048, 512)	123.1235	123.9806

Batch_dot	(4, 512, 512)	(4, 512, 512)	1.4105	1.407
	(5, 512, 512)	(5, 512, 512)	1.7558	1.7511
	(5, 512, 1536)	(5, 512, 1536)	6.5931	6.5585
	(5, 512, 2048)	(5, 512, 2048)	9.1452	9.1031
	(5, 2048, 512)	(5, 2048, 512)	29.0192	28.9236

Operator	Data	Weight	MKL Default	MKL Workaround
FC	(4, 512)	(512, 512)	0.057	0.0685
	(5, 512)	(512, 512)	0.0591	0.0698
	(5, 512)	(1536, 512)	0.0823	0.0939
	(5, 512)	(2048, 512)	0.0916	0.1026
	(5, 2048)	(512, 2048)	0.1146	0.1267

kpu commented 4 years ago

dot hasn't been taken over by MKLDNN = DNNL = oneAPI yet, so you shouldn't expect to see a difference there because it never calls MKLDNN's GEMM anyway. FullyConnected has been taken over by MKLDNN = DNNL = oneAPI. This inconsistency is part of the bug report. Your logs agree with these statements.

Do you really mean LHS (4, 512, 512)? I'm talking about LHS (4,512) RHS (512, 512).

What CPU is this on? Note that mine were on a Skylake Xeon c9.9xlarge. I'd expect the AVX512 kernels to be pretty different from the AVX2 kernels.

What OMP threading were you using? I'd expect for matrices this small that thread scaling is terrible for anything but a tiny number of threads. (I mentioned using 4).

ChaiBapchya commented 4 years ago

Yes logs agree with these statements. But the perf difference isn't visible via opperf. My bad, lhs, rhs wrongly interpreted. But that's for dot,batch_dot. FC values for data, weight are same as yours.

CPU : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz Instance p3.8xl

This instance doesn't have avx512

$ lscpu | grep Flags | grep avx
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt

OMP_Threading i used was default.

However trying FC with OMP_NUM_THREADS=4, still shows that workaround makes FC slower [despite it using MKL] than default[which doesn't use MKL]

Workaround [slow]

MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f0b79ff8c28,0x7f0b62cfb040,2048,0x7f0b6800b740,2048,0x7f0b79ff8c30,0x7f0b68074e80,512) 119.93us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:4
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.142822

{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0925}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0921}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_forward_FullyConnected': 0.1441}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_forward_FullyConnected': 0.1688}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.1773}]}]

Default [faster]

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.11499
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.115967

{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0691}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0669}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_forward_FullyConnected': 0.1166}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048},  'avg_time_forward_FullyConnected': 0.1438}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.1509}]}]

Python's time-it function in OpPerf

Also, instead of using MXNet's native [built-in] profiler, if i use python time-it function still

Workaround [slow]

MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f86067f9c28,0x7f85e34fd040,2048,0x7f85f000b740,2048,0x7f86067f9c30,0x7f85f005fc80,512) 120.56us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:4
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.143066

[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.18504947423934937, 'p50_time_FullyConnected': 0.1817569718696177, 'p90_time_FullyConnected': 0.19327133195474744, 'p99_time_FullyConnected': 0.21545797004364425}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.1922117848880589, 'p50_time_FullyConnected': 0.18283800454810262, 'p90_time_FullyConnected': 0.19954713061451912, 'p99_time_FullyConnected': 0.36291924072429527}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.24743830785155296, 'p50_time_FullyConnected': 0.23927190341055393, 'p90_time_FullyConnected': 0.25965459644794464, 'p99_time_FullyConnected': 0.3613424429204313}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.2728361077606678, 'p50_time_FullyConnected': 0.2635499695315957, 'p90_time_FullyConnected': 0.2853184472769499, 'p99_time_FullyConnected': 0.39780672872439016}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.2834270987659693, 'p50_time_FullyConnected': 0.27543300529941916, 'p90_time_FullyConnected': 0.29397032922133803, 'p99_time_FullyConnected': 0.40460130083374674}]}]

Default [faster]

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.116943
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.115967

[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.17584662651643157, 'p50_time_FullyConnected': 0.17386197578161955, 'p90_time_FullyConnected': 0.18297710921615362, 'p99_time_FullyConnected': 0.2026369632221758}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.177996635902673, 'p50_time_FullyConnected': 0.17414247849956155, 'p90_time_FullyConnected': 0.18130375538021326, 'p99_time_FullyConnected': 0.23936283076181997}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.22948215948417783, 'p50_time_FullyConnected': 0.22402300965040922, 'p90_time_FullyConnected': 0.2324321656487882, 'p99_time_FullyConnected': 0.3020400775130837}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.2527528256177902, 'p50_time_FullyConnected': 0.24718954227864742, 'p90_time_FullyConnected': 0.263673288282007, 'p99_time_FullyConnected': 0.3303098364267497}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.26115449611097574, 'p50_time_FullyConnected': 0.25852309772744775, 'p90_time_FullyConnected': 0.26592351496219635, 'p99_time_FullyConnected': 0.304172352189198}]}]

Since Skylake has AVX512. I'm guessing this FC slowdown might be in AVX512 kernel. Let me reproduce that.

ChaiBapchya commented 4 years ago

Can confirm that this issue is specific to AVX512 kernels. Tried this on c5.xl $ lscpu

Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

Results

Default [slower]

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.133789
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.132812

[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.10202302001744101, 'p50_time_FullyConnected': 0.10086749989568489, 'p90_time_FullyConnected': 0.10658760029400582, 'p99_time_FullyConnected': 0.13521948004836298}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.10642346004715364, 'p50_time_FullyConnected': 0.09991750016524747, 'p90_time_FullyConnected': 0.10565369971118344, 'p99_time_FullyConnected': 0.2586996700802042}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.16890607999812346, 'p50_time_FullyConnected': 0.16431500012004108, 'p90_time_FullyConnected': 0.1781331999154645, 'p99_time_FullyConnected': 0.2831235897247094}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.20140223995440465, 'p50_time_FullyConnected': 0.19778950013460417, 'p90_time_FullyConnected': 0.20401089991537447, 'p99_time_FullyConnected': 0.3063294199228036}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.21596427998701984, 'p50_time_FullyConnected': 0.2096700000038254, 'p90_time_FullyConnected': 0.21819640001012885, 'p99_time_FullyConnected': 0.3412436299549877}]}]

MKL Workaround [Faster]

MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f9bcf6fac28,0x7f9bc22f4040,2048,0x7f9b1400ce80,2048,0x7f9bcf6fac30,0x7f9b1405e840,512) 21.25us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:18
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0378418
MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f9bcf6fac28,0x7f9bc22f4040,2048,0x7f9b1400ce80,2048,0x7f9bcf6fac30,0x7f9b14061c00,512) 20.94us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:18
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0371094

[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.11772135999308375, 'p50_time_FullyConnected': 0.1149684999290912, 'p90_time_FullyConnected': 0.1244978000613628, 'p99_time_FullyConnected': 0.14825501980340045}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.120828840035756, 'p50_time_FullyConnected': 0.11370450010872446, 'p90_time_FullyConnected': 0.12752780021401122, 'p99_time_FullyConnected': 0.2412066401620902}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.13385597998421872, 'p50_time_FullyConnected': 0.12600750005731243, 'p90_time_FullyConnected': 0.14806160011175962, 'p99_time_FullyConnected': 0.2509373301927551}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.14175208003507578, 'p50_time_FullyConnected': 0.1372545000322134, 'p90_time_FullyConnected': 0.14401020002878798, 'p99_time_FullyConnected': 0.2423993399725075}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.143890859962994, 'p50_time_FullyConnected': 0.1397979999637755, 'p90_time_FullyConnected': 0.14637689982919258, 'p99_time_FullyConnected': 0.22678783964693117}]}]

To reproduce https://gist.github.com/ChaiBapchya/a849cfd566b8114e695454850b48077b https://gist.github.com/ChaiBapchya/5f2342f75ddeb1e21f14acac665c76ad#file-benchmark_intel_mkl-py

aaraujom commented 4 years ago

Hi thanks for checking. Let me try to reproduce this in my end and fix it.

ChaiBapchya commented 4 years ago

@aaraujom any update?

aaraujom commented 4 years ago

Hi @ChaiBapchya - I was able to reproduce issue on my end. DNNL is missing some kernels for very specific problem sizes affecting the gemm size in this thread. We will address this issue in DNNL.

kpuatamazon commented 4 years ago

Mostly fixed by the v1.6.x release of oneDNN. There is still a speed discrepancy but it's not as bad.

git checkout 6b568fd2c10e8cc14c21d6150ba71933a2af2a41 (current v1.x branch of MXNet including oneDNN 1.6.1)

Shape (5, 512, 512)
0.0000471 seconds for fullyconnected (DNNL)
0.0000413 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0000691 seconds for fullyconnected (DNNL)
0.0000639 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0000936 seconds for fullyconnected (DNNL)
0.0000868 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0000949 seconds for fullyconnected (DNNL)
0.0000870 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000446 seconds for fullyconnected (DNNL)
0.0000392 seconds for dot (MKL)

Revert to oneDNN v1.3.
git checkout 6b568fd2c10e8cc14c21d6150ba71933a2af2a41 && git revert 17111037176294d86b7dc6f719347aa24c96b317 && git revert eae6171cb56190cf06f5b11597380c082745991b

Shape (5, 512, 512)
0.0000842 seconds for fullyconnected (DNNL)
0.0000407 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0001908 seconds for fullyconnected (DNNL)
0.0000606 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0002421 seconds for fullyconnected (DNNL)
0.0000880 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0003462 seconds for fullyconnected (DNNL)
0.0000912 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000841 seconds for fullyconnected (DNNL)
0.0000395 seconds for dot (MKL)

aaraujom commented 4 years ago

@ChaiBapchya - Will those improvements work for you?

ChaiBapchya commented 4 years ago

Is there a way to bring down the deficit and have parity? or do we know the reason why DNNL is slower? @TaoLv @aaraujom

b-shi commented 4 years ago

Hi, we benchmarked gemm performance for those sizes internally using oneDNN vs oneMKL directly and here is what we got:

Max (20K trials) GFlops	oneMKL (1Th)	oneDNN (1Th)	oneMKL (4Th)	oneDNN (4Th)	oneMKL (8Th)	oneDNN (8Th)	oneMKL (28Th)	oneDNN (28Th)
5, 512, 512	34.49	40.96	174.76	262.14	291.27	436.91	374.49	524.29
5, 512, 1536	26.57	27.40	178.73	231.30	357.47	491.52	561.74	873.81
5, 512, 2048	25.83	26.55	143.64	174.76	338.25	436.91	616.81	1048.58
5, 2048, 512	25.51	26.82	139.81	183.96	388.36	524.29	873.81	1310.72
4, 512, 512	31.78	42.80	161.32	233.02	262.14	419.43	299.59	419.43

Avg (20K trials) GFlops	oneMKL (1Th)	oneDNN (1Th)	oneMKL (4Th)	oneDNN (4Th)	oneMKL (8Th)	oneDNN (8Th)	oneMKL (28Th)	oneDNN (28Th)
5, 512, 512	33.96	39.45	169.05	233.88	258.32	364.93	328.68	399.95
5, 512, 1536	26.30	27.13	174.04	226.75	342.51	463.81	492.14	765.09
5, 512, 2048	25.58	26.35	140.57	171.42	328.70	429.25	589.81	873.36
5, 2048, 512	25.27	26.57	137.35	179.91	368.89	508.38	792.29	1101.67
4, 512, 512	31.02	41.91	146.59	216.20	226.68	335.54	271.98	334.98

TaoLv commented 4 years ago

@ChaiBapchya Is it possible to compare the performance of FC in two versions? Comparing FC with dot is not so fair to me. As you may know, they are using different execution paths, FCompute vs. FComputeEx.

pengzhao-intel commented 4 years ago

closing since the fix is updated with the new MKLDNN.

ChaiBapchya commented 4 years ago

@pengzhao-intel v1.7.0 public mxnet uses MKLDNN v1.3 and hence won't have this optimization right? https://github.com/apache/incubator-mxnet/tree/1.7.0/3rdparty This means public mxnet v1.7 won't have this fix. v1.8, on the contrary, would have the optimized mkldnn [by virtue of v1.x branch having mkldnn v1.6+ prior to 1.8.rc0] right?

pengzhao-intel commented 4 years ago

@pengzhao-intel v1.7.0 public mxnet uses MKLDNN v1.3 and hence won't have this optimization right? https://github.com/apache/incubator-mxnet/tree/1.7.0/3rdparty This means public mxnet v1.7 won't have this fix. v1.8, on the contrary, would have the optimized mkldnn [by virtue of v1.x branch having mkldnn v1.6+ prior to 1.8.rc0] right?

Thanks to raise the question. China team is on a public holiday this week. @anko-intel could you help take a look at the issue?

anko-intel commented 4 years ago

@ChaiBapchya, You are right, we also have to include 1.6.4 (https://github.com/apache/incubator-mxnet/pull/19251) to make it happened.

apache / mxnet

When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL #17980

Problem

Testing

Performance impact

Workaround

Recommended fix

Commands

Logs

Default

Workaround

Results

Workaround [slow]

Default [faster]

Python's time-it function in OpPerf

Workaround [slow]

Default [faster]

Results

Default [slower]

MKL Workaround [Faster]