Zha0q1 opened this issue 3 years ago
Here is the source code of the first test case, test_operator.test_convolution_grouping:
https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_operator.py#L1674-L1703
@with_seed()
def test_convolution_grouping():
    for dim in [1, 2, 3]:
        num_filter = 4
        for num_group in [1, 2]:
            kernel = (3,) * dim
            shape = (1, 4) + (9,) * dim
The test only fails when dim=2 and num_group=2, i.e. when kernel=(3, 3) and shape=(1, 4, 9, 9).
The test checks whether the Conv op produces the same output and gradients (for x, w, and b) with num_group=2 as when we do the "grouping" manually. It looks like, for the output and for the x, w, and b gradients, the first halves of the tensors match but the second halves differ.
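To make the comparison concrete, here is a minimal sketch of the check for the failing configuration (kernel=(3, 3), shape=(1, 4, 9, 9), num_group=2). It is a paraphrase of the test's logic for the forward output only, not the actual test code:
# Sketch of the grouped-vs-manual comparison (forward pass only); a
# paraphrase of the test's logic, not the actual test body.
import mxnet as mx
import numpy as np

x = mx.nd.random.normal(shape=(1, 4, 9, 9))
w = mx.nd.random.normal(shape=(4, 2, 3, 3))  # num_filter=4, 4/2=2 channels per group
b = mx.nd.random.normal(shape=(4,))

# Grouped convolution in a single call.
y_grouped = mx.nd.Convolution(data=x, weight=w, bias=b, kernel=(3, 3),
                              num_filter=4, num_group=2)

# Manual "grouping": slice channels, convolve each half, then concatenate.
halves = []
for g in range(2):
    halves.append(mx.nd.Convolution(data=x[:, g * 2:(g + 1) * 2],
                                    weight=w[g * 2:(g + 1) * 2],
                                    bias=b[g * 2:(g + 1) * 2],
                                    kernel=(3, 3), num_filter=2))
y_manual = mx.nd.concat(*halves, dim=1)

# The test expects the outputs (and the x/w/b gradients) to match.
np.testing.assert_allclose(y_grouped.asnumpy(), y_manual.asnumpy(),
                           rtol=1e-3, atol=1e-3)
The assertion failure reported for this case was: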
Mismatched elements: 98 / 196 (50%)
Max absolute difference: 16.347755
Max relative difference: 23.329699
x: array([[[[ 1.597207e-01, 1.540166e+00, 6.755731e+00, -5.728347e+00,
1.296894e+00, -1.326614e+00, 6.375989e+00],
[ 1.170041e+00, 3.974693e-01, 5.603638e+00, -1.179677e+00,...
y: array([[[[ 1.597207e-01, 1.540166e+00, 6.755731e+00, -5.728347e+00,
1.296894e+00, -1.326614e+00, 6.375989e+00],
[ 1.170040e+00, 3.974691e-01, 5.603638e+00, -1.179677e+00,...
Command to run the test:
nosetests-3.4 --with-coverage --cover-inclusive --cover-xml --cover-branches --cover-package=mxnet --with-timer --timer-ok 1 --timer-warning 15 --timer-filter warning,error --with-xunit --xunit-file nosetests_unittest.xml --verbose tests/python/unittest/test_operator.py:test_convolution_grouping
Here is the source code of the second test case: https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_operator.py#L1750-L1829
@with_seed()
def test_convolution_independent_gradients():
    ctx = mx.cpu()
    atol = 1.0e-3
    rtol = 1.0e-3
    reqs = ["null", "write", "add"]
    var_names = ["x", "w", "b"]
    dims = [2]
    num_bases = [1, 8]
    kernel_xs = [3, 5]
    stride_xs = [1, 2]
    pad_xs = [0, 1]
    in_sizes = [7, 32]
    no_biases = [True, False]
This test only fails when dim=2. However, if at the same time we set reqs = ["write", "add"] rather than reqs = ["null", "write", "add"], the test passes.
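For context, here is a hypothetical sketch of how per-variable grad_req combinations like these are typically formed and passed when binding an executor; "null" disables the gradient for that variable, which appears to be the trigger. This only illustrates the pattern and is not the actual test body:
# Hypothetical illustration of iterating grad_req combinations; not the
# actual test code.
import itertools
import mxnet as mx

x = mx.sym.Variable("x")
w = mx.sym.Variable("w")
b = mx.sym.Variable("b")
conv = mx.sym.Convolution(data=x, weight=w, bias=b, kernel=(3, 3), num_filter=1)

reqs = ["null", "write", "add"]  # "null" means: do not compute this gradient
var_names = ["x", "w", "b"]
for combo in itertools.product(reqs, repeat=len(var_names)):
    grad_req = dict(zip(var_names, combo))
    exe = conv.simple_bind(mx.cpu(), x=(1, 4, 7, 7), grad_req=grad_req)
    exe.forward(is_train=True)
    exe.backward(mx.nd.ones(exe.outputs[0].shape))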
failure:
======================================================================
FAIL: test_operator.test_convolution_independent_gradients
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/home/ubuntu/incubator-mxnet/tests/python/unittest/common.py", line 218, in test_new
orig_test(*args, **kwargs)
File "/home/ubuntu/incubator-mxnet/tests/python/unittest/test_operator.py", line 2242, in test_convolution_independent_gradients
exe2.outputs[0].asnumpy(), rtol=rtol, atol=atol)
File "/home/ubuntu/.local/lib/python3.6/site-packages/numpy/testing/_private/utils.py", line 1528, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/home/ubuntu/.local/lib/python3.6/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=0.001
Mismatched elements: 50 / 50 (100%)
Max absolute difference: 7.249808
Max relative difference: 37.313366
x: array([[[[-3.723791, 4.639688, -0.084135, 1.286607, 5.300301],
[ 2.492521, -4.99145 , -2.470633, 0.893196, -1.087579],
[-0.640068, -1.888586, -1.97117 , -0.046645, -2.467756],...
y: array([[[[ 2.861494, -1.560436, 2.371599, 6.006711, 1.072686],
[-3.419539, -2.064956, 2.353943, 0.867886, 0.583497],
[-3.662952, 0.055854, -2.707001, -3.451663, 2.091815],...
Command to run the test:
nosetests-3.4 --with-coverage --cover-inclusive --cover-xml --cover-branches --cover-package=mxnet --with-timer --timer-ok 1 --timer-warning 15 --timer-filter warning,error --with-xunit --xunit-file nosetests_unittest.xml --verbose tests/python/unittest/test_operator.py:test_convolution_independent_gradients
@szha @anko-intel @josephevans I have collected more info about the two test failures. I've also prepared a clean EC2 instance with this specific MXNet build so that others can quickly get access. Please let me know if there is anything to try, or if I can add your secret to the instance. Help is much appreciated!
Hi @Zha0q1, could you provide this container and instructions on how to build it?
Hi @Zha0q1, can you confirm that gemm:acl is called for the tests on grouped convolutions? In my logs I am seeing it fall back to reference kernels (gemm:ref) when it performs the following check: https://github.com/oneapi-src/oneDNN/blob/83ebc40d86bc54f0f23e947235e53570eeacf254/src/cpu/aarch64/acl_gemm_convolution_utils.cpp#L72
// Compute Library unsupported shape scenarios
if (one_of(true, is_3d, is_1d, with_groups)) {
    return status::unimplemented;
}
Hi @cfRod Thanks for taking a look at this! Yeah I will prepare a docker image for you today. I am not sure which kernel it was using -- our observation was that MXNet + oneDNN on aarch64 had no issues whatsoever, while MXNet + oneDNN + ACL would fail the two tests, so it seems related to the ACL integration.
Hi @Zha0q1, I can confirm that I've built MXNet with oneDNN and ACL, and the behaviour I observed is that while running this test it falls back to the reference kernels and the test passes. The fallback to the oneDNN reference is because of the check in the ACL integration layer (as I linked above).
ldd libmxnet.so
    linux-vdso.so.1 (0x0000ffffa7a60000)
    libdl.so.2 => /lib64/libdl.so.2 (0x0000ffffa1f30000)
    libdnnl.so.2 => /home/ubuntu/mxnet/mxnet/build/3rdparty/mkldnn/src/libdnnl.so.2 (0x0000ffffa1600000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000ffffa15c0000)
    libarm_compute.so => /home/ubuntu/mxnet/ComputeLibrary/build/libarm_compute.so (0x0000ffffa1230000)
    libarm_compute_graph.so => /home/ubuntu/mxnet/ComputeLibrary/build/libarm_compute_graph.so (0x0000ffffa1130000)
    libarm_compute_core.so => /home/ubuntu/mxnet/ComputeLibrary/build/libarm_compute_core.so (0x0000ffffa08b0000)
    libopenblas.so.0 => /home/ubuntu/mxnet/openblas/0.3.9/lib/libopenblas.so.0 (0x0000ffff9fe70000)
Yeah I will prepare a docker image for you today.
Cheers, that would be great.
Hi @cfRod, thanks for your reply! Hmm, that's weird. Anyway, I am creating a docker image now and will share it shortly.
@cfRod I was able to reproduce the test failure within a docker container. I uploaded the image here: public.ecr.aws/y2l8i4q9/acl_debug:latest. Please let me know if you are able to pull & reproduce, thanks!
test_operator.test_convolution_grouping
test_operator.test_convolution_independent_gradients
P.S. The second test causes a crash for some reason, but this was not an issue in my non-docker Ubuntu environment. I think for now we can start with the first test.
/usr/local/gcc-8.5.0/include/c++/8.5.0/bits/stl_vector.h:950: std::vector<_Tp, _Alloc>::const_reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) const [with _Tp = mxnet::OpReqType; _Alloc = std::allocator<mxnet::OpReqType>; std::vector<_Tp, _Alloc>::const_reference = const mxnet::OpReqType&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.
Aborted (core dumped)
@Zha0q1, thanks, I was able to pull the image and reproduce your issue. Update on my build: I was also able to reproduce it on my version of the build. I will look into this.
Hi @Zha0q1 ,
This is my WIP analysis so far (see the verbose logs below to support it). The convolution with grouping calls into the gemm reference kernels, which is expected, since grouped convolution is currently unsupported by Compute Library. In the case of manually slicing the input and running the convolution on the two slices of inputs and weights, it calls gemm:acl, and the first half of the output matches the output from the reference kernel, but the second half does not. My initial analysis is that the primitive is created once and reused twice with different inputs and weights (see logs below). However, Compute Library assumes the weights are immutable: it preprocesses the weights internally once per primitive creation, so the second time around the new set of weights is not picked up by ACL.
/home/ubuntu/mxnet/mxnet/src/executor/graph_executor.cc:1992: Subgraph backend MKLDNN is activated.
[14:52:42] /home/ubuntu/mxnet/mxnet/src/executor/graph_executor.cc:1992: Subgraph backend MKLDNN is activated.
dnnl_verbose,info,oneDNN v2.0.0 (commit 83ebc40d86bc54f0f23e947235e53570eeacf254)
dnnl_verbose,info,cpu,runtime:OpenMP
dnnl_verbose,info,cpu,isa:Generic
dnnl_verbose,info,gpu,runtime:none
# THIS IS GEMM REFERENCE
dnnl_verbose,create:cache_miss,cpu,convolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcde:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_g2ic4oc4_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.130859
dnnl_verbose,exec,cpu,convolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcde:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_g2ic4oc4_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,19.759
# THIS IS GEMM ACL with sliced inputs
dnnl_verbose,create:cache_miss,cpu,convolution,gemm:acl,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.0969238
dnnl_verbose,exec,cpu,convolution,gemm:acl,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.415039
dnnl_verbose,exec,cpu,convolution,gemm:acl,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.321045
# THIS IS THE CONCAT STAGE
dnnl_verbose,create:cache_miss,cpu,concat,simple:any,undef,src_f32::blocked:abcd:f0 src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,axis:1,1x2x7x7:1x2x7x7 1x4x7x7,0.0700684
dnnl_verbose,exec,cpu,concat,simple:any,undef,src_f32::blocked:abcd:f0 src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,axis:1,1x2x7x7:1x2x7x7 1x4x7x7,0.468994
We've seen a similar issue with primitive caching in TensorFlow and I am looking to confirm the hypothesis, possibly by modifying the test.
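One way to probe this hypothesis, assuming both executions hit the same cached gemm:acl primitive, is to run two convolutions with identical descriptors but different weight values and check whether the second result incorrectly matches the first. This is a hedged diagnostic sketch, not a confirmed reproducer:
# Diagnostic sketch for the stale-weights hypothesis; assumes both calls
# reuse one cached gemm:acl primitive on the affected build.
import mxnet as mx
import numpy as np

x = mx.nd.random.normal(shape=(1, 2, 9, 9))
w1 = mx.nd.random.normal(shape=(2, 2, 3, 3))
w2 = mx.nd.random.normal(shape=(2, 2, 3, 3))  # same descriptor, different values
b = mx.nd.zeros(shape=(2,))

y1 = mx.nd.Convolution(data=x, weight=w1, bias=b, kernel=(3, 3), num_filter=2)
y2 = mx.nd.Convolution(data=x, weight=w2, bias=b, kernel=(3, 3), num_filter=2)

# With correct behaviour these differ; if the cached primitive keeps using the
# weights preprocessed for the first call, y2 would (incorrectly) equal y1.
print("y2 matches y1 (stale weights?):",
      np.allclose(y1.asnumpy(), y2.asnumpy(), rtol=1e-3, atol=1e-3))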
@cfRod, thanks for the detailed analysis, this makes a lot of sense! Just to understand it better: is this a oneDNN & ACL integration issue (rather than an MXNet & oneDNN or TensorFlow & oneDNN integration issue)? Another quick question: do you work with oneDNN or Arm? I am trying to figure out whether we need to loop in more people to fix this :)
Hi @Zha0q1, in this case it is an issue with oneDNN and ACL rather than a framework-specific issue.
do you work with oneDNN or Arm
Yes. I am part of the team that is working on this.
Thanks 👍 👍
Hi @cfRod I have been tracking your discussion on the other thread, thanks for working on this! Meanwhile, were you able to reproduce the other test failure, test_operator.test_convolution_independent_gradients? This test "fails fine" in my native environment but crashes in the docker container. I am not sure why, but I just wanted to know if you could reproduce it, thanks!
Would appreciate your input @cfRod :)
Yes, this seems to be an issue that shows up with -D_GLIBCXX_ASSERTIONS turned on. In MXNet this is turned on with -DCMAKE_BUILD_TYPE=Debug.
Is your non-docker MXNet built with -DCMAKE_BUILD_TYPE=Release?
Right, I think the non-docker build is Release. So you were able to reproduce test_operator.test_convolution_independent_gradients?
Yes, I was able to reproduce the issue outside docker with a -DCMAKE_BUILD_TYPE=Debug build.
Hi @Zha0q1, could you point me to all the tests that I need to run for submitting a PR on AArch64? Is it nosetests --verbose tests/python/unittest?
Yeah I think that's exactly right :)
Related question raised here: https://github.com/apache/incubator-mxnet/discussions/20457
Thanks @cfRod for adding the workaround for test_convolution_grouping. Can you also give an update on the status of test_convolution_independent_gradients? From your comments, I gather that you were able to reproduce it with the -DCMAKE_BUILD_TYPE=Debug build flag.
Hi mseth10, I was able to reproduce the crash caused by the GLIBCXX assertions being turned on, but I have not looked further into the second issue.
I was trying to build MXNet 1.x with MKLDNN with ACL (Arm Compute Library) integration on an Arm instance. I used this cmake config file to integrate MKLDNN with ACL. The build was very performant and would surely benefit MXNet users hugely: I got a 3-4X boost with (16/64, 3, 512, 512) on ResNet compared to MKLDNN with no integration. However, two operator unit tests failed on this build:
I tried multiple MKLDNN versions (v1.x now points to MKLDNN release 2.0 beta 10; I also tried releases 2.1 and 2.2) and ACL versions (20.08, from Aug 2020, and 21.2, the latest), and the failures persisted.
This made me suspect that there is some integration issue with MXNet-MKLDNN (more likely) or MKLDNN-ACL. Would someone from the Intel team help share some insights on this?
Would you help triage? @szha @leezu
CC @josephevans @mseth10 @waytrue17 @sandeep-krishnamurthy