szha commented 7 years ago

@glingyan While building a version with MKLML, I discovered that the MXNet Python API doesn't seem to work with MKL2017_ML turned on. Please make sure Python unit testing is included going forward. Related PRs: #4128 #4237 #4433 #4599 #5000

Environment info

Operating System: Ubuntu 14.04
Compiler: gcc/g++ 4.8.4
Package used (Python/R/Scala/Julia): Python

Installed from source: https://github.com/dmlc/mxnet/tree/0fc54d345e019002b9008abf8290fa15ad072443

Python version and distribution: Python 2.7.6

Error Message:


% nosetests --verbose ../tests/python/unittest/
test_attr.test_attr_basic ... ok
test_attr.test_operator ... ok
test_attr.test_list_attr ... ok
test_attr.test_attr_dict ... ok
test_executor.test_bind ... ok
test_executor.test_dot ... ok
test_executor.test_reshape ... ok
test_infer_shape.test_mlp2_infer_shape ... ok
test_infer_shape.test_mlp2_infer_error ... ok
test_infer_shape.test_backward_infer ... ok
test_init.test_default_init ... ok
test_init.test_variable_init ... ok
test_init.test_aux_init ... Segmentation fault (core dumped)

Minimum reproducible example

77c85
< USE_MKL2017 = 1
---
> USE_MKL2017 = 0
81c89
< USE_MKL2017_EXPERIMENTAL = 1
---
> USE_MKL2017_EXPERIMENTAL = 0

Steps to reproduce

Build libmxnet.so with the flags from the minimum reproducible example above.
cd python
nosetests --verbose ../tests/python/unittest/
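Because a segmentation fault kills the whole nosetests process, the run stops at the first crashing test and hides everything after it. One way to see which test files crash and which merely fail is to run each file in its own process. The following is a minimal sketch, not part of the MXNet repository: `describe_exit` and `run_isolated` are hypothetical helper names, and it assumes Python 3.5+ on a POSIX system, where a negative return code identifies the signal that killed the child.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a short label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by that signal.
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return "killed by signal %d" % -returncode
    return "failed (exit code %d)" % returncode


def run_isolated(commands):
    """Run each command in its own process so one crash cannot hide the rest."""
    results = {}
    for name, cmd in commands.items():
        proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        results[name] = describe_exit(proc.returncode)
    return results


# Self-contained demonstration with two trivial child processes.
demo = run_isolated({
    "passes": [sys.executable, "-c", "pass"],
    "fails": [sys.executable, "-c", "import sys; sys.exit(2)"],
})
print(demo)
```

For the case in this issue, each command could be something like `["nosetests", "--verbose", "../tests/python/unittest/test_init.py"]`, one entry per test file; a crash would then show up as `killed by SIGSEGV` for that file only.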

What have you tried to solve it?

I tried the previous release, 0.9.3; with MKL enabled it also didn't work:

test_attr.test_attr_basic ... ok
test_attr.test_operator ... ok
test_attr.test_list_attr ... ok
test_attr.test_attr_dict ... ok
test_executor.test_bind ... ok
test_executor.test_dot ... ok
test_executor.test_reshape ... ok
test_infer_shape.test_mlp2_infer_shape ... ok
test_infer_shape.test_mlp2_infer_error ... ok
test_infer_shape.test_backward_infer ... ok
test_init.test_default_init ... FAIL
test_init.test_variable_init ... ERROR
test_init.test_aux_init ... Segmentation fault (core dumped)

Also, what was the thinking behind having prepare_mkl.sh install to /usr/local by default? That directory is not writable by non-root users by default. Should users be using sudo to build? https://github.com/dmlc/mxnet/blob/0fc54d345e019002b9008abf8290fa15ad072443/make/config.mk#L82 I'd suggest taking the auto-install out and asking users to run prepare_mkl.sh separately (documented in MKL_README), so that users don't have to run as root just to build.

glingyan commented 7 years ago

will check

glingyan commented 7 years ago

For /usr/local, you can check make/config.mk:

# MKL ML Library folder, need to be root for /usr/local
# Change to User Home directory for standard user
# For USE_BLAS!=mkl only
MKLML_ROOT=/home/lingyan/mklml
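Whether /usr/local is a workable default depends on whether the user running the build can write to it. A minimal sketch of that check follows; `first_writable_dir` is a hypothetical helper, `~/mklml` is an illustrative fallback modelled on the MKLML_ROOT value above, and the temporary directory is included only so the demonstration always finds a candidate.

```python
import os
import tempfile


def first_writable_dir(candidates):
    """Return the first existing directory the current user can write to, or None."""
    for path in candidates:
        expanded = os.path.expanduser(path)
        # W_OK: may create files there; X_OK: may enter the directory.
        if os.path.isdir(expanded) and os.access(expanded, os.W_OK | os.X_OK):
            return expanded
    return None


# Prefer the system-wide location, then a per-user directory (hypothetical),
# then the temp directory as a last resort for this demonstration.
print(first_writable_dir(["/usr/local", "~/mklml", tempfile.gettempdir()]))
```

As a non-root user on a typical Linux system this skips /usr/local, which is the situation described in this thread.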

glingyan commented 7 years ago

Because servers are usually multi-user, it is also very common for several users to share /usr/local.

szha commented 7 years ago

In the whole build process, prepare_mkl.sh is the only part that requires root access. Having to run as root just for this part doesn't seem to be the best idea. That said, let's focus on making the python API work first.

glingyan commented 7 years ago

With the latest master, I do not get the segmentation fault:

(mxnet_env) lingyan@aocl-server:~/intel_mxnet/mxnet$ nosetests --verbose ./tests/python/unittest/
test_attr.test_attr_basic ... ERROR
test_attr.test_operator ... ok
test_attr.test_list_attr ... FAIL
test_attr.test_attr_dict ... ERROR
test_executor.test_bind ... ok
test_executor.test_dot ... ok
test_executor.test_reshape ... ok
test_infer_shape.test_mlp2_infer_shape ... ok
test_infer_shape.test_mlp2_infer_error ... ok
test_infer_shape.test_backward_infer ... ERROR
test_init.test_default_init ... FAIL
test_init.test_variable_init ... ERROR
test_init.test_aux_init ... ok
test_io.test_MNISTIter ... --2017-02-21 12:27:02--  http://data.mxnet.io/mxnet/data/mnist.zip
Resolving child-prc.intel.com (child-prc.intel.com)... 10.239.4.80
Connecting to child-prc.intel.com (child-prc.intel.com)|10.239.4.80|:913... connected.
Proxy request sent, awaiting response... 200 OK
Length: 11595270 (11M) [application/zip]

glingyan commented 7 years ago

But if I run the test file directly, there is a memory issue:

erver:~/intel_mxnet/mxnet/tests/python/unittest$ python test_init.py
Error in `python': free(): invalid pointer: 0x0000000000fdd730
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fc4c03fb7e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7fc4c0403e0a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fc4c040798c]
/home/lingyan/mxnet_env/local/lib/python2.7/site-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7storage16CPUDeviceStorage4FreeEPv+0x18)[0x7fc41cdc9020]
/home/lingyan/mxnet_env/local/lib/python2.7/site-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7storage19NaiveStorageManagerINS0_16CPUDeviceStorageEE4FreeEPvm+0x20)[0x7fc41cdca8d0]
/home/lingyan/mxnet_env/local/lib/python2.7/site-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet11StorageImpl4FreeENS_7Storage6HandleE+0x90)[0x7fc41cdc831a]

szha commented 7 years ago

On my side, I saw that the libmxnet.so links against the following .so's:

Dynamic section at offset 0x11c69f0 contains 32 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libiomp5.so]
 0x0000000000000001 (NEEDED)             Shared library: [libmklml_intel.so]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]

Note that it's linked against libmklml_intel.so instead of libmklml_gnu.so, even though I'm using gcc on Ubuntu. Not sure if this helps.
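Comparing the linked libraries between builds is easier when the NEEDED entries are extracted programmatically. The sketch below parses text in the format shown in the dynamic-section listing; the helper names are hypothetical, and the mapping of libiomp5.so to the Intel OpenMP runtime and libgomp.so.1 to the GNU one is the only knowledge it encodes.

```python
import re

# Matches lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libiomp5.so]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]")


def needed_libraries(dynamic_section_text):
    """Extract the NEEDED shared-library names from a dynamic-section listing."""
    return NEEDED_RE.findall(dynamic_section_text)


def openmp_runtimes(libs):
    """Return which OpenMP runtimes appear among the linked libraries."""
    known = {"libiomp5.so": "Intel OpenMP", "libgomp.so.1": "GNU OpenMP"}
    return [known[lib] for lib in libs if lib in known]


sample = """\
 0x0000000000000001 (NEEDED)             Shared library: [libiomp5.so]
 0x0000000000000001 (NEEDED)             Shared library: [libmklml_intel.so]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
"""
libs = needed_libraries(sample)
print(libs)
print(openmp_runtimes(libs))
```

Running the same helpers on the listing from a build with MKL disabled would report the GNU runtime instead, which makes the difference between the two builds explicit.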

glingyan commented 7 years ago

The linking should be OK; I see the same on my side:

ldd libmxnet.so
linux-vdso.so.1 => (0x00007fffcd155000)
libmklml_intel.so => not found
libiomp5.so => not found
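In ldd output, "not found" means the dynamic loader could not locate that library on its search path in the shell where ldd was run; whether it matters depends on the environment in which Python is later started. A small sketch for listing such entries follows (`missing_libraries` is a hypothetical helper, and the sample text is copied from the output in this comment).

```python
def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    names = []
    for line in ldd_output.splitlines():
        if "=>" in line and line.strip().endswith("not found"):
            names.append(line.split("=>")[0].strip())
    return names


sample = """\
linux-vdso.so.1 => (0x00007fffcd155000)
libmklml_intel.so => not found
libiomp5.so => not found
"""
missing = missing_libraries(sample)
print(missing)
```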

glingyan commented 7 years ago

@zhenlinluo

glingyan commented 7 years ago

fixed in https://github.com/dmlc/mxnet/pull/5088

szha commented 7 years ago

Thanks @glingyan. Is this passing all Python unit tests on your end? It's still failing on my side: in test_module.test_save_load I'm seeing this error:

nnvm/include/nnvm/tuple.h:449: Check failed: dim == ndim() (4 vs. 3) dimension do not match target dimension 4 vs 3

I'm using this commit https://github.com/glingyan/mxnet/tree/54f8c5f0b89aaf513e4a8b9ee17eefafdec27cf0

glingyan commented 7 years ago

I get the same result as you, but I also get the same result with MKL disabled, so it should not be an MKL issue.

szha commented 7 years ago

> I get the same result as you, but I also get the same result with MKL disabled, so it should not be an MKL issue.

I just verified this claim but it doesn't seem to be the case. I built a version with MKL disabled and all tests passed.

nosetests --verbose tests/python/unittest/
libdc1394 error: Failed to initialize libdc1394
test_attr.test_attr_basic ... ok
test_attr.test_operator ... ok
test_attr.test_list_attr ... ok
test_attr.test_attr_dict ... ok
test_executor.test_bind ... ok
test_executor.test_dot ... ok
test_executor.test_reshape ... ok
test_infer_shape.test_mlp2_infer_shape ... ok
test_infer_shape.test_mlp2_infer_error ... ok
test_infer_shape.test_backward_infer ... ok
test_init.test_default_init ... ok
test_init.test_variable_init ... ok
test_init.test_aux_init ... ok
test_io.test_MNISTIter ... --2017-02-21 06:41:47--  http://data.mxnet.io/mxnet/data/mnist.zip
Resolving data.mxnet.io (data.mxnet.io)... 54.208.175.7
Connecting to data.mxnet.io (data.mxnet.io)|54.208.175.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11595270 (11M) [application/zip]
Saving to: ‘data/mnist.zip’

100%[=================================================================================================================================================================================================================================================>] 11,595,270  72.9MB/s   in 0.2s

2017-02-21 06:41:47 (72.9 MB/s) - ‘data/mnist.zip’ saved [11595270/11595270]

Archive:  mnist.zip
  inflating: t10k-images-idx3-ubyte
  inflating: t10k-labels-idx1-ubyte
  inflating: train-images-idx3-ubyte
  inflating: train-labels-idx1-ubyte
[06:41:49] src/io/iter_mnist.cc:91: MNISTIter: load 60000 images, shuffle=1, shape=(100,784)
ok
test_io.test_Cifar10Rec ... ok
test_io.test_NDArrayIter ... ok
single key-value pair push & pull ... ok
test init ... ok
list key-value pair push & pull ... ok
aggregate value on muliple devices ... ok
updater ... ok
test_kvstore.test_get_type ... ok
test_model_parallel.test_chain ... ok
test_module.test_module_layout ... ok
test_module.test_save_load ... ok
test_module.test_module_reshape ... ok
test_multi_device_exec.test_ctx_group ... ok
test_ndarray.test_ndarray_setitem ... ok
test_ndarray.test_ndarray_elementwise ... ok
test_ndarray.test_ndarray_elementwisesum ... ok
test_ndarray.test_ndarray_negate ... ok
test_ndarray.test_ndarray_choose ... ok
test_ndarray.test_ndarray_fill ... ok
test_ndarray.test_ndarray_onehot ... [06:41:50] src/ndarray/./ndarray_function-inl.h:68: The operator onehot_encode is deprecated; use one_hot instead.
[06:41:50] src/ndarray/./ndarray_function-inl.h:68: The operator onehot_encode is deprecated; use one_hot instead.
[06:41:50] src/ndarray/./ndarray_function-inl.h:68: The operator onehot_encode is deprecated; use one_hot instead.
ok
test_ndarray.test_ndarray_copy ... ok
test_ndarray.test_ndarray_scalar ... ok
test_ndarray.test_ndarray_pickle ... ok
test_ndarray.test_ndarray_saveload ... ok
test_ndarray.test_ndarray_slice ... ok
test_ndarray.test_ndarray_crop ... ok
test_ndarray.test_ndarray_concatenate ... ok
test_ndarray.test_clip ... ok
...

And the rebuilt libmxnet.so links with these:

Dynamic section at offset 0x12ea9f0 contains 30 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgomp.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]
 0x000000000000000c (INIT)               0x39320

glingyan commented 7 years ago

That is strange; I will double-check.

glingyan commented 7 years ago

Confirmed, MXNet has no problem. I will submit a fix so that all test cases pass.

glingyan commented 7 years ago

(mxnet_env) lingyan@aocl-server:~/intel_mxnet/mxnet/python$ nosetests --verbose ../tests/python/unittest/
test_attr.test_attr_basic ... ok
test_attr.test_operator ... ok
test_attr.test_list_attr ... ok
test_attr.test_attr_dict ... ok
test_executor.test_bind ... ok
test_executor.test_dot ... ok
test_executor.test_reshape ... ok
test_infer_shape.test_mlp2_infer_shape ... ok
test_infer_shape.test_mlp2_infer_error ... ok
test_infer_shape.test_backward_infer ... ok
test_init.test_default_init ... ok
test_init.test_variable_init ... ok
test_init.test_aux_init ... ok
test_io.test_MNISTIter ... [16:06:18] src/io/iter_mnist.cc:91: MNISTIter: load 60000 images, shuffle=1, shape=(100,784)
ok
test_io.test_Cifar10Rec ... ok
test_io.test_NDArrayIter ... ok
single key-value pair push & pull ... ok
test init ... ok
list key-value pair push & pull ... ok
aggregate value on muliple devices ... ok
updater ... ok
test_kvstore.test_get_type ... ok
test_model_parallel.test_chain ... ok
test_module.test_module_layout ... ok
test_module.test_save_load ... ok
test_module.test_module_reshape ... ok
test_multi_device_exec.test_ctx_group ... ok
test_ndarray.test_ndarray_setitem ... ok
test_ndarray.test_ndarray_elementwise ... ok
test_ndarray.test_ndarray_elementwisesum ... ok
test_ndarray.test_ndarray_negate ... ok
test_ndarray.test_ndarray_choose ... ok
test_ndarray.test_ndarray_fill ... ok
test_ndarray.test_ndarray_onehot ... [16:06:21] src/ndarray/./ndarray_function-inl.h:68: The operator onehot_encode is deprecated; use one_hot instead.
[16:06:21] src/ndarray/./ndarray_function-inl.h:68: The operator onehot_encode is deprecated; use one_hot instead.
[16:06:21] src/ndarray/./ndarray_function-inl.h:68: The operator onehot_encode is deprecated; use one_hot instead.
ok
test_ndarray.test_ndarray_copy ... ok
test_ndarray.test_ndarray_scalar ... ok
test_ndarray.test_ndarray_pickle ... ok
test_ndarray.test_ndarray_saveload ... ok
test_ndarray.test_ndarray_slice ... ok
test_ndarray.test_ndarray_crop ... ok
test_ndarray.test_ndarray_concatenate ... ok
test_ndarray.test_clip ... ok
test_ndarray.test_dot ... ok
test_ndarray.test_reduce ... ok
test_ndarray.test_broadcast ... ok
test_ndarray.test_broadcast_binary ... ok
test_ndarray.test_arange ... ok
test_ndarray.test_order ... ok
test_ndarray.test_ndarray_equal ... ok
test_ndarray.test_ndarray_not_equal ... ok
test_ndarray.test_ndarray_greater ... ok
test_ndarray.test_ndarray_greater_equal ... ok
test_ndarray.test_ndarray_lesser ... ok
test_ndarray.test_ndarray_lesser_equal ... ok
test_ndarray.test_take ... ok
test_operator.test_elementwise_sum ... ok
test_operator.test_concat ... ok
test_operator.test_slice_channel ... ok
test_operator.test_regression ... ok
test_operator.test_softmax ... ok
test_operator.test_python_op ... ok
test_operator.test_swapaxes ... ok
test_operator.test_scalarop ... ok
test_operator.test_scalar_pow ... ok
test_operator.test_symbol_pow ... ok
test_operator.test_pow_fn ... ok
test_operator.test_binary_logic ... ok
test_operator.test_embedding ... ok
test_operator.test_binary_op_duplicate_input ... ok
test_operator.test_sign ... ok
test_operator.test_round_ceil_floor ... ok
test_operator.test_rsqrt_cos_sin ... ok
test_operator.test_maximum_minimum ... ok
test_operator.test_maximum_minimum_scalar ... ok
test_operator.test_abs ... ok
test_operator.test_deconvolution ... MKL Build:20170209
ok
test_operator.test_nearest_upsampling ... ok
test_operator.test_batchnorm_training ... ok
test_operator.test_convolution_grouping ... ok
test_operator.test_binary_op ... ok
test_operator.test_broadcast_binary_op ... ok
test_operator.test_run_convolution_dilated_impulse_response ... ok
test_operator.test_convolution_dilated_impulse_response ... [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization
[... the line above is repeated many more times ...]
ok
test_operator.test_reshape ... [16:06:30] src/operator/tensor/./matrix_op-inl.h:158: Using target_shape will be deprecated.
ok
test_operator.test_reduce ... ok
test_operator.test_broadcast ... ok
test_operator.test_transpose ... ok
test_operator.test_expand_dims ... ok
test_operator.test_crop ... ok
test_operator.test_slice_axis ... ok
test_operator.test_flip ... ok
test_operator.test_stn ... ok
test_operator.test_dot ... ok
test_operator.test_batch_dot ... ok
test_operator.test_correlation ... ok
test_operator.test_support_vector_machine_l1_svm ... ok
test_operator.test_support_vector_machine_l2_svm ... ok
test_operator.test_roipooling ... ok
test_operator.test_pad ... ok
test_operator.test_instance_normalization ... ok
test_operator.test_l2_normalization ... ok
test_operator.test_sequence_mask ... ok
test_operator.test_mathematical ... ok
test_operator.test_special_functions_using_scipy ... ok
test_operator.test_clip ... ok
test_operator.test_init ... ok
test_operator.test_order ... ok
test_operator.test_blockgrad ... ok
test_operator.test_take ... ok
test_operator.test_grid_generator ... ok
test_operator.test_bilinear_sampler ... ok
test_operator.test_index2d ... ok
test_operator.test_cast ... ok
test_operator.test_repeat ... ok
test_operator.test_tile ... ok
test_operator.test_one_hot ... ok
test_operator.test_where ... ok
test_optimizer.test_lr_wd_mult ... ok
test_optimizer.test_adam ... ok
test_optimizer.test_rms ... ok
test_random.test_random ... ok
test_recordio.test_recordio ... ok
test_recordio.test_indexed_recordio ... ok
test_recordio.test_recordio_pack_label ... ok
test_rnn.test_rnn ... ok
test_rnn.test_lstm ... ok
test_rnn.test_stack ... ok
test_symbol.test_symbol_basic ... ok
test_symbol.test_symbol_compose ... ok
test_symbol.test_symbol_copy ... ok
test_symbol.test_symbol_internal ... ok
test_symbol.test_symbol_pickle ... ok
test_symbol.test_symbol_saveload ... ok
test_symbol.test_symbol_infer_type ... ok
test_symbol.test_symbol_infer_shape ... ok
Test specifying shape information when constructing a variable ... ok
test_symbol.test_load_000800 ... [16:07:01] src/nnvm/legacy_json_util.cc:175: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
ok
test_viz.test_print_summary ... ok

Ran 138 tests in 46.207s

OK

glingyan commented 7 years ago

I commented out the test case below, because MKL does not support addto mode and there is no way to work around it:

(mxnet_env) lingyan@aocl-server:~/intel_mxnet/mxnet$ git diff tests/
diff --git a/tests/python/unittest/test_operator.py b/tests/python/unittest/test_operator.py
index d94cf9a..c4adbbf 100644
--- a/tests/python/unittest/test_operator.py
+++ b/tests/python/unittest/test_operator.py
@@ -636,7 +636,7 @@ def check_deconvolution_forward_backward(input_shape, num_filter, kernel, stride
     out = exe.outputs[0].asnumpy()
     exe.backward(out_grad)
     assert_almost_equal(out, args_grad[0].asnumpy(), rtol=1E-3, atol=1e-4)
     '''
     args_grad_addto_npy = [np.random.normal(size=s) for s in arg_shapes]
     args_grad_addto = [mx.nd.array(ele) for ele in args_grad_addto_npy]
     exe = deconv.bind(default_context(), args=args, args_grad=args_grad_addto, grad_req="add")
@@ -644,7 +644,7 @@ def check_deconvolution_forward_backward(input_shape, num_filter, kernel, stride
     out = exe.outputs[0].asnumpy()
     exe.backward(out_grad)
     assert_almost_equal(out + args_grad_addto_npy[0], args_grad_addto[0].asnumpy(), rtol=1e-4, atol=1e-4)
     '''

szha commented 7 years ago

I tested the new change. The deconvolution test is still failing consistently on my side. I tried three times and all failed.

======================================================================
FAIL: test_operator.test_deconvolution
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/ubuntu/mxnet/tests/python/unittest/test_operator.py", line 735, in test_deconvolution
    pad                 = (1,1)
  File "/home/ubuntu/mxnet/tests/python/unittest/test_operator.py", line 646, in check_deconvolution_forward_backward
    assert_almost_equal(out + args_grad_addto_npy[0], args_grad_addto[0].asnumpy(), rtol=1e-4, atol=1e-4)
  File "somepath/test_utils.py", line 143, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 10214.023994 exceeds tolerance rtol=0.000100, atol=0.000100.  Location of maximum error:(0, 0, 0, 4), a=-1.673050, b=-0.322374
 a: array([[[[ 13.58003861,  -7.95524284,  -5.33531814,  -0.20733803,
           -1.67304976],
         [ -4.42065098,  15.15990298, -16.49854232,   4.75020522,...
 b: array([[[[ 14.02015686,  -7.67520428,  -5.28275585,  -2.73052788,
           -0.32237393],
         [ -2.92478704,  14.92308807, -15.42502975,   4.12596035,...

----------------------------------------------------------------------
Ran 138 tests in 49.399s

FAILED (failures=1)
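The "Error 10214.023994" figure can be reproduced from the reported values of a and b. The sketch below assumes the error is computed as |a - b| / (atol + rtol * |b|), so that any value above 1 exceeds the tolerance; this formula is inferred from the numbers in the message and is not quoted from test_utils.py.

```python
def tolerance_ratio(a, b, rtol, atol):
    """Return |a - b| / (atol + rtol * |b|); values above 1 exceed the tolerance."""
    return abs(a - b) / (atol + rtol * abs(b))


# Values taken from "Location of maximum error:(0, 0, 0, 4)" in the failure above.
ratio = tolerance_ratio(a=-1.673050, b=-0.322374, rtol=1e-4, atol=1e-4)
print(round(ratio, 2))
```

The result lands within rounding distance of the reported error, so the two arrays differ by roughly four orders of magnitude more than the tolerance allows. That is a genuinely different result, not floating-point noise.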

glingyan commented 7 years ago

Yes, please check my last comment: MKL does not support addto mode and there is no way to work around it.

szha commented 7 years ago

How did you get past this case in the previous comment?

glingyan commented 7 years ago

Just comment out the code below in check_deconvolution_forward_backward:

args_grad_addto_npy = [np.random.normal(size=s) for s in arg_shapes]
args_grad_addto = [mx.nd.array(ele) for ele in args_grad_addto_npy]
exe = deconv.bind(default_context(), args=args, args_grad=args_grad_addto, grad_req="add")
out = exe.outputs[0].asnumpy()
exe.backward(out_grad)
assert_almost_equal(out + args_grad_addto_npy[0], args_grad_addto[0].asnumpy(), rtol=1e-4, atol=1e-4)

szha commented 7 years ago

Thanks for the quick turnaround. Closing this now

piiswrong commented 7 years ago

@glingyan Could you either a) add a check and fail when req == addto, or b) write to an intermediate buffer and then add the buffer to the target?
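The two options can be illustrated with plain Python lists. This is only a sketch of the idea, not the MKL operator code: `write_only_backward` is a hypothetical stand-in for a kernel that can only overwrite its destination, and the doubling inside it is a placeholder for the real gradient computation.

```python
def write_only_backward(out_grad, dst):
    """Stand-in for a kernel that can only overwrite its destination."""
    for i, g in enumerate(out_grad):
        dst[i] = 2.0 * g  # placeholder for the real gradient computation


def backward(out_grad, target, req):
    """Apply the kernel under a 'write' or 'add' gradient request."""
    if req == "write":
        write_only_backward(out_grad, target)
    elif req == "add":
        # Option (b): compute into a temporary buffer, then accumulate.
        buffer = [0.0] * len(target)
        write_only_backward(out_grad, buffer)
        for i, value in enumerate(buffer):
            target[i] += value
    else:
        # Option (a): fail loudly instead of silently producing wrong gradients.
        raise ValueError("unsupported gradient request: %r" % (req,))


grad = [1.0, 2.0, 3.0]
backward([0.5, 0.5, 0.5], grad, req="add")
print(grad)
```

Calling the write-only kernel directly on `grad` would discard the existing values, which matches the symptom in the failing deconvolution test: the result did not include the previously stored gradient. The buffer costs one extra allocation per call but keeps the kernel itself unchanged.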

szha commented 7 years ago

@glingyan @zhenlinluo

glingyan commented 7 years ago

The patch is ready and under testing.

glingyan commented 7 years ago

Please check this patch: https://github.com/dmlc/mxnet/pull/5144

szha commented 7 years ago

LGTM. @piiswrong

szha commented 7 years ago

BTW, @glingyan could you elaborate on how I can use these MKL-specific changes with the full MKL release? Thanks.

glingyan commented 7 years ago

@szha, the latest MKL operators for MXNet only support the small MklML package. Full MKL has its own release cycle and is currently behind MklML.

szha commented 7 years ago

Thanks, @glingyan. Will OSX be supported? Asking as I wasn't able to get it to build on OSX with the current version.

glingyan commented 7 years ago

MKLML does not support macOS yet. Intel® Integrated Performance Primitives (Intel® IPP) and Intel® Math Kernel Library (Intel® MKL) for Mac OS X are only available as components of Intel® Parallel Studio XE (IPS) for Mac OS X. All release updates from the standalone Intel IPP and Intel MKL products will also be available through the bundled IPS.

apache / mxnet

MKL2017 Problem #5085
