Closed szha closed 7 years ago
will check
For /usr/local, you can check the setting in make/config.mk:
MKLML_ROOT=/home/lingyan/mklml
Because servers are usually multi-user, it is also very common for several users to share /usr/local.
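As a rough illustration of the point about shared, non-writable prefixes, the sketch below picks an install prefix based on write access. The function name `choose_mklml_root` and the `~/mklml` fallback are hypothetical and are not part of the MXNet build scripts.

```python
import os


def choose_mklml_root(preferred="/usr/local", fallback="~/mklml"):
    """Return `preferred` if the current user can write to it, else `fallback`.

    Both the function name and the fallback path are illustrative assumptions.
    """
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        return preferred
    # Fall back to a per-user directory that does not need root.
    return os.path.expanduser(fallback)


if __name__ == "__main__":
    # The printed line has the same form as the MKLML_ROOT setting above.
    print("MKLML_ROOT=" + choose_mklml_root())
```

On a typical multi-user server a non-root user would get the per-user path, which could then be set as MKLML_ROOT in make/config.mk.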
In the whole build process, prepare_mkl.sh is the only part that requires root access. Having to run as root just for this part doesn't seem to be the best idea. That said, let's focus on making the python API work first.
With the latest master, I don't get a segmentation fault:
(mxnet_env) lingyan@aocl-server:~/intel_mxnet/mxnet$ nosetests --verbose ./tests/python/unittest/
test_attr.test_attr_basic ... ERROR
test_attr.test_operator ... ok
test_attr.test_list_attr ... FAIL
test_attr.test_attr_dict ... ERROR
test_executor.test_bind ... ok
test_executor.test_dot ... ok
test_executor.test_reshape ... ok
test_infer_shape.test_mlp2_infer_shape ... ok
test_infer_shape.test_mlp2_infer_error ... ok
test_infer_shape.test_backward_infer ... ERROR
test_init.test_default_init ... FAIL
test_init.test_variable_init ... ERROR
test_init.test_aux_init ... ok
test_io.test_MNISTIter ... --2017-02-21 12:27:02-- http://data.mxnet.io/mxnet/data/mnist.zip
Resolving child-prc.intel.com (child-prc.intel.com)... 10.239.4.80
Connecting to child-prc.intel.com (child-prc.intel.com)|10.239.4.80|:913... connected.
Proxy request sent, awaiting response... 200 OK
Length: 11595270 (11M) [application/zip]
But if I run the test file directly, I get a memory error:
erver:~/intel_mxnet/mxnet/tests/python/unittest$ python test_init.py
Error in `python': free(): invalid pointer: 0x0000000000fdd730
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fc4c03fb7e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7fc4c0403e0a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fc4c040798c]
/home/lingyan/mxnet_env/local/lib/python2.7/site-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7storage16CPUDeviceStorage4FreeEPv+0x18)[0x7fc41cdc9020]
/home/lingyan/mxnet_env/local/lib/python2.7/site-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7storage19NaiveStorageManagerINS0_16CPUDeviceStorageEE4FreeEPvm+0x20)[0x7fc41cdca8d0]
/home/lingyan/mxnet_env/local/lib/python2.7/site-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet11StorageImpl4FreeENS_7Storage6HandleE+0x90)[0x7fc41cdc831a]
On my side, I saw that the libmxnet.so links against the following .so's:
Dynamic section at offset 0x11c69f0 contains 32 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libiomp5.so]
0x0000000000000001 (NEEDED) Shared library: [libmklml_intel.so]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [ld-linux-x86-64.so.2]
Note that it's linked against libmklml_intel.so instead of libmklml_gnu.so, even though I'm using gcc on Ubuntu. Not sure if it helps.
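To make this kind of check repeatable, here is a minimal sketch that extracts the NEEDED entries from a dynamic-section listing in the format shown above (as printed by `readelf -d`) and reports which MKLML flavor appears. The helper names `needed_libraries` and `mkl_flavor` are hypothetical, and the sample text is an excerpt of the listing above.

```python
import re

# Matches lines such as:
#   0x0000000000000001 (NEEDED) Shared library: [libiomp5.so]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library:\s+\[([^\]]+)\]")

SAMPLE = """\
0x0000000000000001 (NEEDED) Shared library: [libiomp5.so]
0x0000000000000001 (NEEDED) Shared library: [libmklml_intel.so]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
"""


def needed_libraries(dynamic_section_text):
    """Return the shared-library names listed as NEEDED, in order."""
    return NEEDED_RE.findall(dynamic_section_text)


def mkl_flavor(libraries):
    """Return "intel", "gnu", or None depending on which MKLML library is listed."""
    if "libmklml_intel.so" in libraries:
        return "intel"
    if "libmklml_gnu.so" in libraries:
        return "gnu"
    return None


if __name__ == "__main__":
    libs = needed_libraries(SAMPLE)
    print(libs)
    print(mkl_flavor(libs))
```

Running the same two functions on the listings from an MKL build and a non-MKL build makes the difference (libiomp5.so and libmklml_intel.so versus libgomp.so.1) easy to spot.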
As for the link issue, it should be OK. I see the same on my side:
ldd libmxnet.so
linux-vdso.so.1 => (0x00007fffcd155000)
libmklml_intel.so => not found
libiomp5.so => not found
@zhenlinluo
Thanks @glingyan . Is this passing all python unit tests on your end? It's still failing on my side, on test test_module.test_save_load
I'm seeing this error:
nnvm/include/nnvm/tuple.h:449: Check failed: dim == ndim() (4 vs. 3) dimension do not match target dimension 4 vs 3
I'm using this commit https://github.com/glingyan/mxnet/tree/54f8c5f0b89aaf513e4a8b9ee17eefafdec27cf0
I get the same result as you, but I also get the same result with MKL disabled, so it should not be an MKL issue.
I just verified this claim but it doesn't seem to be the case. I built a version with MKL disabled and all tests passed.
nosetests --verbose tests/python/unittest/
libdc1394 error: Failed to initialize libdc1394
test_attr.test_attr_basic ... ok
test_attr.test_operator ... ok
test_attr.test_list_attr ... ok
test_attr.test_attr_dict ... ok
test_executor.test_bind ... ok
test_executor.test_dot ... ok
test_executor.test_reshape ... ok
test_infer_shape.test_mlp2_infer_shape ... ok
test_infer_shape.test_mlp2_infer_error ... ok
test_infer_shape.test_backward_infer ... ok
test_init.test_default_init ... ok
test_init.test_variable_init ... ok
test_init.test_aux_init ... ok
test_io.test_MNISTIter ... --2017-02-21 06:41:47-- http://data.mxnet.io/mxnet/data/mnist.zip
Resolving data.mxnet.io (data.mxnet.io)... 54.208.175.7
Connecting to data.mxnet.io (data.mxnet.io)|54.208.175.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11595270 (11M) [application/zip]
Saving to: ‘data/mnist.zip’
100%[=================================================================================================================================================================================================================================================>] 11,595,270 72.9MB/s in 0.2s
2017-02-21 06:41:47 (72.9 MB/s) - ‘data/mnist.zip’ saved [11595270/11595270]
Archive: mnist.zip
inflating: t10k-images-idx3-ubyte
inflating: t10k-labels-idx1-ubyte
inflating: train-images-idx3-ubyte
inflating: train-labels-idx1-ubyte
[06:41:49] src/io/iter_mnist.cc:91: MNISTIter: load 60000 images, shuffle=1, shape=(100,784)
ok
test_io.test_Cifar10Rec ... ok
test_io.test_NDArrayIter ... ok
single key-value pair push & pull ... ok
test init ... ok
list key-value pair push & pull ... ok
aggregate value on muliple devices ... ok
updater ... ok
test_kvstore.test_get_type ... ok
test_model_parallel.test_chain ... ok
test_module.test_module_layout ... ok
test_module.test_save_load ... ok
test_module.test_module_reshape ... ok
test_multi_device_exec.test_ctx_group ... ok
test_ndarray.test_ndarray_setitem ... ok
test_ndarray.test_ndarray_elementwise ... ok
test_ndarray.test_ndarray_elementwisesum ... ok
test_ndarray.test_ndarray_negate ... ok
test_ndarray.test_ndarray_choose ... ok
test_ndarray.test_ndarray_fill ... ok
test_ndarray.test_ndarray_onehot ... [06:41:50] src/ndarray/./ndarray_function-inl.h:68: The operator onehot_encode is deprecated; use one_hot instead.
[06:41:50] src/ndarray/./ndarray_function-inl.h:68: The operator onehot_encode is deprecated; use one_hot instead.
[06:41:50] src/ndarray/./ndarray_function-inl.h:68: The operator onehot_encode is deprecated; use one_hot instead.
ok
test_ndarray.test_ndarray_copy ... ok
test_ndarray.test_ndarray_scalar ... ok
test_ndarray.test_ndarray_pickle ... ok
test_ndarray.test_ndarray_saveload ... ok
test_ndarray.test_ndarray_slice ... ok
test_ndarray.test_ndarray_crop ... ok
test_ndarray.test_ndarray_concatenate ... ok
test_ndarray.test_clip ... ok
...
And the rebuilt libmxnet.so links with these:
Dynamic section at offset 0x12ea9f0 contains 30 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgomp.so.1]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [ld-linux-x86-64.so.2]
0x000000000000000c (INIT) 0x39320
That is strange. I will double-check.
Confirmed: MXNet itself has no problem. I will submit a fix so that all test cases pass.
(mxnet_env) lingyan@aocl-server:~/intel_mxnet/mxnet/python$ nosetests --verbose ../tests/python/unittest/
test_attr.test_attr_basic ... ok
test_attr.test_operator ... ok
test_attr.test_list_attr ... ok
test_attr.test_attr_dict ... ok
test_executor.test_bind ... ok
test_executor.test_dot ... ok
test_executor.test_reshape ... ok
test_infer_shape.test_mlp2_infer_shape ... ok
test_infer_shape.test_mlp2_infer_error ... ok
test_infer_shape.test_backward_infer ... ok
test_init.test_default_init ... ok
test_init.test_variable_init ... ok
test_init.test_aux_init ... ok
test_io.test_MNISTIter ... [16:06:18] src/io/iter_mnist.cc:91: MNISTIter: load 60000 images, shuffle=1, shape=(100,784)
ok
test_io.test_Cifar10Rec ... ok
test_io.test_NDArrayIter ... ok
single key-value pair push & pull ... ok
test init ... ok
list key-value pair push & pull ... ok
aggregate value on muliple devices ... ok
updater ... ok
test_kvstore.test_get_type ... ok
test_model_parallel.test_chain ... ok
test_module.test_module_layout ... ok
test_module.test_save_load ... ok
test_module.test_module_reshape ... ok
test_multi_device_exec.test_ctx_group ... ok
test_ndarray.test_ndarray_setitem ... ok
test_ndarray.test_ndarray_elementwise ... ok
test_ndarray.test_ndarray_elementwisesum ... ok
test_ndarray.test_ndarray_negate ... ok
test_ndarray.test_ndarray_choose ... ok
test_ndarray.test_ndarray_fill ... ok
test_ndarray.test_ndarray_onehot ... [16:06:21] src/ndarray/./ndarray_function-inl.h:68: The operator onehot_encode is deprecated; use one_hot instead.
[16:06:21] src/ndarray/./ndarray_function-inl.h:68: The operator onehot_encode is deprecated; use one_hot instead.
[16:06:21] src/ndarray/./ndarray_function-inl.h:68: The operator onehot_encode is deprecated; use one_hot instead.
ok
test_ndarray.test_ndarray_copy ... ok
test_ndarray.test_ndarray_scalar ... ok
test_ndarray.test_ndarray_pickle ... ok
test_ndarray.test_ndarray_saveload ... ok
test_ndarray.test_ndarray_slice ... ok
test_ndarray.test_ndarray_crop ... ok
test_ndarray.test_ndarray_concatenate ... ok
test_ndarray.test_clip ... ok
test_ndarray.test_dot ... ok
test_ndarray.test_reduce ... ok
test_ndarray.test_broadcast ... ok
test_ndarray.test_broadcast_binary ... ok
test_ndarray.test_arange ... ok
test_ndarray.test_order ... ok
test_ndarray.test_ndarray_equal ... ok
test_ndarray.test_ndarray_not_equal ... ok
test_ndarray.test_ndarray_greater ... ok
test_ndarray.test_ndarray_greater_equal ... ok
test_ndarray.test_ndarray_lesser ... ok
test_ndarray.test_ndarray_lesser_equal ... ok
test_ndarray.test_take ... ok
test_operator.test_elementwise_sum ... ok
test_operator.test_concat ... ok
test_operator.test_slice_channel ... ok
test_operator.test_regression ... ok
test_operator.test_softmax ... ok
test_operator.test_python_op ... ok
test_operator.test_swapaxes ... ok
test_operator.test_scalarop ... ok
test_operator.test_scalar_pow ... ok
test_operator.test_symbol_pow ... ok
test_operator.test_pow_fn ... ok
test_operator.test_binary_logic ... ok
test_operator.test_embedding ... ok
test_operator.test_binary_op_duplicate_input ... ok
test_operator.test_sign ... ok
test_operator.test_round_ceil_floor ... ok
test_operator.test_rsqrt_cos_sin ... ok
test_operator.test_maximum_minimum ... ok
test_operator.test_maximum_minimum_scalar ... ok
test_operator.test_abs ... ok
test_operator.test_deconvolution ... MKL Build:20170209
ok
test_operator.test_nearest_upsampling ... ok
test_operator.test_batchnorm_training ... ok
test_operator.test_convolution_grouping ... ok
test_operator.test_binary_op ... ok
test_operator.test_broadcast_binary_op ... ok
test_operator.test_run_convolution_dilated_impulse_response ... ok
test_operator.test_convolution_dilated_impulse_response ...
[16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] 
src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization [16:06:30] src/operator/convolution.cc:40: MKLConvolutionOp Skip MKL optimization ok test_operator.test_reshape ... [16:06:30] src/operator/tensor/./matrix_op-inl.h:158: Using target_shape will be deprecated. ok test_operator.test_reduce ... ok test_operator.test_broadcast ... ok test_operator.test_transpose ... ok test_operator.test_expand_dims ... ok test_operator.test_crop ... ok test_operator.test_slice_axis ... ok test_operator.test_flip ... ok test_operator.test_stn ... ok test_operator.test_dot ... ok test_operator.test_batch_dot ... ok test_operator.test_correlation ... ok test_operator.test_support_vector_machine_l1_svm ... ok test_operator.test_support_vector_machine_l2_svm ... ok test_operator.test_roipooling ... ok test_operator.test_pad ... 
ok
test_operator.test_instance_normalization ... ok
test_operator.test_l2_normalization ... ok
test_operator.test_sequence_mask ... ok
test_operator.test_mathematical ... ok
test_operator.test_special_functions_using_scipy ... ok
test_operator.test_clip ... ok
test_operator.test_init ... ok
test_operator.test_order ... ok
test_operator.test_blockgrad ... ok
test_operator.test_take ... ok
test_operator.test_grid_generator ... ok
test_operator.test_bilinear_sampler ... ok
test_operator.test_index2d ... ok
test_operator.test_cast ... ok
test_operator.test_repeat ... ok
test_operator.test_tile ... ok
test_operator.test_one_hot ... ok
test_operator.test_where ... ok
test_optimizer.test_lr_wd_mult ... ok
test_optimizer.test_adam ... ok
test_optimizer.test_rms ... ok
test_random.test_random ... ok
test_recordio.test_recordio ... ok
test_recordio.test_indexed_recordio ... ok
test_recordio.test_recordio_pack_label ... ok
test_rnn.test_rnn ... ok
test_rnn.test_lstm ... ok
test_rnn.test_stack ... ok
test_symbol.test_symbol_basic ... ok
test_symbol.test_symbol_compose ... ok
test_symbol.test_symbol_copy ... ok
test_symbol.test_symbol_internal ... ok
test_symbol.test_symbol_pickle ... ok
test_symbol.test_symbol_saveload ... ok
test_symbol.test_symbol_infer_type ... ok
test_symbol.test_symbol_infer_shape ... ok
Test specifying shape information when constructing a variable ... ok
test_symbol.test_load_000800 ... [16:07:01] src/nnvm/legacy_json_util.cc:175: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
ok
test_viz.test_print_summary ... ok
Ran 138 tests in 46.207s
OK
I tested the new change. The deconvolution test is still failing consistently on my side. I tried three times and all failed.
======================================================================
FAIL: test_operator.test_deconvolution
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/home/ubuntu/mxnet/tests/python/unittest/test_operator.py", line 735, in test_deconvolution
pad = (1,1)
File "/home/ubuntu/mxnet/tests/python/unittest/test_operator.py", line 646, in check_deconvolution_forward_backward
assert_almost_equal(out + args_grad_addto_npy[0], args_grad_addto[0].asnumpy(), rtol=1e-4, atol=1e-4)
File "somepath/test_utils.py", line 143, in assert_almost_equal
raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 10214.023994 exceeds tolerance rtol=0.000100, atol=0.000100. Location of maximum error:(0, 0, 0, 4), a=-1.673050, b=-0.322374
a: array([[[[ 13.58003861, -7.95524284, -5.33531814, -0.20733803,
-1.67304976],
[ -4.42065098, 15.15990298, -16.49854232, 4.75020522,...
b: array([[[[ 14.02015686, -7.67520428, -5.28275585, -2.73052788,
-0.32237393],
[ -2.92478704, 14.92308807, -15.42502975, 4.12596035,...
----------------------------------------------------------------------
Ran 138 tests in 49.399s
FAILED (failures=1)
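For readers puzzled by the "Error 10214.023994" figure: it is consistent with the absolute difference divided by the allowed tolerance, |a - b| / (atol + rtol * |b|), evaluated at the reported location. This is an inference from the numbers in the message, not a copy of test_utils.py; the function name `violation` below is hypothetical.

```python
def violation(a, b, rtol=1e-4, atol=1e-4):
    """Ratio of the absolute difference to the allowed tolerance.

    A value above 1 means the pair (a, b) is outside the tolerance.
    The formula is an assumption inferred from the failure message.
    """
    return abs(a - b) / (atol + rtol * abs(b))


if __name__ == "__main__":
    # Values taken from "Location of maximum error" in the failure above.
    ratio = violation(-1.673050, -0.322374)
    print(ratio)
```

With the rounded values printed in the message, the ratio comes out at roughly 10214, so the mismatch is about four orders of magnitude beyond the tolerance rather than a borderline rounding difference.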
Yes, please check my last comment. MKL does not support addto mode, and there is no way to work around it.
How did you get past this case in the previous comment?
Just comment out the following code in check_deconvolution_forward_backward:
args_grad_addto_npy = [np.random.normal(size=s) for s in arg_shapes]
args_grad_addto = [mx.nd.array(ele) for ele in args_grad_addto_npy]
exe = deconv.bind(default_context(), args=args, args_grad=args_grad_addto, grad_req="add")
out = exe.outputs[0].asnumpy()
exe.backward(out_grad)
assert_almost_equal(out + args_grad_addto_npy[0], args_grad_addto[0].asnumpy(), rtol=1e-4, atol=1e-4)
Thanks for the quick turnaround. Closing this now
@glingyan Could you either a) add a check and fail when req == addto, or b) write to an intermediate buffer and then add the buffer to the target?
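The two options can be sketched in plain Python as follows. This is only an illustration of the idea, not the MXNet operator code: `compute` stands in for a kernel that can only overwrite its output, and the function names and the "write"/"add"/"null" request strings are assumptions chosen to mirror the grad_req="add" used in the test.

```python
def fail_on_addto(req):
    """Option (a): refuse requests the kernel cannot honor."""
    if req == "add":
        raise NotImplementedError("this kernel does not support addto mode")


def write_gradient(compute, target, req):
    """Option (b): emulate addto with an intermediate buffer.

    `compute` is a hypothetical stand-in for an overwrite-only kernel;
    it returns a fresh list of floats with the same length as `target`.
    """
    if req == "null":
        return
    if req not in ("write", "add"):
        raise ValueError("unsupported req: %r" % (req,))
    buffer = compute()  # the kernel writes into a temporary buffer
    if len(buffer) != len(target):
        raise ValueError("buffer and target sizes differ")
    if req == "write":
        target[:] = buffer
    else:
        # Accumulate the temporary result into the existing gradient.
        for i, value in enumerate(buffer):
            target[i] += value


if __name__ == "__main__":
    grad = [1.0, 2.0]
    write_gradient(lambda: [0.5, 0.5], grad, "add")
    print(grad)
```

Option (b) costs one extra buffer and one extra pass over the data, but keeps the existing gradient intact, which is what the addto part of the deconvolution test checks.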
@glingyan @zhenlinluo
The patch is ready and under testing.
Please check this patch: https://github.com/dmlc/mxnet/pull/5144
LGTM. @piiswrong
BTW, @glingyan could you elaborate on how I can use these MKL-specific changes with the full MKL release? Thanks.
@szha, the latest MKL operators for MXNet only support the small MklML package. Full MKL has its own release cycle and is currently behind MklML.
Thanks, @glingyan. Will OSX be supported? Asking as I wasn't able to get it to build on OSX with the current version.
MKLML does not support macOS yet. Intel® Integrated Performance Primitives (Intel® IPP) and Intel® Math Kernel Library (Intel® MKL) for Mac OS X are only available as components in the Intel® Parallel Studio XE (IPS) for Mac OS X. All release updates from standalone Intel IPP and Intel MKL products will also be available through bundled IPS.
@glingyan I was doing some work on building a version with MKLML and discovered that mxnet python api doesn't seem to be working with MKL2017_ML turned on. Please make sure the python unit-testing is included going forward. Related prs: #4128 #4237 #4433 #4599 #5000
Environment info
Operating System: Ubuntu 14.04
Compiler: gcc/g++ 4.8.4
Package used (Python/R/Scala/Julia): Python
Installed from source: https://github.com/dmlc/mxnet/tree/0fc54d345e019002b9008abf8290fa15ad072443
Python version and distribution: Python 2.7.6
And what was the thinking behind having prepare_mkl.sh install to /usr/local by default? By default, it's not writable by non-root users. Should users be using sudo to build? https://github.com/dmlc/mxnet/blob/0fc54d345e019002b9008abf8290fa15ad072443/make/config.mk#L82 I'd suggest that you take the auto-install out and ask users to run prepare_mkl.sh separately (in MKL_README), so that users don't have to run as root just to build.