Zha0q1 opened this issue 4 years ago
The problem here is that we can never know what BLAS libraries are linked into other Python packages that our users happen to use. NumPy may be a popular one, but other packages may also link different BLAS implementations. The same issue applies to the OpenMP dependency.
Did you statically link BLAS implementation? You can refer to https://github.com/apache/incubator-mxnet/pull/17751. For dynamic linking, the loader will load the implementation referenced by the first package that is loaded and not load other BLAS implementations if the BLAS symbols are already available. Maybe importing mxnet first triggers importing numpy, causing the numpy BLAS implementation to be loaded?
Is this from the pip package or building from source? For pip, we statically link blas/lapacke and strip all symbols that are not exposed. This means that the blas/lapacke functions are not in the symbol table.
This is for building from source. Are we planning to support large tensors by default in the future? Currently it's controlled by a flag, and I think the distributed whls don't have it turned on. (I just joined the effort to support large tensors for v1.x and 2.0.)
Right, it was a dynamic link. Is there a way to link it statically when building from source?
@Zha0q1 When building the pypi wheel, openBLAS is linked statically and is inside libmxnet.so. So for dynamic linking we can lay out guidelines for users saying that if they want to build from source, they need to reinstall numpy with openblas built from source.
We can change the build-from-source instructions for mxnet 2.0 to always build openBLAS from source and statically link it. @leezu @szha what do you think?
Alright, I tried the static build scripts under /tools. This is what I got:
ubuntu@ip-172-31-12-248:~/incubator-mxnet$ ldd lib/libmxnet.so
linux-vdso.so.1 (0x00007fff525c4000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb1116b7000)
libopenblas.so.0 => /home/ubuntu/incubator-mxnet/lib/libopenblas.so.0 (0x00007fb10efc4000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb10edbc000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb10eb9d000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fb10e96e000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb10e5e5000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb10e247000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb10e02f000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb10dc3e000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb1192fc000)
libgfortran.so.4 => /home/ubuntu/incubator-mxnet/lib/libgfortran.so.4 (0x00007fb10d85f000)
libquadmath.so.0 => /home/ubuntu/incubator-mxnet/lib/libquadmath.so.0 (0x00007fb10d61f000)
It looks like libmxnet is still dependent on libopenblas. Am I doing this wrong, or should I configure the script in some way? @leezu
As discussed offline, we should avoid requiring all users to build openblas from source, especially as distributions start to provide libopenblas64_.so versions with an OpenBLAS ILP64 build and a 64_ suffix in symbol names. Fedora does this per https://bugzilla.redhat.com/show_bug.cgi?id=1295965. Debian testing / Ubuntu 20.04 provide libopenblas64.so without the symbol suffix per https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878121 and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910924. I initiated a discussion on the debian-science mailing list about providing the same package in Debian; it's also tracked in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=967951
To avoid the symbol version clashes reported in these issues, we must either require the suffixed symbol names or link statically. We can work with cmake upstream on a standard approach to detecting 64-bit suffixed openblas as part of https://cmake.org/cmake/help/latest/module/FindBLAS.html and backport the upstream recommended approach to our https://github.com/apache/incubator-mxnet/tree/master/cmake/upstream
To support older distributions that do not ship libopenblas64_.so, we should include instructions for building libopenblas64_.so from source in our guide.
Static linking is a great solution for the pip distribution build, but I don't think we should settle on a solution that requires static linking: a better solution is already established as best practice (i.e. symbol suffixes), and requiring static linking would restrict the versatility of the MXNet source build and make it harder to package MXNet as part of Debian or other distributions.
@leezu Thanks for initiating the discussion on debian-science. My solution was to have build instructions that tell everyone to build openblas from source (static linking would also avoid the openBLAS installation issues that cloud users hit with DLAMI on EC2). That way the build instructions are identical regardless of the distro. Most customers use pip install, so for them there is no difference in experience.
Since distros will start to provide _64 versions of the libraries, we won't need to worry about those then.
My solution was to have build instructions for everyone to build openblas from source
It's great to have those instructions where required. But we should also look at the trend and build a solution that aligns with the work of others.
Since distros will start to provide _64 versions of libraries we don't need to worry about those then.
We may need to add support for those; it's not automatic. For example, https://cmake.org/cmake/help/latest/module/FindBLAS.html does not currently distinguish the 64-bit and 32-bit versions, as the maintainers may not be aware of the use-case.
This doesn't all have to happen at once. We just need to ensure that our approach remains compatible with what we eventually would like to achieve.
Makes sense
is this from pip package or building from source? for pip, we statically link blas/lapacke and stripped all symbols that are not exposed. this means that the blas/lapacke functions are not in the symbol table.
@Zha0q1 you may want to check if these steps are still there in new build scripts and add as necessary
If you want to make this work via static linking, you may need to use -Bsymbolic:
When creating a shared library, bind references to global symbols to the
definition within the shared library, if any. Normally, it is possible
for a program linked against a shared library to override the definition
within the shared library.
This option is only meaningful on ELF platforms which support shared libraries.
See also https://stackoverflow.com/questions/7216973/is-there-a-downside-to-using-bsymbolic-functions
But instead we can just work around the issue by adopting the 64_ suffix convention.
Discussed offline and we need to re-enable the symbol whitelisting in cmake builds. They previously existed in make-based builds: https://github.com/apache/incubator-mxnet/blob/v1.6.x/make/config/libmxnet.ver https://github.com/apache/incubator-mxnet/blob/v1.6.x/make/config/libmxnet.sym
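A minimal sketch of how such a version script hides non-whitelisted symbols (toy names; the real whitelist lives in the libmxnet.ver/libmxnet.sym files linked above):

```shell
# Only symbols listed under "global:" stay in the dynamic symbol table;
# everything matched by "local: *" is hidden from dynamic linking.
cat > demo.c <<'EOF'
int public_api(void)      { return 1; }
int internal_helper(void) { return 2; }
EOF
cat > demo.ver <<'EOF'
{ global: public_api; local: *; };
EOF
gcc -shared -fPIC -o libdemo.so demo.c -Wl,--version-script=demo.ver
nm -D libdemo.so | grep public_api                       # exported
nm -D libdemo.so | grep internal_helper || echo "internal_helper is hidden"
```

Applied to libmxnet.so, this is what keeps statically linked blas/lapacke symbols out of the dynamic symbol table, so they cannot clash with NumPy's BLAS.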
Thanks! @szha @leezu I just looked into adding a suffix to the cblas/lapack calls within our code base. There are ~50 places in a handful of different files that would all need to change, which makes me wonder if this change is too heavy. Also: 1. openblas is used in TVM, so they would need to make the same change to be consistent with us. 2. As @leezu mentioned, finding 64-bit openblas is not well supported, and distros don't have a unified solution for it. For our purpose of supporting large tensors in 2.0, if we could link 64-bit openblas statically for our soon-to-come release, I would still think that's the best solution.
BTW int32 BLAS will still work for tensors whose total size is > INT_MAX (2**31 - 1); it's only when a single dim is > INT_MAX that we must use int64 BLAS, because the function declarations use int32 for strides.
There are ~50 places in a handful of different files that all need to be changed. This makes me think if this change is too heavy
Why not follow the approach in numpy and define a macro at a central place? https://github.com/numpy/numpy/pull/15069/files#diff-4538717a0246e7d9363e76a2e3fc835e
finding 64 openblas is not well supported
You can edit https://github.com/apache/incubator-mxnet/blob/master/cmake/Modules/FindOpenBLAS.cmake. You will need to edit this file in any case, even if you choose not to rely on symbol suffixes, because ILP64 openblas would typically be named libopenblas64 (or libopenblas64_ with suffixes). The mid-term approach is to work with upstream so that we can eventually delete FindOpenBLAS.cmake and just rely on the upstream cmake feature.
For our purpose of supporting large tensors in 2.0, if we could link 64 openblas statically for our soon-to-come release, I would still think that’s the best solution.
There are two different cases: One for the staticbuild for pip, where static linkage will be preferred. For the general CMakeLists.txt, why should we restrict our users to static linkage?
Also 1. openblas is used in TVM, so they will need to make the same change to be consistent with us
How do you ensure this consistency without symbol suffix? Does TVM support ILP64? If TVM expects standard 32bit blas but you link ILP64 blas with the same symbol names, wouldn't there be issues?
BTW int 32 blas will work for tensors with size > INT_MAX (2**31 - 1 ), it's when a dim is > INT_MAX we must use int 64 blas, because in the function declarations they use int 32 for stride
If dim > INT_MAX is supported by MXNet, our BLAS operators need to either return the correct result or raise an error. @access2rohit told me that his PR making large tensor support the default would just silently return wrong results in this case.
I created a simple PR; I am trying to learn about the scope of this change as I go.
@sandeep-krishnamurthy would you take a look?
For reference, here are the things I tried:
One more thing: currently the static build script defaults to dynamically linking openblas. We would need to make it a static link; I can share my changes (build openblas 32 => 64; dynamic link => static link) once we have a final consensus on how to support openblas 64 going forward.
Also, opencv would fail to link with this error:
../staticdeps/lib/libopenblas.a(sormbr.o): In function `sormbr_':
sormbr.f:(.text+0x3a6): undefined reference to `_gfortran_concat_string'
sormbr.f:(.text+0x562): undefined reference to `_gfortran_concat_string'
sormbr.f:(.text+0x5d1): undefined reference to `_gfortran_concat_string'
sormbr.f:(.text+0x621): undefined reference to `_gfortran_concat_string'
../staticdeps/lib/libopenblas.a(sormlq.o): In function `sormlq_':
sormlq.f:(.text+0x30b): undefined reference to `_gfortran_concat_string'
../staticdeps/lib/libopenblas.a(sormlq.o):sormlq.f:(.text+0x842): more undefined references to `_gfortran_concat_string' follow
../staticdeps/lib/libopenblas.a(slartg.o): In function `slartg_':
slartg.f:(.text+0xb1): undefined reference to `_gfortran_pow_r4_i8'
../staticdeps/lib/libopenblas.a(dlartg.o): In function `dlartg_':
dlartg.f:(.text+0xb3): undefined reference to `_gfortran_pow_r8_i8'
../staticdeps/lib/libopenblas.a(slaed6.o): In function `slaed6_':
slaed6.f:(.text+0x109): undefined reference to `_gfortran_pow_r4_i8'
../staticdeps/lib/libopenblas.a(dlaed6.o): In function `dlaed6_':
dlaed6.f:(.text+0x10e): undefined reference to `_gfortran_pow_r8_i8'
collect2: error: ld returned 1 exit status
For my PoC build I had to turn off opencv for the mxnet build.
One more thing is that currently the static build script defaults to dynamically linking openblas.
There's a step that disable/remove shared object for dynamic linking.
Yes, I did that on my machine and was able to build the wheel. I think we should add that step back too.
Both NumPy and MXNet are dependent on BLAS. When they are linked to different BLAS libraries there will be a name clashing issue. Effectively, only functions from NumPy's BLAS will be used by both NumPy and MXNet.
According to https://stackoverflow.com/questions/47891872/how-to-use-non-mkl-numpy-under-anaconda, anaconda will by default ship MKL-dependent NumPy. This is also the case on DLAMI 30:
I first ran into this issue while working on adding large tensor support to linalg operators, where I used a manually built int 64 version of Open BLAS. I used this simple test script:
On my machine (DLAMI 30, Ubuntu 18) Open BLAS is built with DYNAMIC_ARCH=1 DYNAMIC_OLDER=1 USE_OPENMP=1 INTERFACE64=1 BINARY=64 NO_SHARED=0 NO_LAPACK=0 and MXNet is built with USE_BLAS="open" USE_INT64_TENSOR_SIZE=1. Numpy is pre-installed with MKL optimization.
Ideally, linalg.syrk would invoke Open BLAS cblas_ssyrk (my build, 64-bit int), but in reality, because of the name clashing, MKL cblas_ssyrk (32-bit int) is called instead. This will lead to:
Using GDB we can see we are indeed calling into MKL cblas_ssyrk:
Reinstalling NumPy and linking it to my Open BLAS build resolved the issue for me.
So the problem with this name clashing is that regardless of which BLAS we build MXNet with, we are stuck with the BLAS that NumPy is configured to use. In most cases, such as supporting large tensors (i.e. 64-bit indexing), it's fine to configure both to use the same BLAS lib (int64 Open BLAS in my case), but I wonder if there is a special use case where we actually want different BLAS libraries for NumPy and MXNet?
My guess would be "no", but we should still be aware of this issue, as well as the extra step of reconfiguring NumPy and MXNet to use the correct BLAS, and we probably need to note this in our build tutorial.
The same issue is also noted on NumPy's build-from-source page: https://numpy.org/devdocs/user/building.html. Open BLAS supports building with function prefixes and suffixes, and NumPy can recognize suffixes like "64_" when built with 64-bit int support. We could potentially do something similar, adding a suffix/prefix to BLAS functions and using those names in MXNet, but again, it's much easier to link NumPy and MXNet to the same BLAS.