JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

ABI conflicts due to 64-bit libopenblas.so #4923

Closed stevengj closed 9 years ago

stevengj commented 10 years ago

Julia compiles OpenBLAS to libopenblas.so. This may be a problem for calling libraries that link to a system libopenblas.so, because the runtime linker may substitute Julia's version instead. The problem is that Julia's version is compiled with a 64-bit interface, which is not the default, and so if an external library calls it expecting a 32-bit interface, a crash may result.

We encountered what appears to have been this problem on @alanedelman's machine (julia.mit.edu). He recently started experiencing crashes in PyPlot.plot that, with the help of valgrind, I tracked down to apparently:

==17855== Use of uninitialised value of size 8
==17855==    at 0xA8B6890: dgemm_beta_NEHALEM (in /home/edelman/julia/usr/lib/libopenblas.so)
==17855==    by 0xA082D72: dgemm_nn (in /home/edelman/julia/usr/lib/libopenblas.so)
==17855==    by 0x9F558C8: cblas_dgemm (in /home/edelman/julia/usr/lib/libopenblas.so)
==17855==    by 0x16430CA5: dotblas_matrixproduct (_dotblas.c:809)
==17855==    by 0x14BAB5D4: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)

Apparently, Matplotlib is calling OpenBLAS (via NumPy: _dotblas.c is a NumPy file) with the 32-bit interface, but is getting linked at runtime into Julia's openblas library, which is compiled with a 64-bit interface. Recompiling Julia and openblas with USE_BLAS64=0 worked around the problem, but it would be better to avoid the conflict.
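
To make the integer-width mismatch concrete, here is a minimal sketch in Python using ctypes. It models only the by-reference (Fortran-style) case: the caller stores a 32-bit integer, and the callee reads 64 bits from the same address. The struct and the function name are made up for illustration and are not part of OpenBLAS or NumPy.

```python
import ctypes


class Int32Pair(ctypes.Structure):
    # Two adjacent 32-bit integers: the argument an LP64 caller writes,
    # followed by whatever happens to sit next to it in memory.
    _fields_ = [("n", ctypes.c_int32), ("neighbor", ctypes.c_int32)]


def read_as_int64(n, neighbor):
    """Return what an ILP64 callee would read through a pointer to a 32-bit n."""
    pair = Int32Pair(n, neighbor)
    # Reinterpret the address of the 32-bit field as a pointer to a 64-bit
    # integer, which is effectively what the ABI mismatch does.
    wide = ctypes.cast(ctypes.pointer(pair), ctypes.POINTER(ctypes.c_int64))
    return wide.contents.value


if __name__ == "__main__":
    seen = read_as_int64(5, 1)
    print(f"caller passed n=5, callee read n={seen}")
```

Whenever the neighboring bytes are nonzero, the callee sees a dimension very different from 5, which would be consistent with the uninitialised-value report from valgrind above.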

Can we just rename our libopenblas.so file to avoid any possible conflict in the runtime linker?

stevengj commented 10 years ago

Or is the problem worse than that? If I ccall a library that in turn calls cblas_dgemm, will it end up calling our OpenBLAS version even if it was originally linked to a completely different BLAS library (e.g. libblas.so)?

In that case, we might have to hack OpenBLAS to rename its exported functions (e.g. cblas_dgemm64 etcetera) since we changed the ABI.

stevengj commented 10 years ago

@xianyi, is there a way to tell OpenBLAS to add a prefix or suffix (e.g. 64) to all its exported symbols, to make it possible to link both the 32-bit and 64-bit ABI in the same executable?

stevengj commented 10 years ago

See also numpy/numpy#3916

StefanKarpinski commented 10 years ago

Wouldn't it make more sense to put the 64 after the cblas part – as in cblas64_dgemm?

ViralBShah commented 10 years ago

The ideal solution would be to have a separate 64-bit ABI and build both 32 and 64 bit versions in the same library.

staticfloat commented 10 years ago

@ViralBShah that is actually the best solution here. That would be wonderful!

stevengj commented 10 years ago

@StefanKarpinski, note that there is a Fortran dgemm ABI too, and to avoid conflicts you need to rename both C and Fortran (unless we are not linking the Fortran ABI?). But I don't think it really matters what the name looks like, as long as there is a simple deterministic rule and it can be implemented as automatically as possible in the openblas source code. I was just thinking that a suffix might be easier to automate for both C and Fortran ABIs.

ViralBShah commented 10 years ago

Currently we use the fortran abi only.

ViralBShah commented 10 years ago

I wonder if we can somehow make matplotlib use its own blas. While we may be able to do all sorts of gymnastics with openblas, it will be difficult to do the same with vendor provided BLAS.

stevengj commented 10 years ago

The other alternative would be to recompile our own numpy, but that makes installing PyCall much more of a pain.

stevengj commented 10 years ago

@ViralBShah, does MKL provide the 64-bit ABI?

StefanKarpinski commented 10 years ago

The other alternative would be to recompile our own numpy, but that makes installing PyCall much more of a pain.

Not to mention that the amount of stuff we compile ourselves is getting slightly ridiculous. But it's hard to avoid.

ViralBShah commented 10 years ago

I believe MKL does have a 64-bit ABI - but not 100% sure. @andreasnoackjensen ?

ViralBShah commented 10 years ago

I thought about recompiling numpy, but that is even more inconvenient.

andreasnoack commented 10 years ago

I am not sure what exactly ABI means, but MKL has 32-bit integers in the lp64 libraries and 64-bit integers in the ilp64 libraries. The symbols have the same names.

xianyi commented 10 years ago

It's easy to add a prefix or suffix for 64-bit (ilp64) ABI. However, I am not sure OpenBLAS can support lp64 and ilp64 in one binary.

For MKL, you need to link the application with a different interface layer library, e.g. libmkl_intel_lp64.so or libmkl_intel_ilp64.so.

stevengj commented 10 years ago

I think adding a prefix or suffix to the ilp64 OpenBLAS interface would already be a big help. @xianyi, assuming that such a suffix were added, what would go wrong if both the 32- and 64-bit OpenBLAS libraries were linked simultaneously?

stevengj commented 10 years ago

@xianyi, is there any hope of progress on this?

ViralBShah commented 10 years ago

Would naming the 64-bit version something like libopenblas_ilp64.so solve this?

stevengj commented 10 years ago

@ViralBShah, I'm not sure, but I doubt it. If you load two shared libraries which export the same symbol (e.g. dgemm_) but with a different ABI, aren't there still going to be conflicts even if the libraries have different names? (At least if the libraries are loaded with RTLD_GLOBAL?)
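
The concern can be illustrated with a toy model in Python. This is not the real dynamic linker. It assumes a simplified rule in which a lookup searches the global scope in load order before the caller's own dependencies, and the first match wins. The class, library labels, and ABI tags are all hypothetical.

```python
class ToyLinker:
    """Toy model of global-scope symbol lookup: the first definition loaded wins."""

    def __init__(self):
        # (library name, {symbol: ABI label}) pairs, in load order.
        self.global_scope = []

    def load(self, name, symbols, global_visibility=True):
        lib = (name, dict(symbols))
        if global_visibility:
            self.global_scope.append(lib)
        return lib

    def resolve(self, symbol, own_deps=()):
        # Assumed order: global scope first, then the caller's own dependencies.
        for name, symbols in list(self.global_scope) + list(own_deps):
            if symbol in symbols:
                return name, symbols[symbol]
        raise LookupError(symbol)


if __name__ == "__main__":
    linker = ToyLinker()
    linker.load("libopenblas_ilp64.so", {"dgemm_": "ILP64"})
    system_blas = linker.load("libblas.so", {"dgemm_": "LP64"}, global_visibility=False)
    # A library linked against libblas.so still gets the ILP64 definition.
    print(linker.resolve("dgemm_", own_deps=[system_blas]))
```

Under this model, renaming the file changes nothing, because the lookup key is the symbol name. Only a different symbol name, or keeping the ILP64 library out of the global scope, avoids the clash.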

ViralBShah commented 10 years ago

The easier thing then for now would be to just use the 32-bit version of openblas with IJulia, if that works.

stevengj commented 10 years ago

Nassty 32-bit limits, we hates them forever!

Anyway, it's not just IJulia, since PyCall and Numpy can be used anywhere. And 32-bit vector size limits cause their own problems.

tkelman commented 10 years ago

+1, we ran into a very similar issue here too: https://github.com/JuliaOpt/Ipopt.jl/issues/1#issuecomment-37556837 This was an instance of the following (here dcopy_ instead of cblas_dgemm, but same idea):

If I ccall a library that in turn calls cblas_dgemm, will it end up calling our OpenBLAS version even if it was originally linked to a completely different BLAS library (e.g. libblas.so)?

Any library linking to any LP64 shared library Blas/Lapack/etc can run into name shadowing and segfaults or other incorrect behavior when ccalled by Julia due to ILP64 openblas. Statically linking LP64 reference blas/lapack into the dependency library solves the issue in the case of Ipopt, but is not an ideal solution.

Since #5291 was merged there are now a handful of calls to cblas functions, otherwise I was going to suggest we could try co-opting OpenBlas' mechanism for handling trailing underscores as a potential way of attempting this.

stevengj commented 10 years ago

We could always just patch the openblas source with a global s/cblas/jl_cblas/ substitution.

mlubin commented 10 years ago

Isn't this mostly a visibility issue? Can we restrict openblas's symbols to not be visible to dlopen'ed shared libraries?

stevengj commented 10 years ago

@mlubin, you're right that this would be the simplest option, if we can do it on all the relevant platforms. Is there a magic linker flag for this (analogous to RTLD_LOCAL in dlopen)?

pao commented 10 years ago

Looks like if you want to avoid patching you need to use a linker script.

stevengj commented 10 years ago

@pao, it looks like the link you found is for preventing some symbols from being exported at all. That's not what we want here. We want to export symbols to Julia, but not re-export them to other shared libraries.

pao commented 10 years ago

Ah, sorry, I didn't catch that subtlety from @mlubin's comment; I see it now. I'm not deep enough on visibility to know whether that's even possible, though a cursory search didn't turn anything up.

tkelman commented 10 years ago

This looks relevant. Some combination of -Bsymbolic or -Bsymbolic-functions, and/or creating wrappers ourselves with a prefix/suffix on the function names may work, if OpenBlas' build system can't easily be made to do what we want.

We could always just patch the openblas source with a global s/cblas/jl_cblas/ substitution.

If only. OpenBlas is full of preprocessor defines (and some perl? https://github.com/xianyi/OpenBLAS/blob/develop/exports/gensymbol looks promising) that obfuscate function naming (in particular NAME and CNAME), and I'm having a hard time figuring out how it works.

Aha, looks like https://github.com/xianyi/OpenBLAS/blob/develop/Makefile.system#L776 is where NAME and CNAME are getting set.

stevengj commented 10 years ago

I was just discussing this with @jiahao, and the easiest solution seems to be to use the GNU objcopy utility to just add a prefix jl_ to all exported symbols from libopenblas after it is compiled.

That way, we don't need to hack the OpenBLAS source.

The only downside is that using Julia with MKL might be a pain, but there are probably ways around this with a @blas macro to generate the ccalls with or without the prefix.

tkelman commented 10 years ago

:+1: that sounds easier - would renaming dgemm_ to jl_dgemm_ then cause a problem for any Lapack routines that try to call dgemm_, or would objcopy fix the reference too?

there are probably ways around this with a @blas macro to generate the ccalls with or without the prefix

See also #2167 (will be needed if anyone ever wants to use MKL on Windows or Intel Fortran anywhere) and #4290. It's not very well-documented, but Matlab lets you switch Blas and Lapack via environment variables. Putting that runtime-switching (or startup, or sysimg-build-time) abstraction layer into Julia will be useful as long as it doesn't introduce a noticeable performance penalty.

jiahao commented 10 years ago

I don't think runtime switching will be possible since MKL's libraries would not have the jl_ prefixes that the compiled Julia wrapper functions would be conditioned to expect.

stevengj commented 10 years ago

@tkelman, objcopy will rename both the exported symbols and all references to them within the object code, so BLAS calls within LAPACK should not be a problem since libopenblas includes both LAPACK and BLAS. (I just double-checked this. It pretty much has to work this way, of course, for symbol renaming to be usable.)

tkelman commented 10 years ago

Another likely instance of this: https://github.com/lruthotto/MUMPS.jl/issues/2

Having to rebuild the system image to change Julia's Blas backend wouldn't be too bad.

The number of library wrapper packages that depend on Blas and Lapack is already pretty high and will continue to grow. Most of these libraries should have decent facilities for configuring them with different Blas libraries at compile time. It'll be good to standardize an approach for providing a Blas library from Julia to library packages, for performance, reducing duplication, and cross-platform uniformity (no such thing as "system Blas" on Windows, and we want our library packages to work on Windows don't we?). The LP64 vs ILP64 issue is part of this, and it may require providing an LP64 Blas library with the default function names for packages, while Julia itself uses an ILP64 Blas with prefixed function names.

ufechner7 commented 10 years ago

So is "using the GNU objcopy utility to just add a prefix jl_ to all exported symbols from libopenblas after it is compiled" a good solution? If so, what needs to be done to make it work?

stevengj commented 10 years ago

@ufechner7, two things: (a) the Makefile needs to be updated to make the requisite call to objcopy and (b) base/linalg/blas.jl etcetera need to be updated to change all ccalls to BLAS and LAPACK routines with e.g. a @blascall(...) macro that prepends the jl_ prefix to the symbol (we want a macro here so that it can be easily changed, e.g. to call MKL).
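
The naming rule behind step (b) is small enough to sketch. The Python below is only an illustration of the idea, not the proposed Julia macro. The function name blas_symbol and its parameters are hypothetical, and the trailing underscore reflects the usual gfortran-style convention for Fortran symbols.

```python
def blas_symbol(routine, prefix="jl_", fortran=True):
    """Build the exported symbol name for a BLAS/LAPACK routine.

    prefix:  "jl_" for a renamed OpenBLAS, "" for an unprefixed library such as MKL.
    fortran: append the trailing underscore used by the Fortran interface.
    """
    base = routine.lower() + ("_" if fortran else "")
    return prefix + base


if __name__ == "__main__":
    print(blas_symbol("ddot"))                       # Fortran interface, prefixed
    print(blas_symbol("cblas_ddot", fortran=False))  # C interface, prefixed
    print(blas_symbol("ddot", prefix=""))            # unprefixed backend
```

Keeping the prefix in one place is the point of the macro: switching backends then means changing a single setting instead of editing every call site.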

tkelman commented 9 years ago

Did anyone start experimenting with this to see how feasible it is?

stevengj commented 9 years ago

Not yet, as far as I know. I only tried out objcopy to verify that it could rename the symbols.

tkelman commented 9 years ago

I tried cp libopenblas.so libjlopenblas.so; objcopy --prefix-symbols=jl_ libjlopenblas.so then

julia> n = 5; a = rand(n); b = rand(n); inca = 1; incb = 1;
julia> y = ccall((:jl_ddot_, "libjlopenblas"), Float64, (Ptr{Int}, Ptr{Float64}, Ptr{Int}, Ptr{Float64}, Ptr{Int}), &n, a, &inca, b, &incb)
ERROR: ccall: could not find function jl_ddot_ in library libjlopenblas
 in anonymous at no file

So something's missing. nm libjlopenblas.so | grep ddot does return the expected

00000000000f47b0 T jl_cblas_ddot
00000000000f3aa0 T jl_ddot_
0000000000f29200 T jl_ddot_k_ATOM
0000000000c1ce00 T jl_ddot_k_BARCELONA
0000000000dbf200 T jl_ddot_k_BOBCAT
0000000001299e00 T jl_ddot_k_BULLDOZER
00000000004b4e00 T jl_ddot_k_CORE2
0000000000703a00 T jl_ddot_k_DUNNINGTON
0000000001013400 T jl_ddot_k_NANO
0000000000808c00 T jl_ddot_k_NEHALEM
0000000000932800 T jl_ddot_k_OPTERON
0000000000aa7e00 T jl_ddot_k_OPTERON_SSE3
00000000005de000 T jl_ddot_k_PENRYN
00000000013d8000 T jl_ddot_k_PILEDRIVER
0000000000320600 T jl_ddot_k_PRESCOTT
000000000113c200 T jl_ddot_k_SANDYBRIDGE

so maybe some additional steps are required?

tkelman commented 9 years ago

On Windows there is a not-that-hard option that works, by making the following change to this file in OpenBLAS

--- exports/gensymbol   2014-08-11 20:56:12.014049400 -0700
+++ exports/jl_gensymbol        2014-08-11 20:55:22.566221200 -0700
@@ -2833,22 +2833,22 @@
     foreach $objs (@underscore_objs) {
        $uppercase = $objs;
        $uppercase =~ tr/[a-z]/[A-Z]/;
-       print "\t$objs=$objs","_  \@", $count, "\n";
+       print "\tjl_$objs=$objs","_  \@", $count, "\n";
        $count ++;
-       print "\t",$objs, "_=$objs","_  \@", $count, "\n";
+       print "\tjl_",$objs, "_=$objs","_  \@", $count, "\n";
        $count ++;
-       print "\t$uppercase=$objs", "_  \@", $count, "\n";
+       print "\tjl_$uppercase=$objs", "_  \@", $count, "\n";
        $count ++;
     }

     foreach $objs (@need_2underscore_objs) {
        $uppercase = $objs;
        $uppercase =~ tr/[a-z]/[A-Z]/;
-       print "\t$objs=$objs","__  \@", $count, "\n";
+       print "\tjl_$objs=$objs","__  \@", $count, "\n";
        $count ++;
-       print "\t",$objs, "__=$objs","__  \@", $count, "\n";
+       print "\tjl_",$objs, "__=$objs","__  \@", $count, "\n";
        $count ++;
-       print "\t$uppercase=$objs", "__  \@", $count, "\n";
+       print "\tjl_$uppercase=$objs", "__  \@", $count, "\n";
        $count ++;
     }

@@ -2857,15 +2857,15 @@

        $uppercase = $objs;
        $uppercase =~ tr/[a-z]/[A-Z]/;
-       print "\t",$objs, "_=$objs","_  \@", $count, "\n";
+       print "\tjl_",$objs, "_=$objs","_  \@", $count, "\n";
        $count ++;
-       print "\t$uppercase=$objs", "_  \@", $count, "\n";
+       print "\tjl_$uppercase=$objs", "_  \@", $count, "\n";
        $count ++;
     }

     foreach $objs (@no_underscore_objs) {
-       print "\t",$objs,"=$objs","  \@", $count, "\n";
+       print "\tjl_",$objs,"=$objs","  \@", $count, "\n";
        $count ++;
     }

My ccall test with a prefixed jl_ddot_ works with a libopenblas.dll generated based on this modification.
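
For readers who do not read Perl, the effect of the first patched loop can be sketched in Python. A Windows .def export line has the form exported=internal @ordinal, so only the exported alias gets the prefix while the internal name is left alone. The function name prefixed_exports is made up for this sketch, and it covers only the single-underscore case.

```python
def prefixed_exports(routines, prefix="jl_", start=1):
    """Emit .def export lines that alias each routine under prefixed names."""
    lines = []
    ordinal = start
    for name in routines:
        target = name + "_"  # internal symbol, left unchanged
        # Three exported aliases per routine, mirroring the patched Perl loop:
        # bare name, name with trailing underscore, and uppercase name.
        for alias in (name, name + "_", name.upper()):
            lines.append(f"\t{prefix}{alias}={target}  @{ordinal}")
            ordinal += 1
    return lines


if __name__ == "__main__":
    print("\n".join(prefixed_exports(["ddot", "dgemm"])))
```

Because the internal names are untouched, calls between routines inside the DLL keep working without any renaming of references.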

stevengj commented 9 years ago

@tkelman, does that rename all of the functions or just the generated ones? e.g. we also want to rename functions like openblas_set_num_threads.

tkelman commented 9 years ago

@stevengj it renames everything that's exported from the dll, including openblas_set_num_threads.

tkelman commented 9 years ago

I figured out why objcopy isn't working. It evidently can't rename dynamic symbols, unless it has learned some new tricks since http://sourceware-org.1504.n7.nabble.com/objcopy-redefine-sym-on-dynsym-section-td119610.html

[tkelman@static-host lib]$ objdump -T libjlopenblas.so | grep ddot
0000000000dbf200 g    DF .text  0000000000000591  Base        ddot_k_BOBCAT
0000000000aa7e00 g    DF .text  0000000000000569  Base        ddot_k_OPTERON_SSE3
00000000005de000 g    DF .text  0000000000000559  Base        ddot_k_PENRYN
0000000001299e00 g    DF .text  0000000000000341  Base        ddot_k_BULLDOZER
00000000004b4e00 g    DF .text  0000000000000551  Base        ddot_k_CORE2
0000000000f29200 g    DF .text  0000000000000325  Base        ddot_k_ATOM
0000000000320600 g    DF .text  0000000000000581  Base        ddot_k_PRESCOTT
0000000000808c00 g    DF .text  0000000000000591  Base        ddot_k_NEHALEM
0000000000703a00 g    DF .text  0000000000000529  Base        ddot_k_DUNNINGTON
00000000000f3aa0 g    DF .text  000000000000005d  Base        ddot_
0000000000932800 g    DF .text  000000000000056e  Base        ddot_k_OPTERON
0000000001013400 g    DF .text  0000000000000591  Base        ddot_k_NANO
00000000013d8000 g    DF .text  0000000000000341  Base        ddot_k_PILEDRIVER
0000000000c1ce00 g    DF .text  0000000000000591  Base        ddot_k_BARCELONA
000000000113c200 g    DF .text  0000000000000591  Base        ddot_k_SANDYBRIDGE
00000000000f47b0 g    DF .text  0000000000000055  Base        cblas_ddot

Anyone have any suggestions? I tried messing with some of the CNAME definitions in OpenBLAS' Makefile.system but that led to several undefined symbols, a bad mix of renamed and not-renamed functions. @xianyi any suggestions for applying a global prefix (or suffix, if that's easier) to all functions exported from the openblas shared library, on Linux and OSX?
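
A quick way to spot this kind of mismatch is to check the dynamic symbol listing for names that lack the prefix. The Python sketch below assumes the symbol name is the last whitespace-separated field on each line, which holds for the nm and objdump -T output shown in this thread. The helper names are hypothetical.

```python
def symbol_names(listing):
    """Extract symbol names (the last field of each non-empty line)."""
    return {line.split()[-1] for line in listing.splitlines() if line.strip()}


def missing_prefix(dynamic_listing, prefix="jl_"):
    """Return dynamic symbols that were not renamed with the prefix."""
    return sorted(n for n in symbol_names(dynamic_listing) if not n.startswith(prefix))


# Sample lines copied from the nm and objdump -T output above.
NM_OUTPUT = """\
00000000000f47b0 T jl_cblas_ddot
00000000000f3aa0 T jl_ddot_
"""
OBJDUMP_T_OUTPUT = """\
00000000000f3aa0 g    DF .text  000000000000005d  Base        ddot_
00000000000f47b0 g    DF .text  0000000000000055  Base        cblas_ddot
"""

if __name__ == "__main__":
    print("regular symbol table:", sorted(symbol_names(NM_OUTPUT)))
    print("dynamic symbols without prefix:", missing_prefix(OBJDUMP_T_OUTPUT))
```

On these samples the regular symbol table carries the jl_ names while the dynamic table still has the originals, which matches the failed ccall lookup.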

nbecker commented 9 years ago

Would loading with RTLD_LOCAL help?

stevengj commented 9 years ago

@nbecker, this was discussed above. One obstacle to RTLD_LOCAL seems to be that we are not loading OpenBLAS with dlopen, but are rather linking libopenblas.so directly to the julia executable, so we have to figure out if there is a corresponding linker flag. I did a quick search through the man page of GNU ld and didn't see anything, but it has a zillion options and it's possible I missed something.

(This problem mainly seems to show up on GNU/Linux, so I think we need something that works with GNU ld.)

staticfloat commented 9 years ago

@stevengj I believe we are dlopen'ing OpenBLAS, albeit implicitly just by ccall'ing some BLAS function and passing Base.libblas_name in as the library handle. We could probably explicitly dlopen libblas in an initialization function somewhere and pass in RTLD_LOCAL if we want to.
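
As a rough illustration of an explicit load with RTLD_LOCAL, here is a Python sketch using ctypes, which wraps dlopen on Unix-like systems. It is a stand-in for what an initialization function in Julia might do, not Julia code. The library name libm.so.6 is an assumption (glibc's math library, standing in for libopenblas), and load_private is a hypothetical helper.

```python
import ctypes


def load_private(path):
    """dlopen a shared library with RTLD_LOCAL; return None if it cannot be loaded."""
    try:
        # RTLD_LOCAL keeps the library's symbols out of the global scope, so
        # libraries loaded later do not resolve their references against it.
        return ctypes.CDLL(path, mode=ctypes.RTLD_LOCAL)
    except OSError:
        return None


if __name__ == "__main__":
    libm = load_private("libm.so.6")  # assumed name; differs on non-glibc systems
    if libm is not None:
        libm.cos.restype = ctypes.c_double
        libm.cos.argtypes = [ctypes.c_double]
        print(libm.cos(0.0))
```

Symbols are still reachable through the returned handle, which is the property wanted here: Julia could call into OpenBLAS, but other shared libraries would not pick up its symbols by name.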

tkelman commented 9 years ago

It's definitely been a problem in packages on Macs too. There's an osx.def file in OpenBLAS which gets created by the same Perl script gensymbol and then linked using -Wl,-exported_symbols_list,osx.def. I can't really test that, though, as I don't have a Mac.

tkelman commented 9 years ago

I think I found a solution. We can't use objcopy on the shared library because it can't rename dynamic symbols, but I just tried it on the static library right before linking the .so and that works. It passes my jl_ddot_ test, anyway:

--- exports/Makefile-old        2014-08-20 20:47:51.000000000 -0700
+++ exports/Makefile    2014-08-20 20:45:16.000000000 -0700
@@ -103,7 +103,10 @@

 so : ../$(LIBSONAME)

-../$(LIBSONAME) : ../$(LIBNAME) linktest.c
+../$(LIBSONAME) : ../$(LIBNAME) linktest.c aix.def
+       rm -f prefix.def
+       for i in `cat aix.def`; do echo "$$i jl_$$i" >> prefix.def; done
+       objcopy --redefine-syms prefix.def ../$(LIBNAME)
 ifneq ($(C_COMPILER), LSB)
        $(CC) $(CFLAGS) $(LDFLAGS) -shared -o ../$(LIBSONAME) \
        -Wl,--whole-archive ../$(LIBNAME) -Wl,--no-whole-archive \

I'm using aix.def as a simple list of exported symbols. objcopy --prefix-symbols=jl_ ../$(LIBNAME) went a little overboard, renaming everything in the static library (including things from libm, pthreads, libgfortran, etc.), so it couldn't link the .so from it afterwards.
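
The shell loop in the patch builds the mapping file that objcopy --redefine-syms reads, with one "old new" pair per line. The same step is sketched below in Python for clarity. The function name redefine_map is hypothetical, and the input is assumed to be a whitespace-separated list of exported symbols like aix.def.

```python
def redefine_map(exported, prefix="jl_"):
    """Turn a whitespace-separated symbol list into 'old new' lines for objcopy."""
    return "".join(f"{sym} {prefix}{sym}\n" for sym in exported.split())


if __name__ == "__main__":
    sample = "ddot_\ncblas_ddot\nopenblas_set_num_threads\n"
    print(redefine_map(sample), end="")
```

Listing the symbols explicitly is what keeps this safer than --prefix-symbols: only names that appear in the export list are renamed, so references to libm, pthreads, or libgfortran stay resolvable at link time.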

stevengj commented 9 years ago

Great!