JuliaPackaging / BinaryBuilder.jl

Binary Dependency Builder for Julia
https://binarybuilder.org
Other
386 stars 99 forks source link

Conflict between MKL_jll and libgomp in CompilerSupportLibraries_jll #700

Open KristofferC opened 4 years ago

KristofferC commented 4 years ago

Today I was tracking down the test failure observed in https://github.com/JuliaMath/IntelVectorMath.jl/pull/39 when moving to MKL_jll that occurred on a Mac.

After a while, with some help from @giordano, an MWE is:

# dlopen the libgomp inside CompilerSupportLibraries_jll (libgfortran 5). 
using Libdl
dlopen("/Users/kristoffercarlsson/.julia/artifacts/411c17f069dd12b3f79b19e4647949050a58223c/lib/libgomp.1.dylib")

using MKL_jll

gamma1(x) = (x = copy(x);    ccall((:vdTGamma, libmkl_rt), Cvoid, (Int32, Ptr{Float64}, Ptr{Float64}), length(x), x, x); x);
gamma2(x) = (y = similar(x); ccall((:vdTGamma, libmkl_rt), Cvoid, (Int32, Ptr{Float64}, Ptr{Float64}), length(x), x, y); y);
INPUT_2  = rand(1000);

gamma1(INPUT_2)[1:3]
gamma2(INPUT_2)[1:3]

the result from gamma1 and gamma2 should give the same answers, but they don't, as an example

julia> gamma1(INPUT_2)[1:3]
3-element Array{Float64,1}:
 Inf
 Inf
 Inf

julia> gamma2(INPUT_2)[1:3]
3-element Array{Float64,1}:
 16.209046961478283
  6.570129954697695
 41.08264424860374

(This doesn't seem to happen 100% of the time but the majority of the times).

This only happens if the length of the input vector is a certain size, cf

julia> gamma1(INPUT_2[1:602])[1:3] # ok
3-element Array{Float64,1}:
 1.0650782774656853
 1.4718294693439533
 1.1884483595544906

julia> gamma2(INPUT_2[1:602])[1:3] # ok
3-element Array{Float64,1}:
 1.0650782774656853
 1.4718294693439533
 1.1884483595544906

julia> gamma1(INPUT_2[1:603])[1:3] # bad
3-element Array{Float64,1}:
 0.9663910468243027
 0.8856476139287297
 0.9213177809196278

julia> gamma2(INPUT_2[1:603])[1:3] # ok
3-element Array{Float64,1}:
 1.0650782774656853
 1.4718294693439533
 1.1884483595544906

The cutoff seems to be when the length of the vector is 603. My guess is that MKL thinks this is a long enough vector to perhaps enable threading? Also, it is important that libgomp is loaded before MKL_jll for the bug to occur.

The versions used are:

  [e66e0078] CompilerSupportLibraries_jll v0.2.0+1
  [856f044c] MKL_jll v2020.0.166+0

and

julia> versioninfo()
Julia Version 1.4.0-rc2.0
Commit b99ed72c95 (2020-02-24 16:51 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.0.0)
  CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
KristofferC commented 4 years ago

Updated to reflect that the problem is the dlopen of libgomp in CompilerSupportLibraries_jll

staticfloat commented 4 years ago

@tlienart this is almost certainly the issue that you brought up to me recently.

The fundamental issue is that MKL has its own OMP implementation (libiomp) that exports the same symbols as libgomp. If you import CompilerSupportLibraries_jll first, you get libgomp hogging those symbols, while if you load MKL_jll first, then libiomp gets its implementations bound to those symbols.

So one workaround that should work today; call using MKL_jll first. Why won't that break other applications? Because very few other packages are actually using libgomp. We include it in CSL_jll because there are a few (And it's important to provide all the libraries), but it's a little rare for a package to actually use it. We automatically open all libraries within these packages so that dependent packages can open those libraries. While we could prevent CSL_jll from loading libgomp automatically (via the dont_dlopen=true argument) that would break packages actually trying to use libgomp, as there's no way for those dependent packages to find where libgomp is living (from the dynamic linker's perspective; from Julia's perspective of course we can find it, but communicating that to the dynamic linker is the issue).

The basic problem is that we have two libraries that are mutually incompatible; libgomp and libiomp. If you have a single application with dependencies that rely on both, this will never work. However packages like MLJ probably don't really care about libgomp and it's just coming along for the ride via CSL. To fix this, I see a few possibilities:

My favorite solution is the third, but of course it's also the most work. :P @giordano I'm open to suggestions.

tlienart commented 4 years ago

@staticfloat as a heads up, this is still bugging us and using MKL_jll does not fix the issue (see e.g. https://travis-ci.com/github/alan-turing-institute/MLJModels.jl/jobs/320357986). I'm still at loss why MLJ users would see this issue while ScikitLearn.jl would not.

The only way I've found so far to tell our users is to use my fork of 0.5.3+1 (instead of 0.5.3+3) https://github.com/tlienart/OpenSpecFun_jll.jl . Of course this is a shit workaround (and won't work on CI) but I couldn't use a compat bound on the package as the version string does not meet the standards (the +1 is throwing Pkg off)

Maybe here a 4th solution to your list would be to add a way to cap the version on such packages so that I can enforce 0.5.3+1 with which the problem does not happen.

KristofferC commented 4 years ago

I wonder if https://github.com/JuliaPackaging/Yggdrasil/issues/915 is related. It also has to do with libiomp on Mac. Does things work ok for you if you use a system installation of MKL / copy over the libmkl_intel_thread.dylib from the system installation to the artifact one?

tlienart commented 4 years ago

Hey @KristofferC thanks for this. The problem is now mostly with Travis (is the travis log linked to above, useful in narrowing things down?), @staticfloat tried to reproduce this locally and couldn't (thanks for trying!). However the fact that it fails on Travis is still bad as some users may have setups that would look like it. We currently have a workaround with an fork of OpenSpecFun_jll which should work for users in the mean time but it's not ideal of course.

I'd be happy to try telling Travis to use this libmkl_intel_thread.dylib but not sure how that would work?

ViralBShah commented 2 years ago

IMO this causes silent bugs. I would much rather favor correctness and go for the second option as the solution right away. Since such few packages use libgomp, it is ok to get them to manually dlopen it.

ViralBShah commented 2 years ago

@tlienart can you point me to the fork of openspecfun_jll? I would like to see if we can remove the libgomp dependency in that package.

giordano commented 2 years ago

Since such few packages use libgomp

I don't think this assesment is very accurate (and I don't think it was even accurate last year):

% find . -name '*.jl' | xargs grep -l CompilerSupportLibraries_jll | xargs grep -lv expand_gfortran
./R/raptor/build_tarballs.jl
./R/rr/build_tarballs.jl
./R/RadeonProRender/build_tarballs.jl
./U/UCX/build_tarballs.jl
./I/IpoptMKL/build_tarballs.jl
./I/IOAPI/build_tarballs.jl
./I/ITSOL_2/build_tarballs.jl
./N/NL2sol/build_tarballs.jl
./N/NetCDFF/build_tarballs.jl
./N/NOMAD/build_tarballs.jl
./N/nlminb/build_tarballs.jl
./N/normaliz/build_tarballs.jl
./G/glmnet/build_tarballs.jl
./G/Gettext/build_tarballs.jl
./G/gb/build_tarballs.jl
./G/Gingko/build_tarballs.jl
./Z/zfp/build_tarballs.jl
./Z/ZITSOL_1/build_tarballs.jl
./Z/ZPares/build_tarballs.jl
./T/tblis/build_tarballs.jl
./T/Torch/build_tarballs.jl
./T/Tesseract/build_tarballs.jl
./T/TMatrix/build_tarballs.jl
./T/Trilinos/build_tarballs.jl
./S/SCALAPACK/build_tarballs.jl
./S/SCIP/build_tarballs.jl
./S/SDPA/build_tarballs.jl
./S/SuperLU_MT/build_tarballs.jl
./S/SoXResampler/build_tarballs.jl
./S/Sundials/Sundials@5/build_tarballs.jl
./S/Sundials/Sundials32@5/build_tarballs.jl
./S/spglib/build_tarballs.jl
./S/SoapyRTLSDR/build_tarballs.jl
./S/StarPU/build_tarballs.jl
./S/SHTOOLS/build_tarballs.jl
./S/SCIP_PaPILO/build_tarballs.jl
./S/scopehal/build_tarballs.jl
./S/SPRAL/build_tarballs.jl
./S/SLATEC/build_tarballs.jl
./S/SuiteSparse/SSGraphBLAS/build_tarballs.jl
./A/Arpack/build_tarballs.jl
./A/AMReX/build_tarballs.jl
./A/ADIOS2/build_tarballs.jl
./A/AzStorage/build_tarballs.jl
./F/FastTransforms/build_tarballs.jl
./F/FLANN/build_tarballs.jl
./F/FGlT/build_tarballs.jl
./F/FMM3D/build_tarballs.jl
./F/finufft/build_tarballs.jl
./O/openPMD_api/build_tarballs.jl
./O/OpenAL/build_tarballs.jl
./O/OpenSpecFun/build_tarballs.jl
./O/OpenBLAS/common.jl
./O/ODEInterface/build_tarballs.jl
./O/OpenMPI/build_tarballs.jl
./O/OpenLSTO/build_tarballs.jl
./H/HOHQMesh/build_tarballs.jl
./H/HHsuite/build_tarballs.jl
./H/HiGHS/build_tarballs.jl
./M/MPICH/build_tarballs.jl
./M/Modflow6/build_tarballs.jl
./M/MPItrampoline/build_tarballs.jl
./M/MMseqs2/build_tarballs.jl
./M/muparser/build_tarballs.jl
./M/msolve/build_tarballs.jl
./M/MUMPS/MUMPS_seq_MKL/build_tarballs.jl
./M/MUMPS/MUMPS_seq@5/build_tarballs.jl
./M/MUMPS/MUMPS@5/build_tarballs.jl
./M/MUMPS/MUMPS_seq@4/build_tarballs.jl
./C/CovidSim/build_tarballs.jl
./C/CUTENSOR/build_tarballs.jl
./C/Coin-OR/BiCePS/build_tarballs.jl
./C/Coin-OR/Cgl/build_tarballs.jl
./C/Coin-OR/SYMPHONY/build_tarballs.jl
./C/Coin-OR/CSDP/build_tarballs.jl
./C/Coin-OR/Ipopt/build_tarballs.jl
./C/Coin-OR/CoinUtils/build_tarballs.jl
./C/Coin-OR/Cbc/build_tarballs.jl
./C/Coin-OR/Osi/build_tarballs.jl
./C/Coin-OR/CHiPPS_BLIS/build_tarballs.jl
./C/Coin-OR/MibS/build_tarballs.jl
./C/Coin-OR/Bonmin/build_tarballs.jl
./C/Coin-OR/SHOT/build_tarballs.jl
./C/Coin-OR/ALPS/build_tarballs.jl
./C/Coin-OR/Clp/build_tarballs.jl
./C/CvxCompress/build_tarballs.jl
./C/Chuffed/build_tarballs.jl
./C/CeresSolver/build_tarballs.jl
./C/CUTEst/build_tarballs.jl
./C/cilantro/build_tarballs.jl
./C/COSMA/build_tarballs.jl
./D/DASKR/build_tarballs.jl
./D/difmap/build_tarballs.jl
./D/Darknet/common.jl
./D/Dierckx/build_tarballs.jl
./V/ViennaRNA/build_tarballs.jl
./V/VMEC/build_tarballs.jl
./Q/qr_mumps/build_tarballs.jl
./Q/Qt/build_tarballs.jl
./Q/QuantReg/build_tarballs.jl
./Q/Qt5Base/build_tarballs.jl
./Q/QCDNUM/build_tarballs.jl
./Q/Qt6Base/build_tarballs.jl
./Q/qwtw/build_tarballs.jl
./Q/QuantumEspresso/build_tarballs.jl
./X/xfoil_light/build_tarballs.jl
./X/XGBoost/build_tarballs.jl
./X/Xyce/build_tarballs.jl
./B/Birch_Standard/build_tarballs.jl
./B/beefalo/build_tarballs.jl
./B/basiclu/build_tarballs.jl
./B/blis/build_tarballs.jl
./K/Kokkos/build_tarballs.jl
./L/libpolymake_julia/build_tarballs.jl
./L/libnabo/build_tarballs.jl
./L/LAPACK/build_tarballs.jl
./L/Libxc/build_tarballs.jl
./L/libgraphicsmagic/build_tarballs.jl
./L/libsharp2/build_tarballs.jl
./L/LibBirch/build_tarballs.jl
./L/LibRaw/build_tarballs.jl
./L/LAMMPS/build_tarballs.jl
./L/libsvm/build_tarballs.jl
./L/LoopTools/build_tarballs.jl
./L/LibAMVW/build_tarballs.jl
./P/polymake/build_tarballs.jl
./P/PCL/build_tarballs.jl
./P/PETSc/build_tarballs.jl
./P/primecount/build_tarballs.jl
./P/PGPLOT/build_tarballs.jl
./W/wannier90/build_tarballs.jl
./W/WaveFD/build_tarballs.jl
./W/Wigxjpf/build_tarballs.jl

I could find at least 133 recipes which reference CompilerSupportLibraries_jll which don't expand the libgfortran version: the main libraries for which we have to pull CSL are libgfortran and libgomp, if it isn't one, it's the other. 133 isn't that few packages and having to manually dlopen libraries isn't something easy to do.

ViralBShah commented 2 years ago

But do they all need to dlopen libomp? It is definitely not a good option, but loading MKL.jl after SpecialFunctions.jl does not just error but silently gives wrong answers non-deterministically.

giordano commented 2 years ago

They probably all dynamically link to libgomp, so yes, they do need it. As I said, it's either libgfortran or libgomp.

ViralBShah commented 2 years ago

I added a note to MKL.jl for now about this: https://github.com/JuliaLinearAlgebra/MKL.jl/blob/master/README.md#usage

tlienart commented 2 years ago

openspecfun_jll

not sure if this is still relevant: https://github.com/tlienart/OpenSpecFun_jll.jl?organization=tlienart&organization=tlienart