Open wgmitchener opened 4 days ago
We probably should consider deprecating that. There is never a reason a user should be calling it before dlopen, as the user should just call dlopen instead. This function is from long before we had LazyLibrary, precompile, and such, when we were experimenting with ways of making ccall work more reliably.
We use find_library in MPI.jl to find the libmpi library, and save it as a preference (to be able to invalidate the cache in case we need to use a different libmpi): https://github.com/JuliaParallel/MPI.jl/blob/aac9688e6961bc7e3aeeba7600f5e7d0b10596a3/lib/MPIPreferences/src/MPIPreferences.jl#L194 No need to dlopen the library after find_library (yes, there's a call to identify_abi which internally calls dlopen, but that's unrelated and beside the point).
That is true, that usage is probably fine, but you might be better served there by calling dlopen + dlpath directly instead. Notably, though, you wouldn't do that during precompile (since you would corrupt the cache file through the unsafe modification of preferences while loading), and therefore you wouldn't typically call dlopen directly afterwards either (as it isn't compatible with the already loaded MPI).
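A minimal sketch of the dlopen + dlpath approach suggested above, assuming the goal is only to resolve a library name to a full path. resolve_library_path is a hypothetical helper name, not part of Libdl or MPI.jl; the Libdl calls themselves (dlopen with throw_error=false, and dlpath) are the standard ones.

```julia
using Libdl

# Hypothetical helper (not part of Libdl): resolve a library name to its
# full path using a single dlopen call instead of find_library.
function resolve_library_path(name::AbstractString)
    # With throw_error=false, dlopen returns `nothing` when loading fails.
    handle = Libdl.dlopen(name; throw_error=false)
    handle === nothing && return nothing
    # The handle is intentionally left open, so no dlclose/dlopen cycle occurs.
    return Libdl.dlpath(handle)
end
```

Because the handle is never closed, the library's initialization code runs at most once in the process. The trade-off is that the library stays loaded, which may not be what every caller wants.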
I think the underlying issue is that dlopen doesn't just open a file, it also runs functions in the .so file that are marked with a particular attribute. Those functions can apparently make persistent changes to the executable state that are not undone by dlclose. In reading the man page for dlopen, I don't see anything that requires libraries to be written so that they can be opened and closed multiple times without error. This is the first time I've encountered a library that generates errors in this way.
This may also be related: https://github.com/llvm/llvm-project/issues/47565 It's a similar bunch of errors that apparently come from some kind of global symbol clash caused by opening multiple versions of LLVM.
However, that wouldn't explain why the simple C example program triggers the error. Only one version of LLVM is ever involved in that.
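For reference, the open / close / reopen sequence used in the C reproduction can also be written against Julia's Libdl. This is only a sketch: open_close_reopen is a hypothetical helper, not code from the thread, and whether the second dlopen misbehaves depends entirely on the library passed in.

```julia
using Libdl

# Hypothetical helper: open, close, then reopen a library, mirroring the
# dlopen / dlclose / dlopen sequence described for the C reproduction.
function open_close_reopen(lib::AbstractString)
    first_handle = Libdl.dlopen(lib)    # runs the library's load-time initializers
    Libdl.dlclose(first_handle)         # may not undo everything they changed
    # For libamdhip64.so on Fedora 40, the thread reports that LLVM crashes
    # when the library is opened a second time.
    second_handle = Libdl.dlopen(lib)
    return second_handle
end
```

For most libraries this simply returns a valid handle, which is why the failure is surprising.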
The find_library function in libdl.jl makes two calls to dlopen while searching for a library. It has been discovered that this causes problems when looking for the ROCm library libamdhip64.so on Fedora 40 Linux. ROCm links with LLVM, and calling dlopen on it twice causes LLVM to crash and report some inconsistency in its settings. See this thread for where the problem shows up.
The problem is not specific to Julia. It can be reproduced in C by calling dlopen on the library, then dlclose, then dlopen again. See this post.
My guess is that, as called in find_library, dlopen followed by dlclose leaves some state from the library in the process's working memory, and this state causes confusion when dlopen is called on the library again.
Can find_library be re-implemented so as not to actually call dlopen and dlclose on the file once it's been found?
In general, programs will call find_library followed by dlopen, so if it's possible for find_library to change the state of the process so that a call to dlopen afterward might see leftover state and crash, it makes sense to me that find_library needs to be rewritten to not call dlopen at all.
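To make the request concrete, here is a rough sketch of a lookup that only inspects the filesystem and never loads the library. find_library_file is a hypothetical name, not an existing Libdl function. The sketch only checks directories passed in explicitly, so it does not reproduce the system loader's search (ld.so cache, LD_LIBRARY_PATH, rpath) or versioned names such as libfoo.so.6.

```julia
using Libdl

# Hypothetical helper (not part of Libdl): locate a library file by checking
# the filesystem only, so no code from the library is ever loaded or run.
function find_library_file(names::AbstractVector{<:AbstractString},
                           dirs::AbstractVector{<:AbstractString})
    for dir in dirs, name in names
        # Try the name as given, then with the platform extension appended
        # (Libdl.dlext is "so", "dylib", or "dll" depending on the platform).
        for candidate in (name, string(name, ".", Libdl.dlext))
            path = joinpath(dir, candidate)
            isfile(path) && return path
        end
    end
    return nothing   # nothing matched in the given directories
end
```

The returned path could then be handed to a single dlopen call, so the library's initializers run only once. The price is that a file-existence check cannot tell whether the file is actually loadable, which the dlopen-based search in find_library does verify.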