JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Crash caused by calling dlopen in find_library #55801

Open wgmitchener opened 4 days ago

wgmitchener commented 4 days ago

The find_library function in libdl.jl makes two calls to dlopen while searching for a library. This causes problems when looking for the ROCm library libamdhip64.so on Fedora 40 Linux. ROCm links with LLVM, and calling dlopen on it twice causes LLVM to crash and report an inconsistency in its registered command-line options:

: CommandLine Error: Option 'disassemble' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options

See this thread for where the problem shows up.

The problem is not specific to Julia. It can be reproduced in C by calling dlopen on the library, then dlclose, then dlopen again. See this post.
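For reference, the open, close, reopen sequence described above can be sketched from Python via ctypes, which calls the same C functions. This is a minimal sketch, not the C program from the linked post: the helper name open_close_open is hypothetical, and it assumes a POSIX system where dlopen and dlclose are visible through the process's global symbol table.

```python
import ctypes
import os

# Handle to the running process; on POSIX systems this exposes dlopen/dlclose.
_libc = ctypes.CDLL(None)
_libc.dlopen.restype = ctypes.c_void_p
_libc.dlopen.argtypes = [ctypes.c_char_p, ctypes.c_int]
_libc.dlclose.restype = ctypes.c_int
_libc.dlclose.argtypes = [ctypes.c_void_p]


def open_close_open(path):
    """dlopen, dlclose, then dlopen the same library again (hypothetical helper).

    `path` is a library name or path; None means the main program itself.
    Returns the handle from the second dlopen, which the caller still owns.
    """
    name = None if path is None else os.fsencode(path)

    first = _libc.dlopen(name, os.RTLD_NOW)
    if not first:
        raise OSError(f"first dlopen failed for {path!r}")
    _libc.dlclose(first)

    # If the library left state behind, this is the call that would see it.
    second = _libc.dlopen(name, os.RTLD_NOW)
    if not second:
        raise OSError(f"second dlopen failed for {path!r}")
    return second


# Example call, using the library name from this report:
# open_close_open("libamdhip64.so")
```

With an ordinary library the second dlopen simply returns a handle. With the library from this report, the second dlopen is presumably where the LLVM error appears and the process aborts, so no Python exception would be raised in that case.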

My guess is that, as called in find_library, dlopen followed by dlclose leaves some state from the library in the process's working memory, and this state causes confusion when dlopen is called on the library again.

Can find_library be re-implemented so as not to actually call dlopen and dlclose on the file once it's been found?

In general, programs call find_library and then dlopen. If find_library can change the state of the process so that a later call to dlopen sees leftover state and crashes, it makes sense to me that find_library should be rewritten to not call dlopen at all.
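To illustrate what a search that never calls dlopen might look like, here is a minimal sketch that only checks for files on disk. The function name find_library_file and its arguments are hypothetical, and this is not how Julia's find_library works. It also does not consult the dynamic loader's own search path (for example LD_LIBRARY_PATH or the ld.so cache), and it cannot tell whether a file it finds would actually load.

```python
import os


def find_library_file(names, directories):
    """Return the first existing `directory/name` path, or None.

    Hypothetical helper: it only inspects the filesystem and never calls
    dlopen, so no code from the library runs during the search.
    """
    for directory in directories:
        for name in names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None
```

The trade-off is that a file can exist and still fail to load (wrong architecture, missing dependencies), which is presumably why a loader-based check is attractive in the first place.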

vtjnash commented 3 days ago

We should probably consider deprecating that. There is never a reason for a user to call it before dlopen; the user should just call dlopen instead. This function dates from long before we had LazyLibrary, precompile, and such, when we were experimenting with ways of making ccall work more reliably.

giordano commented 3 days ago

We use find_library in MPI.jl to find the libmpi library and save it as a preference (so that we can invalidate the cache if we need to use a different libmpi): https://github.com/JuliaParallel/MPI.jl/blob/aac9688e6961bc7e3aeeba7600f5e7d0b10596a3/lib/MPIPreferences/src/MPIPreferences.jl#L194 There is no need to dlopen the library after find_library. (Yes, there is a call to identify_abi, which internally calls dlopen, but that is unrelated and beside the point.)

vtjnash commented 3 days ago

That is true; that usage is probably fine, but you might be better served there by calling dlopen + dlpath directly. Notably, though, you wouldn't do that during precompile (since the unsafe modification of preferences while loading would corrupt the cache file), and therefore you typically wouldn't call dlopen directly afterwards either (as it isn't compatible with the already loaded MPI).

wgmitchener commented 2 days ago

I think the underlying issue is that dlopen doesn't just open a file; it also runs functions in the .so file that are marked with a particular attribute. Those functions can apparently make persistent changes to the process state that are not undone by dlclose. In reading the man page for dlopen, I don't see anything that requires libraries to be written so that they can be opened and closed multiple times without error. This is the first time I've encountered a library that generates errors in this way.
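One way to check whether a library is still mapped into the process after dlclose is the RTLD_NOLOAD flag, which the dlopen man page describes as returning a handle only if the object is already resident. The sketch below is a diagnostic aid under stated assumptions, not an explanation of the crash: the helper name is_resident is hypothetical, and it assumes a platform that defines RTLD_NOLOAD (glibc does).

```python
import ctypes
import os

# Handle to the running process; on POSIX systems this exposes dlopen/dlclose.
_libc = ctypes.CDLL(None)
_libc.dlopen.restype = ctypes.c_void_p
_libc.dlopen.argtypes = [ctypes.c_char_p, ctypes.c_int]
_libc.dlclose.restype = ctypes.c_int
_libc.dlclose.argtypes = [ctypes.c_void_p]


def is_resident(path):
    """Return True if the library is already loaded, without loading it.

    Hypothetical helper. RTLD_NOLOAD makes dlopen return NULL when the
    object is not resident instead of loading it.
    """
    name = None if path is None else os.fsencode(path)
    handle = _libc.dlopen(name, os.RTLD_NOW | os.RTLD_NOLOAD)
    if not handle:
        return False
    _libc.dlclose(handle)  # drop the reference the probe itself took
    return True
```

Calling this right after a dlopen and dlclose pair would show whether the library was actually unloaded. If it was, any state that survives must live somewhere else in the process, for instance in another library that the initialization code registered with.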

wgmitchener commented 2 days ago

This may also be related: https://github.com/llvm/llvm-project/issues/47565 It reports a similar set of errors that apparently come from some kind of global symbol clash caused by opening multiple versions of LLVM.

However, that wouldn't explain why the simple C example program triggers the error. Only one version of LLVM is ever involved in that.