JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

ccall(:foo) doesn't pick up LD_PRELOAD overrides #53747

Open green-nsk opened 5 months ago

green-nsk commented 5 months ago

We're using a custom network stack library that is loaded via the LD_PRELOAD mechanism and overrides certain function calls (socket(), recv(), setsockopt(), and others). For some reason, starting with julia-1.10, ccall(:socket) no longer picks up the LD_PRELOAD version of the call.

We found a workaround: calling ccall((:socket, ""), ...) works with our LD_PRELOAD network stack. We'd like to understand what changed, how the two ccall() forms differ, and how we can avoid hitting this in future versions.

Repro:

$ cat socket.c
__attribute__((visibility("default")))
int
socket(int domain, int type, int protocol) {
    return 42;
}

$ gcc -shared -o socket.so -fPIC socket.c
$ LD_PRELOAD=./socket.so julia-1.9.3 -e 'println(ccall(:socket, Cint, (Cint, Cint, Cint), 0, 0, 0))'
42
$ LD_PRELOAD=./socket.so julia-1.10.2 -e 'println(ccall(:socket, Cint, (Cint, Cint, Cint), 0, 0, 0))'
-1
$ LD_PRELOAD=./socket.so julia-1.10.2 -e 'println(ccall((:socket, ""), Cint, (Cint, Cint, Cint), 0, 0, 0))'
42
$ LD_PRELOAD=./socket.so julia-1.10.2 -e '
    using Libdl
    socket_ptr = dlopen("") do h ; dlsym(h, :socket) end
    println(ccall(socket_ptr, Cint, (Cint, Cint, Cint), 0, 0, 0))'
42

Julia downloaded from https://julialang-s3.julialang.org/bin/linux/x64/1.10/julia-1.10.2-linux-x86_64.tar.gz

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)
Environment:
  JULIA_PKG_DEVDIR = /home/sfokin/code/

gbaraldi commented 5 months ago

While I haven't looked too deep into this, I imagine it has something to do with RTLD_DEEPBIND, which we use in many places. @staticfloat might have a better idea.

green-nsk commented 5 months ago

@staticfloat or anyone else, is there more insight on what's happening?

At the very least, I'd like to understand whether this is a bug/regression or by design.

maleadt commented 5 months ago

Bisected to 82c89c680e8326da92412a082073e5e4044fd14f from https://github.com/JuliaLang/julia/pull/50162; cc @topolarity.

gbaraldi commented 5 months ago

So the reason this happens is that we look up symbols in the libraries first, before we look them up in the main executable. That behaves somewhat like RTLD_DEEPBIND, which has the side effect of making symbol interposition like this not work. The reason for that is to allow multiple julias to be loaded in the same process. I do wonder if we could try to restore the interposer behaviour a bit by looking up non-Julia symbols in the executable and Julia symbols in the library.
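
A minimal C sketch of the two lookup orders described above, assuming glibc on Linux (older glibc versions need -ldl when linking). It is not Julia's actual implementation; the function names lookup_library_first and lookup_global_first are hypothetical, and "libc.so.6" in the usage note is an assumed library name.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Library-first order: search one library (and its dependencies), and fall
 * back to the global scope only if the symbol is not found there. An
 * LD_PRELOAD override is never consulted when the library defines the
 * symbol itself. */
void *lookup_library_first(const char *libname, const char *sym) {
    assert(libname != NULL && sym != NULL);
    void *lib = dlopen(libname, RTLD_LAZY); /* kept open so the address stays valid */
    void *addr = lib ? dlsym(lib, sym) : NULL;
    if (addr == NULL) {
        void *global = dlopen(NULL, RTLD_LAZY);
        addr = global ? dlsym(global, sym) : NULL;
    }
    return addr;
}

/* Global-first order: dlopen(NULL) yields a handle for the main program;
 * dlsym on it searches the executable and then the libraries loaded at
 * startup, where LD_PRELOAD libraries precede libc. */
void *lookup_global_first(const char *sym) {
    assert(sym != NULL);
    void *global = dlopen(NULL, RTLD_LAZY);
    return global ? dlsym(global, sym) : NULL;
}
```

Under LD_PRELOAD=./socket.so, lookup_global_first("socket") would be expected to return the preloaded definition, while lookup_library_first("libc.so.6", "socket") would return libc's own. The global-first form corresponds to the dlopen("") workaround from the original report.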

topolarity commented 5 months ago

It's worth mentioning that (as @gbaraldi points out), the new behavior is similar to the (pre-existing) behavior from the C side which is due to RTLD_DEEPBIND:

$ cat mylib.c
int socket(int domain, int type, int protocol);

__attribute__((visibility("default")))
int socket2(int domain, int type, int protocol) {
    return socket(domain, type, protocol);
}
$ gcc -shared -o socket.so -fPIC socket.c
$ gcc -shared -o mylib.so -fPIC mylib.c
$ LD_PRELOAD=./socket.so julia-1.10 -e 'println(ccall((:socket2, "./mylib.so"), Cint, (Cint, Cint, Cint), 0, 0, 0))'
-1

I didn't think about LD_PRELOAD specifically when we made this change, but I think the need for an explicit opt-in to the interposition via ccall((:symbol, ""), ...) is probably a good thing despite the unintended breakage.

vchuravy commented 5 months ago

This might break system profilers that use LD_PRELOAD?

As an example, in MPI.jl we use the pattern of ccall(:symbol) so that MPI profilers can hook these symbols explicitly https://github.com/JuliaParallel/MPI.jl/pull/451 / https://github.com/JuliaParallel/MPI.jl/pull/450

We can of course change MPI.jl to use the ccall((:symbol, ""), ...) syntax...
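
For context, LD_PRELOAD-based profilers typically define a function with the same name as the one they hook and forward to the real one via RTLD_NEXT. The following is a minimal sketch of that pattern, assuming glibc on Linux and using socket() from this thread rather than an MPI function; the counter socket_calls is a hypothetical name.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/socket.h>

/* Number of intercepted calls; a real profiler would record more detail. */
static unsigned long socket_calls = 0;

/* Same name and signature as libc's socket(), so callers that resolve the
 * symbol through the global scope bind to this definition first. */
int socket(int domain, int type, int protocol) {
    /* RTLD_NEXT finds the next definition after this object in the search
     * order, which is normally the real libc socket(). */
    int (*real_socket)(int, int, int) =
        (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");
    socket_calls++;
    fprintf(stderr, "socket() call #%lu intercepted\n", socket_calls);
    return real_socket ? real_socket(domain, type, protocol) : -1;
}
```

Built with gcc -shared -fPIC and loaded through LD_PRELOAD, such a hook only sees callers whose lookup consults the global scope. That is why the lookup order used by ccall(:symbol) matters for profilers.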

green-nsk commented 5 months ago

We already have a workaround for our case, but I can't see why inconsistent behaviour is acceptable. Also, I am worried there may be other unintended inconsistencies:

If all of that is not a concern, at the very least these different behaviours should be mentioned in the ccall documentation.

topolarity commented 5 months ago

julia internal symbol resolution is inconsistent with ccall() resolution. In my particular case, LD_PRELOAD=socket.so affects calls to socket() from inside Sockets.jl/libuv, but not from ccall(:socket). This is probably the worst one.

I agree that this is important.

The problem is that it's already inconsistent in 1.9:

$ cat socket.c
#include "stdio.h"
__attribute__((visibility("default")))
int socket(int domain, int type, int protocol) {
    fprintf(stderr, "Called LD_PRELOAD socket() hook\n");
    return 42;
}
$ LD_PRELOAD=./socket.so julia-1.9 -e 'using Sockets; TCPSocket(; delay = false)'
Called LD_PRELOAD socket() hook
$ LD_PRELOAD=./socket.so julia-1.9 -e 'using MySQL; DBInterface.connect(MySQL.Connection, "localhost", "user", "passwd")'
<no hook message>

Both of these use socket() internally, but only one picks up the LD_PRELOAD overload.

What's the deal? The difference here is that MySQL.jl is using libmariadbclient, which is loaded via JLL/ccall with RTLD_DEEPBIND, meaning that its symbol resolution is not affected by LD_PRELOAD.

In contrast, the "built-in" libraries (those defined via DEP_LIBS) are loaded without RTLD_DEEPBIND, which is why libuv does pick up the LD_PRELOAD override.
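
The two loading modes contrasted above can be sketched in C as follows, assuming glibc (RTLD_DEEPBIND is a glibc extension). The function names load_deepbind and load_default are hypothetical, and this is not the code Julia uses to load JLLs or its built-in libraries.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

/* Load a library so that its own symbol references prefer definitions in
 * the library and its dependencies over the global scope. LD_PRELOAD
 * overrides are then skipped for symbols it can resolve that way. */
void *load_deepbind(const char *path) {
    return dlopen(path, RTLD_LAZY | RTLD_DEEPBIND);
}

/* Load a library with default binding: its references are resolved against
 * the global scope first, so LD_PRELOAD overrides are honored. */
void *load_default(const char *path) {
    return dlopen(path, RTLD_LAZY);
}
```

With the mylib.so example from earlier in the thread, a call to socket2() in a library loaded through load_deepbind would be expected to reach libc's socket(), whereas one loaded through load_default would reach the preloaded hook.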

topolarity commented 5 months ago

I do wonder if we could try to restore the interposer behaviour a bit by looking up non-Julia symbols in the executable and Julia symbols in the library.

We could check to see which library the symbol actually resolved to, and if it's not one of the libjulia-* we could repeat the look-up with the old behavior?
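
One possible way to perform such a check, sketched in C under the assumption of glibc's dladdr(): map the resolved address back to the shared object that contains it and test the file name. The helper names and the "libjulia" prefix test are illustrative assumptions, not an existing Julia API.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>

/* Return 1 if the file name component of `path` starts with "libjulia". */
int is_libjulia_path(const char *path) {
    if (path == NULL) return 0;
    const char *slash = strrchr(path, '/');
    const char *base = slash ? slash + 1 : path;
    return strncmp(base, "libjulia", 8) == 0;
}

/* Return 1 if `addr` lies in a shared object whose file name starts with
 * "libjulia", and 0 otherwise (including when dladdr cannot map it). */
int resolved_in_libjulia(void *addr) {
    Dl_info info;
    if (dladdr(addr, &info) == 0 || info.dli_fname == NULL) return 0;
    return is_libjulia_path(info.dli_fname);
}
```

A lookup routine could call resolved_in_libjulia() on the address from the library-first search and, when it returns 0, repeat the search against the global scope so that LD_PRELOAD overrides are seen.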

green-nsk commented 5 months ago

We could check to see which library the symbol actually resolved to, and if it's not one of the libjulia-* we could repeat the look-up with the old behavior?

Wouldn't that mean that different Julia processes potentially loading different versions of JLLs will step on each other's toes?