JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Problems with musl's ldso on 1.6 #40556

Open jpsamaroo opened 3 years ago

jpsamaroo commented 3 years ago

For some reason, musl's loader refuses to reuse libraries that it's already loaded to service later requests for those same libraries. For example, libbz2.so from Bzip2_jll can be successfully dlopen'd on its own, but when trying to dlopen libfreetype.so from FreeType2_jll, we can see (through strace) musl's loader looking through entries in LD_LIBRARY_PATH instead. Trying to modify JLLWrappers.jl to do a withenv("LD_LIBRARY_PATH"=>join(LIBPATH_list, ':')) do ... around the dlopen does not seem to affect the search path, but I can set LD_LIBRARY_PATH manually outside of Julia as usual (so I hypothesize that the environment is cached when ld is first loaded?).

Of course, this doesn't always happen. For me on Alpine Linux, JLLs that do not depend on Bzip2_jll (and a few other problematic JLLs) seem to work fine. @giordano hypothesizes that it's because of a difference between the SONAME and NEEDED entries between dependent libraries, but I've found that even by using patchelf to change these entries, nothing of benefit happens (when it does work, it's only because the loader instead found a matching system library). However, it's totally possible I'm doing this wrong.

If anyone has suggestions on other things to look at or try, please let me know!

jpsamaroo commented 3 years ago

Also likely related is https://github.com/JuliaLang/julia/issues/36458

staticfloat commented 3 years ago

Can you use readelf -d to show the SONAME that libbz2.so advertises, and what libfreetype.so expects?

jpsamaroo commented 3 years ago

I can set them to match (and I think they do by default), but it doesn't appear that musl actually looks at SONAME, which explains why fiddling with SONAME doesn't really help. All it looks at is NEEDED, and whatever paths were originally passed to dlopen.

giordano commented 3 years ago

Building the library with the right SONAME (see the patch in Alpine) should then help, because we'd open the library in the JLL package under the same name as the NEEDED one.

jpsamaroo commented 3 years ago

I'm doing some printf debugging with ldso and julia, and also asking around on the musl IRC. It seems like we may also be doing something "wrong", in that we do a dlopen of .../libbz2.so, whereas the NEEDED path is libbz2.so.1.0, so ldso considers them not to match.

jpsamaroo commented 3 years ago

Ok, so I've spoken a bit with the musl devs, and here's the gist of it:

Some relevant quotes:

> then you don't want the needed at all. if lib A depends on symbols from lib B1, B2, or B3 but you don't know which, then you don't put a DT_NEEDED on lib B; you just omit the DT_NEEDED and load the appropriate libB[123] before loading A

> dlopen without a / in the argument means specifically to look for the library in the search path, not to use something loaded with an explicit path that just happens to have the same name

> Link statically

jpsamaroo commented 3 years ago

The musl devs have refused my proposal to patch ldso to look at SONAME and compare it with NEEDED entries. Therefore, I have prepared a patch which adventurous users may apply to their ldso to work around this issue.

WARNING: I do not make any claims about how safe this patch is to use in general; I recommend only applying this as your Julia interpreter, and that's it (patchelf --set-interpreter /path/to/ld-musl-*.so.1 /path/to/julia). Please let me know if you find any issues with it, and I'll update the patch!

--- ldso/dynlink.c  2021-01-14 20:26:00.000000000 -0600
+++ ldso/dynlink.c  2021-04-22 14:21:02.230108318 -0500
@@ -93,6 +93,7 @@
    struct td_index *td_index;
    struct dso *fini_next;
    char *shortname;
+   char *soname;
 #if DL_FDPIC
    unsigned char *base;
 #else
@@ -1047,6 +1048,8 @@
        for (p=head->next; p; p=p->next) {
            if (p->shortname && !strcmp(p->shortname, name)) {
                return p;
+           } else if (p->soname && !strcmp(p->soname, name)) {
+               return p;
            }
        }
        if (strlen(name) > NAME_MAX) return 0;
@@ -1153,6 +1156,12 @@
        return 0;
    }
    memcpy(p, &temp_dso, sizeof temp_dso);
+   for (int i=0; p->dynv[i]; i+=2) {
+       if (p->dynv[i] == DT_SONAME) {
+           p->soname = p->strings + p->dynv[i+1];
+           break;
+       }
+   }
    p->dev = st.st_dev;
    p->ino = st.st_ino;
    p->needed_by = needed_by;
@@ -1885,6 +1894,11 @@
    reclaim_gaps(&app);
    reclaim_gaps(&ldso);

+   /* Set initial soname */
+   app.soname = NULL;
+   ldso.soname = NULL;
+   vdso.soname = NULL;
+
    /* Load preload/needed libraries, add symbols to global namespace. */
    ldso.deps = (struct dso **)no_deps;
    if (env_preload) load_preload(env_preload);

License is MIT, same as musl.
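For readers unfamiliar with the loader internals, the SONAME-extraction loop in the patch can be modeled outside musl roughly as follows. This is only a sketch under stated assumptions: `find_soname`, `toy_dynv` and `toy_strings` are made-up names and data, and the ELF tag values are written out locally rather than taken from `<elf.h>`. As in the patch, the value of a DT_SONAME entry is treated as an offset into the object's string table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Standard ELF dynamic-section tags, written out so the sketch is standalone. */
enum { TOY_DT_NULL = 0, TOY_DT_NEEDED = 1, TOY_DT_SONAME = 14 };

/*
 * Hypothetical helper mirroring the loop in the patch: dynv is a flat array
 * of (tag, value) pairs terminated by a zero tag, and the value of a
 * DT_SONAME entry is a byte offset into the string table.
 * Returns NULL when the object declares no SONAME.
 */
static const char *find_soname(const size_t *dynv, const char *strings)
{
    assert(dynv && strings);
    for (size_t i = 0; dynv[i] != TOY_DT_NULL; i += 2) {
        if (dynv[i] == TOY_DT_SONAME)
            return strings + dynv[i + 1];
    }
    return NULL;
}

/* Made-up string table: offset 0 is the empty string, as in a real .dynstr. */
static const char toy_strings[] = "\0libc.musl-x86_64.so.1\0libbz2.so.1.0";

/* Made-up dynamic section: one NEEDED entry, then the SONAME. */
static const size_t toy_dynv[] = {
    TOY_DT_NEEDED, 1,   /* offset of "libc.musl-x86_64.so.1" */
    TOY_DT_SONAME, 23,  /* offset of "libbz2.so.1.0" */
    TOY_DT_NULL,   0,
};
```

Calling `find_soname(toy_dynv, toy_strings)` yields the string the patched lookup would compare against a requested name; an object without a DT_SONAME entry yields NULL and is skipped by the extra comparison.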

staticfloat commented 3 years ago

So if I'm understanding correctly, musl already has a fast-path for previously-loaded libraries, which is the p->shortname business.

The fast-path: https://github.com/ifduyue/musl/blob/aad50fcd791e009961621ddfbe3d4c245fd689a3/ldso/dynlink.c#L1048-L1052

Where p->shortname usually gets initialized: https://github.com/ifduyue/musl/blob/aad50fcd791e009961621ddfbe3d4c245fd689a3/ldso/dynlink.c#L1164

It seems to me that possibly, we don't get p->shortname filled in when we dlopen() because we pass in the full pathname? @jpsamaroo can you try experimenting with passing in imperfect (but still valid) pathnames like /path/to/../to/foo.so or ./foo.so and see if p->shortname is properly filled in? If that is the case, can you see if it then fast-paths and never hits the filesystem when loading things that depend on foo.so?

staticfloat commented 3 years ago

Pinging @richfelker, who may have a better idea on how to solve this.

Julia, as a language, relies on the behavior on glibc Linux, macOS, FreeBSD and Windows that if libfoo.so depends on libbar.so, and you have already dlopen("/path/to/libbar.so"), then when you dlopen("/other/path/to/libfoo.so"), if the filenames match (on glibc, if the SONAMEs match; on macOS, if the dylib IDs match) then the previously-loaded libbar.so is taken to be the dependency for libfoo.so.

This allows us to distribute libraries independently (e.g. we do a nix-like thing where we install binaries into content-addressed directories, and so the paths to different libraries are not knowable at compile-time or even at julia process start-up) with some wrapper code that performs the correct dlopen() calls, in order, to satisfy the dynamic linker. We manage compatibility ranges and whatnot using the typical Julia package resolver.

The difficulty on musl is that when we dlopen("/other/path/to/libfoo.so"), it searches the filesystem for libbar.so, despite us having loaded libbar.so previously. This appears to be because the short happy path is only taken on musl when p->shortname is initialized, and this only happens if you don't load a library by its full path. One possibility is the patch @jpsamaroo posted above, where we patch ldso to track the SONAME of objects that are loaded. Another possibility is the workaround I cooked up a few weeks ago where we have Julia manually insert the SONAME as p->shortname, but of course that is fragile and evil and I'd really like it to never see the light of day. (However, it does have the benefit of working on older musl versions).
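The lookup behaviour described in this comment can be modeled in a few lines. What follows is a toy, not musl's code: `toy_dso` and `toy_find_loaded` are invented names, the filesystem search is left out entirely, and the `use_soname` switch stands in for the extra comparison that the patch above adds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the loader's per-object record. */
struct toy_dso {
    const char *path;       /* name the object was loaded under */
    const char *shortname;  /* set only when the object was found via the search path */
    const char *soname;     /* DT_SONAME, if the object has one */
    struct toy_dso *next;
};

/*
 * Look for an already-loaded object that can satisfy a bare library name.
 * With use_soname == 0 this models the unpatched behaviour described in the
 * thread; with use_soname != 0 it models the patched lookup.
 */
static struct toy_dso *toy_find_loaded(struct toy_dso *head, const char *name,
                                       int use_soname)
{
    assert(name && !strchr(name, '/'));  /* only bare names take this path */
    for (struct toy_dso *p = head; p; p = p->next) {
        if (p->shortname && !strcmp(p->shortname, name))
            return p;
        if (use_soname && p->soname && !strcmp(p->soname, name))
            return p;
    }
    return NULL;  /* a real loader would now search the library path */
}
```

For an object loaded by explicit path (`shortname` is NULL, `soname` is "libbar.so"), `toy_find_loaded(head, "libbar.so", 0)` finds nothing and the loader falls through to the filesystem; with `use_soname` set, the same call returns the object that is already loaded.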

I'm curious to hear your thoughts about the best way to achieve feature parity here. As far as I can tell, there is no way for a process to satisfy the dynamic linker when loading libraries if it doesn't know the location of all the libraries beforehand, because in musl the only ways for us to influence the search locations are by embedding paths into objects (e.g. rpath/runpath, not usable here because we don't know the relative paths between dependencies at compile-time) or setting environment variables (e.g. LD_LIBRARY_PATH, not usable here because we don't know which libraries we're loading when we start up Julia, and modifying environment variables later doesn't work).

Thank you for your time!

richfelker commented 3 years ago

Can you explain what you're trying to achieve? The current behavior is intentional and serves to prevent modules loaded with an explicit pathname from shadowing libraries in the search path just because they happen to have the same name.

If libfoo depends on a version of libbar at a particular location (absolute or relative to libfoo), the intended way to represent that is via rpath in libfoo, using $ORIGIN base if needed. Is there a reason this does not work for what you're trying to achieve?

staticfloat commented 3 years ago

> Can you explain what you're trying to achieve?

We ship libraries to users outside of normal distribution channels; they do not get installed into e.g. /usr/local/lib, they get installed into Julia package "depots", which can live anywhere, such as in a system-wide location (like if you're working in a computer lab, and they have a selection of Julia packages that are preinstalled, living in /opt/julia-1.6/depot) or, more typically, in your home directory (e.g. ~/.julia). Libraries are installed into content-addressed locations, and if a library is already available in one depot, it does not get installed into another. So, for example, libfoo.so may exist at ~/.julia/artifacts/<content_hash>/lib/libfoo.so, and libbar.so may exist at /opt/julia-1.6/depot/artifacts/<content_hash>/lib/libbar.so. Because we serve precompiled binaries to users (so they do not have to compile GTK, Cairo, etc....), at compilation time we cannot know the relative paths beforehand. We also can't use LD_LIBRARY_PATH because we would need to set that up before starting the Julia interpreter, and because we expect to be able to install new packages and use them immediately in a single process session, we can't know all the paths that need to be pushed onto LD_LIBRARY_PATH either.

> The current behavior is intentional and serves to prevent modules loaded with an explicit pathname from shadowing libraries in the search path just because they happen to have the same name.

This makes sense; however, I believe the idea behind the SONAME mechanism is to provide exactly this behavior. If a library defines an SONAME for itself, it is essentially claiming that part of the library namespace and declaring that anyone that has a DT_NEEDED entry that matches exactly should instead be served by it. If there were no SONAME entries in our libraries, I would agree that this is intended behavior.

staticfloat commented 2 years ago

@richfelker just checking in with you to see if you have any suggestions on how we can get musl to support this use case. To give a really high-level overview of the possible paths forward:

1) We can cook up a patch to musl that is similar (or even identical) to @jpsamaroo's patch above and submit it for review again. This would cause the dependency resolution at dlopen() time to pay attention to the SONAME for libraries that are loaded by path, as well as those loaded by basename. This is, of course, my preferred route, but I think we may need some more feedback from you or other developers, since the first time Julian submitted it, it was apparently rejected.

2) We can patch musl locally and then statically embed our own patched loader into the julia executable. This is undesirable to me as I can imagine an old Julia version being run in a newer musl environment and causing problems when loading libraries that may depend on newer behaviors of musl.

3) We can perform dynamic runtime modification of the internal dso datastructures as described in my message above. This is undesirable as it's fragile (you need to know the exact dso structure layout, which changes from musl version to musl version) and is not forward-compatible.

4) The musl devs could construct some alternative mechanism by which a program can, at runtime, direct the search path of libraries. Since the fundamental issue is that we don't know the location of all libraries until well after program start, and all mechanisms (such as RPATH, LD_LIBRARY_PATH, etc...) are immutable after process startup, there's no way for us to load a libfoo.so that depends on libbar.so but doesn't know precisely where on disk it is stored. If we could manually instruct the loader to check a certain set of paths, and then change that set before loading each library, that would work as well, even though it's significantly more work than just loading each library individually by path as works on all the other platforms we support.

Thank you again for your time!

jpsamaroo commented 2 years ago

> This is, of course, my preferred route, but I think we may need some more feedback from you or other developers, since the first time Julian submitted it, it was apparently rejected.

Note that I did not in fact submit the patch to musl, since I was told on official IRC by a mod that they were not interested in SONAME support and would not accept the patch.

richfelker commented 2 years ago

I still don't understand why it's as complicated as it is, but a really stupid solution is just putting a temp dir in the library path and instead of dlopen(absolute_path), symlink(absolute_path, basename_in_temp_dir); dlopen(basename); unlink(basename_in_temp_dir);
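Spelled out in C, the suggestion might look like the sketch below. `dlopen_via_symlink` is a hypothetical helper, not an existing API; it assumes `link_dir` is already on the loader's search path (for example because it was in LD_LIBRARY_PATH when the process started), and the `RTLD_NOW | RTLD_GLOBAL` flags are an arbitrary choice for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: load the library at abs_path through a short-lived
 * symlink in link_dir, so that dlopen() is called with a bare name.
 * Returns the dlopen handle, or NULL on any failure.
 */
static void *dlopen_via_symlink(const char *abs_path, const char *link_dir)
{
    assert(abs_path && abs_path[0] == '/' && link_dir);

    /* abs_path starts with '/', so strrchr cannot return NULL here. */
    const char *base = strrchr(abs_path, '/') + 1;

    char link_path[4096];
    int n = snprintf(link_path, sizeof link_path, "%s/%s", link_dir, base);
    if (n < 0 || (size_t)n >= sizeof link_path)
        return NULL;

    if (symlink(abs_path, link_path) != 0)
        return NULL;

    /* No '/' in the argument, so the loader goes through its search path. */
    void *handle = dlopen(base, RTLD_NOW | RTLD_GLOBAL);

    /* The link is only needed while dlopen() runs. */
    unlink(link_path);
    return handle;
}
```

The sketch does not handle two libraries with the same basename, concurrent callers racing on the same link name, or a stale link left behind by a crash between `symlink` and `unlink`.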

StefanKarpinski commented 2 years ago

The fundamental constraint is that our libraries are pre-built and immutable (no patching them after installation) and you cannot know, at build time, what the exact path where they will be installed will be, yet they need to be able to depend on each other. This is why it's complicated. Although, I would argue this is how dynamic libraries should work, but that's a matter of opinion.

Your workaround is interesting—having to symlink a library into a temp directory for library loading seems... unfortunate, although I suppose at least we can probably count on tempdirs and symlinks working reasonably well on any musl system.

StefanKarpinski commented 2 years ago

Also, would you mind giving some rationale for the lack of SONAME support in musl? Is there some reason that feature would be problematic?

richfelker commented 2 years ago

I don't see it as just "SONAME support" but as the intersection of 2 behaviors: SONAME based lookup and insertion of libraries loaded via explicit pathnames into the set of names searched for dependencies. The latter of these was intentionally not done, as explained before. Doing it just for SONAMEs might be a reasonable behavior, but it's something with potentially far-reaching consequences that would call for a proposal, analysis, input from community, etc. This is a big ask for something that has only come up with one project using dlopen outside of the manner it's documented to work in. That's not to say it's impossible to happen sometime. I just don't think it makes sense to have resolving this issue rest on it.

Hello71 commented 2 years ago

I posted https://www.openwall.com/lists/musl/2021/12/16/1 "Satisfying DT_NEEDED from previous dlopens with explicit path" to the musl mailing list about this issue.

grasph commented 5 months ago

Hello,

I've bumped into the dlopen issue while testing a _jll I produced (issue report). Thanks to @giordano for pointing me to this thread.

Was a decision taken?

If it's not possible to change the linker behaviour, what about including in the linker search path a directory in which a link to each provided library (including ones that are not dlopen'ed) is created when a _jll is imported? Links to libraries loaded by a dlopen called in another context can also be added, so that the musl and glibc linkers behave similarly. Alternatively, the links can point to artifact directories, with the libraries' rpaths set accordingly.

Links to executables provided by the _jll's can also be included, avoiding the need for a long LD_LIBRARY_PATH to run them.

The difference with Rich's suggestion is that the links are kept during the whole Julia session and the solution does not rely on the linker skipping dependencies previously dlopen'd (using a relative or absolute path).

Philippe.

staticfloat commented 1 month ago

Jameson, Mosè, Valentin and I talked about this, and we came up with three ways forward:

  1. In-memory manipulation of the libc datastructures, as shown in https://github.com/JuliaPackaging/JLLWrappers.jl/pull/34. We don't like this because it's fragile and seems unnecessarily dangerous when musl changes its memory layout.
  2. Have julia re-exec itself with Julian's patched dynamic loader. This is better, but it won't work for embedding of Julia (e.g. where we can't re-exec ourselves nicely).
  3. Intercept dlopen() by patching the function address at runtime. There are a number of other interposers already (VirtualGL, UCX, etc...) and although this is somewhat cursed, it's probably the most workable option. In this scenario, we would be intercepting calls to dlopen() (we would need to ensure that all other loaded objects also get their own PLTs rewritten to point to our dlopen() implementation as well) and searching our loaded library cache for matching SONAMEs. If a matching SONAME is found, we return it, otherwise we sub off to the libc dlopen().
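The lookup half of option 3 could be sketched as below. All names (`toy_cache_add`, `toy_dlopen`, `open_fn`) are hypothetical, and the fixed-size table is only there to keep the example short. Rewriting PLT entries so that other loaded objects reach this function is not shown, and the sketch does nothing for DT_NEEDED resolution that happens inside the loader itself. The libc dlopen() is passed in as a function pointer, which an actual interposer would have to obtain by some other means.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_CACHE_MAX 64

/* Signature shared by dlopen() and by whatever stands in for it. */
typedef void *(*open_fn)(const char *name, int flags);

struct soname_entry {
    const char *soname;  /* e.g. "libbar.so.1" */
    void *handle;        /* handle returned when the library was first loaded */
};

static struct soname_entry toy_cache[TOY_CACHE_MAX];
static size_t toy_cache_len;

/* Record a library we loaded ourselves; returns 0 on success, -1 if full. */
static int toy_cache_add(const char *soname, void *handle)
{
    assert(soname && handle);
    if (toy_cache_len == TOY_CACHE_MAX)
        return -1;
    toy_cache[toy_cache_len].soname = soname;
    toy_cache[toy_cache_len].handle = handle;
    toy_cache_len++;
    return 0;
}

/*
 * Replacement entry point: a bare name that matches a cached SONAME is
 * answered from the cache; everything else is forwarded to real_open.
 * Passing NULL for real_open makes every cache miss return NULL.
 */
static void *toy_dlopen(const char *name, int flags, open_fn real_open)
{
    if (name && !strchr(name, '/')) {
        for (size_t i = 0; i < toy_cache_len; i++) {
            if (!strcmp(toy_cache[i].soname, name))
                return toy_cache[i].handle;
        }
    }
    return real_open ? real_open(name, flags) : NULL;
}
```

A real implementation would also need to keep reference counts consistent with dlclose(), which this sketch ignores.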

Honorable mention to the idea of building glibc's libdl, bundling that, and just using it for our dlopen(), but at that point why do we have a musl build at all.

giordano commented 1 month ago

For someone willing to implement solution 3, to get inspiration you can look at ucm_dlopen in https://github.com/openucx/ucx/blob/67c0310f43d74a8b06fe0ca0c88bb97444dd92d9/src/ucm/util/reloc.c#L562 and how it's used in the project. CC: @contra-bit.

grasph commented 1 month ago

Thanks Elliot, Jameson, Mosè, and Valentin for looking at the issue. Did you discuss the symbolic link solution? Did you rule it out, and if so, what would be the issues with this option?

staticfloat commented 1 month ago

In order to add a new directory onto the search path, we'd need to have it as part of LD_LIBRARY_PATH when Julia was launched, so we could create some kind of wrapper to create a temporary directory, set LD_LIBRARY_PATH to include that temporary directory, and then launch Julia. Then, anytime we do a Pkg operation (such as Pkg.add(), Pkg.rm(), Pkg.update(), Pkg.activate(), etc...) we'd need to analyze the set of JLLs within the environment and add/remove symlinks to their libraries. This would work for most JLLs (those with libraries not in the top-level lib directory might have a hard time here) but IMO it's a lot of extra work for a workaround that involves a lot of computation and disk activity.

grasph commented 1 month ago

The links could be set when loading the _jll modules. With a single directory tree, rpath and $ORIGIN can be used instead of setting LD_LIBRARY_PATH.

staticfloat commented 1 month ago

You need one per Julia environment, which cannot be done with just RPATH, and you need to be able to change it when you activate a different environment, which means it must be per-process and cannot be shared.

grasph commented 1 month ago

My idea was to still use the full path of the symbolic link for the dlopen and include a relative rpath. In that way the linker will be able to load the dependencies using the relative path. Reuse of already dlopened ones is left as a linker implementation choice.

If the links are created when loading the _jll (import), you need one directory per Julia process and no change is needed at environment activation.

Note that this also lets a _jll that provides many libraries, to be used with other shared libraries or executables, postpone loading each of them until it is used, so that only the libraries actually needed are loaded.

staticfloat commented 1 month ago

> My idea was to still use the full path of the symbolic link for the dlopen and include a relative rpath. In that way the linker will be able to load the dependencies using the relative path.

This does not work because the dynamic linker sees absolute paths, not symlinks; everything is dereferenced by the time the dynamic linker gets ahold of it. You can see this in action with this testing repository, where make simple works, but make symlinked errors out because libfoo.so cannot be found. While I demonstrate this with an executable, the same holds true for libraries.

richfelker commented 1 month ago

> 3. Intercept dlopen() by patching the function address at runtime. There are a number of other interposers already (VirtualGL, UCX, etc...) and although this is somewhat cursed, it's probably the most workable option. In this scenario, we would be intercepting calls to dlopen() (we would need to ensure that all other loaded objects also get their own PLTs rewritten to point to our dlopen() implementation as well) and searching our loaded library cache for matching SONAMEs. If a matching SONAME is found, we return it, otherwise we sub off to the libc dlopen().

I'm not clear how this would help. If you just want your own calls to dlopen to find the libraries you loaded by absolute path, you can use your own wrapper for dlopen without "patching the function address" from all your call points. But if you want these libraries to satisfy recursive dependencies for other libraries when dlopen is called, that will not work. Recursive dependencies are not recursive calls to dlopen. dlopen is transactional and resolves the whole dependency tree before committing anything, and fails the transaction if any part fails.

> Honorable mention to the idea of building glibc's libdl, bundling that, and just using it for our dlopen(), but at that point why do we have a musl build at all.

glibc's libdl is just glue to their underlying dynamic linker. It's not the dynamic linker/loader itself, and trying to use it with musl doesn't even conceptually make sense.

staticfloat commented 1 month ago

> But if you want these libraries to satisfy recursive dependencies for other libraries when dlopen is called, that will not work.

Yes, we want that. We also want to have the same logic applied (e.g. short-circuiting the search for a library if another library with the same SONAME has already been loaded) if a library itself decides that it wants to dynamically dlopen() something, hence the "must replace the dlopen() binding in all loaded modules" requirement.

richfelker commented 1 month ago

OK, in that case, if I understand things correctly, the solution is not to use ELF-level DT_NEEDED dependencies between your libraries (i.e. only use them for deps on system-level libs, not your language-runtime-managed libs) so that the system dynamic linker's dependency search/resolution is not involved. The only way this wouldn't work is if you're trying to use RTLD_LOCAL to avoid putting symbols in the global namespace. Otherwise, just load things in dependency-order via your own dependency management.
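Loading in dependency order amounts to a post-order walk of the dependency graph. A minimal sketch follows, with an invented `toy_lib` record and the actual dlopen() call replaced by recording the order. It assumes the graph has no cycles and that the dependency information comes from the runtime's own package metadata rather than from ELF headers.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_DEPS 4

struct toy_lib {
    const char *path;                    /* absolute path that would be handed to dlopen() */
    struct toy_lib *deps[TOY_MAX_DEPS];  /* direct dependencies, NULL-terminated if fewer */
    int loaded;                          /* set once the library has been visited */
};

/*
 * Append lib and everything it depends on to order[], dependencies first.
 * n is the number of entries already in order[], cap its capacity.
 * Returns the new number of entries, or (size_t)-1 if order[] is too small.
 */
static size_t toy_load_order(struct toy_lib *lib, struct toy_lib **order,
                             size_t n, size_t cap)
{
    assert(lib && order);
    if (lib->loaded)
        return n;
    lib->loaded = 1;  /* marking first keeps an accidental cycle from recursing forever */

    for (size_t i = 0; i < TOY_MAX_DEPS && lib->deps[i]; i++) {
        n = toy_load_order(lib->deps[i], order, n, cap);
        if (n == (size_t)-1)
            return n;
    }
    if (n == cap)
        return (size_t)-1;
    order[n] = lib;  /* a real loader would call dlopen(lib->path, ...) here */
    return n + 1;
}
```

If libfoo lists libbar and libbaz as dependencies, walking from libfoo places libbar and libbaz ahead of libfoo in `order[]`, and a second walk from any of them adds nothing.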

staticfloat commented 1 month ago

We're not loading libraries that we manage here; we're loading things like GTK. We're more or less running a userspace distribution, but for more than just Linux. We install our libraries into content-addressed storage (similar to Nix), and manually call dlopen() in dependency-order so as to load them with as little interaction with the dynamic linker's search as possible, as you recommend. However, we don't want to eliminate DT_NEEDED for two reasons:

  1. We package thousands of different pieces of software, using many different build systems. It would be infeasible to patch the build systems of these complex software projects to no longer embed DT_NEEDED. We could perform a post-processing pass to strip those DT_NEEDED entries out (only on musl targets, as every other libc doesn't require this workaround), I suppose.
  2. It is a nice property that if the individual content-addressed directories are copied over each other (e.g. all dependencies are unpacked into the same prefix), everything "just works" without any dependency-order loading. We make use of this in multiple places throughout the ecosystem, so I would be loath to remove this capability.