ValveSoftware / steam-runtime

A runtime environment for Steam applications

libcapsule fallback library path is incorrect on (most) 64-bit systems #704

Open asahilina opened 5 days ago

asahilina commented 5 days ago

Please describe your issue in as much detail as possible:

In this line, libcapsule uses /lib:/usr/lib as a fallback path for library lookup. This is incorrect for 64-bit libraries/systems, which typically use /lib64:/usr/lib64. It should use the correct path for the running system; for example, that path can be obtained by invoking the loader:

$ /lib/ld-linux.so.2 --list-diagnostics | grep 'path\.system_dirs' | cut -d= -f2 | tr -d '"'
/lib/
/usr/lib/
$ /lib64/ld-linux-x86-64.so.2 --list-diagnostics | grep 'path\.system_dirs' | cut -d= -f2 | tr -d '"'
/lib64/
/usr/lib64/

I experienced this myself under some circumstances. This is a user report of the same problem. It happens in particular when ld.so.cache has not been updated and does not contain a particular library that is available in /usr/lib64: the dynamic linker still finds the library via its fallback path, but libcapsule does not.
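To illustrate the mismatch, comparing the cache with the filesystem on an affected system looks something like this (libexample is just a placeholder for a library shipped by an add-on):

$ ldconfig -p | grep libexample
$ ls /usr/lib64/libexample.so.1
/usr/lib64/libexample.so.1

The first command prints nothing because the stale cache has no entry for the library, even though the file exists in the 64-bit system directory and the dynamic linker would find it there.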

The circumstances under which this matters are complex so I don't have easy repro steps, but I hope the bug/problem is clear from the description.

smcv commented 5 days ago

In this line, libcapsule uses /lib:/usr/lib as a fallback path for library lookup. This is incorrect for 64-bit libraries/systems, which typically use /lib64:/usr/lib64.

What architecture and OS are you using here?

Looking at current glibc source, the hard-coded path appears to be SYSTEM_DIRS, which is $(slibdir):$(libdir) from the glibc build system - and that is not straightforward to determine programmatically. (I'd really prefer not to have to invoke a command-line program with an option that is specifically labelled "diagnostics" to do something that is load-bearing!)

It seems distros also patch this, for example Debian ends up with /lib/x86_64-linux-gnu/:/usr/lib/x86_64-linux-gnu/:/lib/:/usr/lib/ instead of the path that is documented to be hard-coded.

This happens in particular when ld.so.cache has not been updated

libcapsule really relies on ld.so.cache to be up to date, so if there's an OS distribution that doesn't already reliably update that cache with package manager triggers, I'd recommend fixing that.

One possible approach to avoiding this concern would be to make pressure-vessel run the equivalent of ldconfig -X -C ~/tmp/ld.so.cache to make sure that an up-to-date cache exists by generating it ourselves (according to this specific OS's search path), and then tell libcapsule to parse that as though it was the host system's ld.so.cache. This has a time cost of ~ 2 seconds wall-clock time on a reasonably modern x86_64 with SSD, but we'd need to evaluate how long it takes on a worst-case system (an old/slow machine with HDD).
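A rough sketch of what that equivalent invocation could look like (the cache path here is purely illustrative):

$ ldconfig -X -C /tmp/pressure-vessel-ld.so.cache

-X avoids touching any symlinks in the library directories, and -C writes the cache to the given file instead of /etc/ld.so.cache; libcapsule would then be told to parse that file in place of the host's cache.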

smcv commented 5 days ago

libcapsule really relies on ld.so.cache to be up to date, so if there's an OS distribution that doesn't already reliably update that cache with package manager triggers, I'd recommend fixing that.

Specifically, mechanisms like soname-match:libnvidia-glcore.so.*, which we use to locate dlopen'd dependencies without knowing a specific name for them, rely on the ld.so.cache.
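For instance, an up-to-date cache is what lets a soname-match pattern resolve at all; on a host with the proprietary driver installed, the lookup libcapsule depends on is essentially the one the cache already answers (version and path are illustrative):

$ ldconfig -p | grep libnvidia-glcore
	libnvidia-glcore.so.<version> (libc6,x86-64) => /usr/lib64/libnvidia-glcore.so.<version>

If the library is on disk but the cache entry is missing, a soname-match pattern like this cannot find it.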

asahilina commented 5 days ago

What architecture and OS are you using here?

This is a Fedora Asahi Remix system, so aarch64, with FEX-Emu and an x86_64/x86 rootfs built out of Fedora packages. So from the point of view of libcapsule, essentially an x86_64 Fedora system.

libcapsule really relies on ld.so.cache to be up to date, so if there's an OS distribution that doesn't already reliably update that cache with package manager triggers, I'd recommend fixing that.

We are using overlays to compose add-ons for the rootfs, so there's no point at which the complete composed rootfs exists until they get mounted on end-user systems. This only affects libraries in the add-ons, since the ld.so.cache is up-to-date for the base rootfs. We could try to update the ld.so.cache when the environment is started up (in a tmpfs or something), but if that takes ~2 seconds that's already twice our muvm startup time, and then we'd have to develop a caching system so we don't do it every time...

Right now the workaround I proposed is adding the standard paths to DT_RUNPATH for the affected libraries. The reason why I opened this bug is that I think this is logically a libcapsule bug since it's a mismatch with the system ld.so search behavior, and I assume the intent is to match it.

(We actually ran into the same problem with the Flatpak extension on the native aarch64 system, though in that case the paths are nonstandard so the defaults don't work, and ld.so.cache is not being updated due to Flatpak not having any mechanism for such an update to be triggered when we update our package. The logic to regenerate ld.so.cache exists in Flatpak, there's just no way for us to trigger an invalidation of the cache yet from our end. The workaround is the same, DT_RUNPATH.)
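For reference, the DT_RUNPATH workaround can be applied to an already-built library with patchelf (the library path is just an example; recent patchelf writes DT_RUNPATH rather than DT_RPATH unless --force-rpath is passed):

$ patchelf --set-rpath /lib64:/usr/lib64 /path/to/addon/libexample.so.1
$ patchelf --print-rpath /path/to/addon/libexample.so.1
/lib64:/usr/lib64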

One possible approach to avoiding this concern would be to make pressure-vessel run the equivalent of ldconfig -X -C ~/tmp/ld.so.cache to make sure that an up-to-date cache exists by generating it ourselves (according to this specific OS's search path), and then tell libcapsule to parse that as though it was the host system's ld.so.cache. This has a time cost of ~ 2 seconds wall-clock time on a reasonably modern x86_64 with SSD, but we'd need to evaluate how long it takes on a worst-case system (an old/slow machine with HDD).

I think this might make more sense than us doing it, since the whole steam startup takes way longer proportionally, and it would fix the problem for any other users who end up in this situation. But I think if there's some way to get the system path reasonably, that might still be the path of least resistance?

smcv commented 5 days ago

The logic to regenerate ld.so.cache exists in Flatpak, there's just no way for us to trigger an invalidation of the cache yet from our end

If this is for what Flatpak calls "unmaintained extensions" (extensions that do not exist in a libostree repo), I suggested a possible approach to cache invalidation in https://github.com/flatpak/flatpak/issues/5948#issuecomment-2383775703.

smcv commented 5 days ago

We could try to update the ld.so.cache when the environment is started up (in a tmpfs or something), but if that takes ~2 seconds that's already twice our muvm startup time

How often does this happen? (Once per boot? Once per Steam startup? Once per x86 program startup? ...)

this might make more sense than us doing it, since the whole steam startup takes way longer proportionally

Unfortunately pressure-vessel starts from a clean slate every time a containerized game is launched (plus an extra time to run steamwebhelper, which behaves like a special case of a containerized native Linux game), rather than being something that Steam runs once during its own startup.

We intentionally don't try to cache information between runs, because as you've observed, cache invalidation is hard to get right (and in principle we would need to invalidate the cache every time a directory in the search path changes, which is certainly impossible to achieve if we don't always know what directories are part of the search path). With no cache internal to pressure-vessel, the worst case scenario is that non-atomic OS upgrades (.deb/.rpm style) temporarily break the running game until the user exits and restarts, which is easier to explain than "sometimes" requiring some sort of special cache invalidation step.

smcv commented 5 days ago

As a first step towards this, I think I'm going to add a debugging/development environment variable that will be used as if it was the hard-coded fallback search path. That solves half of the problem: it won't automatically learn from glibc what the hard-coded fallback search path is (because I'm not currently aware of a way to do this programmatically without screen-scraping the output of a diagnostic tool), but if you're an OS vendor who already knows what the right answer is for your particular OS, you can tell us.

smcv commented 5 days ago

We already run ldconfig once during each pressure-vessel startup, to build a ld.so.cache inside the container, because that turns out to be the only reliable way to tell games which libraries to load. We originally used LD_LIBRARY_PATH, but a lot of older games think they know better and will arbitrarily overwrite that. The price we pay for interoperability :-(
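That container-internal cache is built in essentially the same way: ldconfig is pointed at a configuration listing the container's library directories and asked to write the cache to a private path. Roughly (directory names are illustrative, not necessarily the exact ones pressure-vessel uses):

$ printf '%s\n' /overrides/lib/x86_64-linux-gnu /overrides/lib/i386-linux-gnu > /tmp/container-ld.so.conf
$ ldconfig -X -f /tmp/container-ld.so.conf -C /tmp/container-ld.so.cache

-f substitutes the given file for /etc/ld.so.conf, so the resulting cache covers exactly the directories the container wants games to use.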

asahilina commented 5 days ago

How often does this happen? (Once per boot? Once per Steam startup? Once per x86 program startup? ...)

Once per muvm startup, which for example would be once per Steam startup the way we package Steam today. Essentially "any time an x86 program is launched and no x86 programs are running yet" is the plan.

As a first step towards this, I think I'm going to add a debugging/development environment variable that will be used as if it was the hard-coded fallback search path. That solves half of the problem: it won't automatically learn from glibc what the hard-coded fallback search path is (because I'm not currently aware of a way to do this programmatically without screen-scraping the output of a diagnostic tool), but if you're an OS vendor who already knows what the right answer is for your particular OS, you can tell us.

This is similar to setting LD_LIBRARY_PATH but then we still have the arch problem. The correct search path is /lib:/usr/lib for 32-bit and /lib64:/usr/lib64 for 64-bit, there isn't one correct global value. If you make the env variable name contain the architecture, then that would work (and I'm happy to put this into muvm startup so it's transparent for users).

asahilina commented 5 days ago

The logic to regenerate ld.so.cache exists in Flatpak, there's just no way for us to trigger an invalidation of the cache yet from our end

If this is for what Flatpak calls "unmaintained extensions" (extensions that do not exist in a libostree repo), I suggested a possible approach to cache invalidation in flatpak/flatpak#5948 (comment).

Sorry, I didn't realize you're the same person as in that bug! ^^

smcv commented 5 days ago

[we run ldconfig once] per muvm startup, which for example would be once per Steam startup the way we package Steam today. Essentially "any time an x86 program is launched and no x86 programs are running yet" is the plan.

I think it would be best if you can generate a ld.so.cache once per muvm startup (or perhaps once per Steam startup), then: Steam startup costs a lot more than 2 seconds anyway, and having a correct ld.so.cache is already something that we document as one of our distro assumptions. And if the x86 rootfs, its addons, and how they are composed into one tree are under your control, then you are much better-placed than we are to know how often that cache needs to be regenerated.

This is similar to setting LD_LIBRARY_PATH but then we still have the arch problem. The correct search path is /lib:/usr/lib for 32-bit and /lib64:/usr/lib64 for 64-bit, there isn't one correct global value. If you make the env variable name contain the architecture, then that would work (and I'm happy to put this into muvm startup so it's transparent for users).

Yes-ish, but ld.so already knows how to accept and ignore wrong-architecture libraries by parsing their ELF header (which is why we only need one LD_LIBRARY_PATH). So if we simplified this down to something like PRESSURE_VESSEL_FALLBACK_LIBRARY_PATH=/lib:/usr/lib:/lib64:/usr/lib64 in your case, the only harm that would do is that it could slightly change the loading behaviour if you somehow had an x86_64 library that only existed in /[usr/]lib or an i386 library that only existed in /[usr/]lib64 - in which case my first suggestion would be "don't do that, then" :-)
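If an interface of that shape were adopted (the variable name here is only the tentative suggestion above, not something that exists today), muvm startup could export one combined value for both architectures before launching Steam:

$ export PRESSURE_VESSEL_FALLBACK_LIBRARY_PATH=/lib:/usr/lib:/lib64:/usr/lib64
$ steam

ld.so's ELF-header check would then cause the wrong-architecture directories to be skipped for each word size.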

smcv commented 5 days ago

I should have pointed out before, but: Steam is designed for x86 machines, and running Steam on an ARM system emulating an x86 is not really something that we support.

We've added code to pressure-vessel to cope with FEX oddities in the past, and I'm willing to add reasonable amounts of extra code to handle this scenario because what you're doing with system emulation is fascinating and terrifying :-) but in cases where it conflicts with other work, we will usually have to prioritize users of ordinary x86 distros.