ValveSoftware / steam-runtime

A runtime environment for Steam applications
Other
1.18k stars 86 forks source link

The environment variable `LD_LIBRARY_PATH` (when externally set) is not correctly honored. #274

Open felixdoerre opened 4 years ago

felixdoerre commented 4 years ago

Your system information

Please describe your issue in as much detail as possible:

As of commit 7f8e8411236551cdcebd1cfd09de3758c2c8015a an external LD_LIBRARY_PATH is not handled correctly anymore. When an external LD_LIBRARY_PATH is set, the semantics are that those directories take precedence over the paths hardcoded by ld.so.

When running an application with LD_LIBRARY_PATH the effective path is:

  1. LD_LIBRARY_PATH
  2. system libraries

With steam-runtime, the effective (observed) LD_LIBRARY_PATH is now constructed as follows:

  1. pinned steam-runtime libraries
  2. system libraries
  3. steam-runtime libraries
  4. external LD_LIBRARY_PATH

The correct (expected) order would be:

  1. pinned steam-runtime libraries
  2. external LD_LIBRARY_PATH
  3. system libraries
  4. steam-runtime libraries

Steps for reproducing this issue:

  1. run LD_LIBRARY_PATH=foo steam
  2. observe the LD_LIBRARY_PATH for the steam executable (e.g. by inspecting /proc/<pid of steam>/environ)

Concrete impact:

Correct handling of LD_LIBRARY_PATH is important for primus and primus-vk, as the activation of primus happens through overriding the system libGL.so.1 with the one provided by primus, by setting LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/primus:${LD_LIBRARY_PATH}, where the primus libGL.so.1 is located under /usr/lib/x86_64-linux-gnu/primus/libGL.so.1 effectively overriding the system-provided one (which is then internally loaded by the primus libGL.so.1).

With the mechanism the steam runtime uses, this breaks activation for primus, as now the LD_LIBRARY_PATH is:

  1. pinned steam-runtime libraries
  2. system libraries
  3. steam-runtime libraries
  4. primus LD_LIBRARY_PATH
  5. external LD_LIBRARY_PATH.

In this configuration the system libraries (number 2) take precedence over the primus libs (number 4). This prevents primus from activating and launching steam (or the games launched by steam) with a dedicated graphics card enabled.

Note that, when activating primus inside the steam runtime (e.g. by setting the game launch command to pvkrun %command%) the effective order:

  1. primus LD_LIBRARY_PATH
  2. pinned steam-runtime libraries
  3. system libraries
  4. steam-runtime libraries
  5. external LD_LIBRARY_PATH allows primus and the whole system to function even with the current settings for LD_LIBRARY_PATH.
kisak-valve commented 4 years ago

Duplicate of https://github.com/ValveSoftware/steam-for-linux/issues/4822.

felixdoerre commented 4 years ago

Aaah, right. Yes, I have only searched this repo (as the bug is here), so I did not find the other issue. However the situation has worsened, as previously the problem could be sidestepped with STEAM_RUNTIME_PREFER_HOST_LIBRARIES=0. However now STEAM_RUNTIME_PREFER_HOST_LIBRARIES has no effect anymore, right? Its also a bit concerning as the referenced issued has no activity since 2017. Also the fix seems rather trivial to me: I see two possibilities:

  1. Change the mechanic how steam_runtime_library_paths are used, by appending them to LD_LIBRARY_PATH instead of prepending them. This would have to be done here: https://github.com/ValveSoftware/steam-runtime/blob/5fc1299e8aecb569315ae0a625c57942be1d0b78/templates/run.sh#L101 and in steam.sh (I don't know how easy it is to make a change there)
  2. Just insert the "old" LD_LIBRARY_PATH additionally in the middle in run.sh, here: https://github.com/ValveSoftware/steam-runtime/blob/5fc1299e8aecb569315ae0a625c57942be1d0b78/templates/run.sh#L92 and then 2a. override LD_LIBRARY_PATH instead of prepending in steam.sh and run.sh 2b. just keep LD_LIBRARY_PATH handling in steam.sh and run.sh as-is.

option 2a is entirely possible touching this repo only. Would you welcome (and merge) a PR for that?

smcv commented 4 years ago

Would you welcome (and merge) a PR for that?

I agree that having run.sh override an external LD_LIBRARY_PATH is unexpected. It's a side-effect of how it works, rather than anything particularly deliberate - but having said that, part of what we want from the Steam Runtime is to impose a stable, predictable set of libraries, and obviously per-user overrides of the LD_LIBRARY_PATH work against that!

In an ideal world, we would be using the parts of a user-specified LD_LIBRARY_PATH that are part of a targeted, use-case-specific thing like Primus, but not using any parts of a user-specified LD_LIBRARY_PATH that are just overriding libraries generically at unpredictable versions. That's probably impossible, though - we can't tell which search path entries are good and which are bad.

If we do this, I think it's best done in a coordinated way between steam.sh and run.sh. The effort here is in figuring out the library search order we want, verifying that this is actually what we're getting, and checking for regressions afterwards: the actual mechanics of patching the scripts are the easy part. So, thanks for offering, but I don't think a PR is going to make this happen any quicker.

The other moving parts here are in the pressure-vessel container tool used for the Steam container runtime ("Steam Linux Runtime"), which we want to be able to rely on more in future. If we are looking at how run.sh deals with an externally-imposed LD_LIBRARY_PATH, we should make sure pressure-vessel is doing similar things - in particular, for your Primus use-case, presumably you want the container to steal Primus' libGL.so.1 from the host system, instead of the normal, system-wide libGL.so.1 that is used for random desktop apps?

smcv commented 4 years ago

previously the problem could be sidestepped with STEAM_RUNTIME_PREFER_HOST_LIBRARIES=0

... at the cost of causing many other problems (particularly for users of the Mesa open-source graphics drivers, which typically need versions of OS libraries that are years newer than the ones in the Steam Runtime).

There should be a solution for things like Primus, but STEAM_RUNTIME_PREFER_HOST_LIBRARIES=0 is definitely not it.

smcv commented 4 years ago

The correct (expected) order would be:

  1. pinned steam-runtime libraries
  2. external LD_LIBRARY_PATH
  3. system libraries
  4. steam-runtime libraries

In steam.sh, the external LD_LIBRARY_PATH is saved as SYSTEM_LD_LIBRARY_PATH. run.sh doesn't do that right now, but perhaps it should.

I wonder whether this is really the expected order, or whether this would be better:

  1. external LD_LIBRARY_PATH
  2. pinned steam-runtime libraries
  3. system libraries
  4. steam-runtime libraries

That would cluster all the Steam-Runtime-managed stuff together, and would be consistent with what happens when you put pvkrun %command% or env LD_LIBRARY_PATH=... %command% in the launch options.

The down side of doing this is that if you have something in your LD_LIBRARY_PATH that is going to break Steam or the game (whether that's because it's older than a backported library we ship, or because we expect a very specific ABI like we do for libcurl), letting it "win" is not what we want; so perhaps the order you suggested is indeed the best.

smcv commented 4 years ago

Now that GLVND is widespread, it's maybe a better and more future-proof way to override the choice of GL driver: instead of substituting an entire libGL.so.1, it would likely be better to go down a layer and tell GLVND to load an appropriate implementation (either NVIDIA proprietary, Mesa, or a Primus-like hybrid that uses the NVIDIA proprietary driver for the "expensive" operations and then copies the frames into the right places for output).

felixdoerre commented 4 years ago

Thanks for the replies.

in particular, for your Primus use-case, presumably you want the container to steal Primus' libGL.so.1 from the host system, instead of the normal, system-wide libGL.so.1 that is used for random desktop apps?

Yes, but not only that. But probably I don't understand the Pressure Vessels good enough. I haven't tried them yet. How do you "steal" libraries from the host system? Is the hosts filesystem available? Because for EGL or Vulkan you need various configuration files as well, and ideally I would want primus-vk to work as well, also inside the pressure vessels.

instead of substituting an entire libGL.so.1, it would likely be better to go down a layer and tell GLVND to load an appropriate implementation

Yes, that would probably be better. However GLVND was not widespread when Primus was developed and until now no-one has found the time to write it fresh for the glvnd scenario. And I am not really sure how smooth that would work. Primus-vk is currently (apart from a crude workaround for what I would consider a bug in the nvidia driver) just a simple Vulkan Layer. However that required changes in the loader as well, as the "exotic" usecase of a layer instantiating two graphics devices was not really thought through until the end.

The down side of doing this is that if you have something in your LD_LIBRARY_PATH that is going to break Steam or the game (whether that's because it's older than a backported library we ship, or because we expect a very specific ABI like we do for libcurl), letting it "win" is not what we want

I am not really sure that we should really care that much. Having something in the LD_LIBRARY_PATH is (in my opinion) very close to just installing it system-wide. Most users have root access on their system, so I'd treat a non-empty LD_LIBRARY_PATH equal to a user that has fiddled with their system libraries. And that's essentially what all applications do (that don't treat LD_LIBRARY_PATH specially).

Just some small rationale, why I chose the "expected" order the way I did: When the runtime explicitly parses out the system library paths from ldconfig's output, it wants to know the paths where libraries are "usually" resolved. And I'd argue that this logic is just incomplete, because "usually" the LD_LIBRARY_PATH is always consulted before the hard-coded paths from ldconfig.

smcv commented 4 years ago

Primus-vk is currently (apart from a crude workaround for what I would consider a bug in the nvidia driver) just a simple Vulkan Layer.

We will need to teach pressure-vessel about Vulkan layers to make this work. At the moment, it imports Vulkan drivers (ICDs) from the host system, but ignores layers.

But probably I don't understand the Pressure Vessels good enough. I haven't tried them yet. How do you "steal" libraries from the host system? Is the hosts filesystem available? Because for EGL or Vulkan you need various configuration files as well, and ideally I would want primus-vk to work as well, also inside the pressure vessels.

It's a container runtime system a bit like Flatpak, intended to give games a predictable library stack to run on. However, because the predictable library stack that is currently used by all Steam games (Steam Runtime 1 'scout') is ancient, it does not contain a graphics driver that will support anyone's hardware. This means we have to import the graphics drivers (GLX, EGL, Vulkan, VDPAU, VA-API) from the host system, together with any of their dependencies that are newer than what's in the container (notably glibc and libstdc++), and arrange for them to be used in preference to the container's versions.

Parts of the host filesystem are available, but at a different filesystem location (in /run/host), so we have to rewrite the JSON files describing EGL and Vulkan drivers (ICDs) to point to the driver's location inside the container. We already do this for the ICDs, but not for layers yet.

At some point in the future, when all this is more reliable, we'll start offering a second, newer, container-only library stack (Steam Runtime 2 'soldier'), and newer games will start to be compiled against that instead of scout. However, because the library stack is meant to be a fixed point but the host graphics drivers need to be at least as recent as the user's hardware, we will still need to combine 'soldier' with the host system's graphics stack to get the final environment to be used for the game.

See https://github.com/ValveSoftware/steam-runtime/tree/master/doc and https://archive.fosdem.org/2020/schedule/event/containers_steam/ for more details.

smcv commented 4 years ago

For reference, what you're currently using (and proposing changes for) is: https://github.com/ValveSoftware/steam-runtime/blob/master/doc/possible-designs.md#2018-ld_library_path-scout-runtime

and the container-based setup is: https://github.com/ValveSoftware/steam-runtime/blob/master/doc/possible-designs.md#2018-ld_library_path-scout-runtime--2019-pressure-vessel-scout-platform

felixdoerre commented 4 years ago

Ok, thanks for the reference. Personally I have not need to run steam itself with primus/pvkrun, as I am totally happy with just wrapping the individual games. It's a bit annoying to add pvkrun to all individual games' command, but that's ok for me.

So I guess we are trying to solve 2 (rather distinct) problems:

  1. get pvkrun/primusrun into the LD_LIBRARY_PATH-runtime: Here adding the LD_LIBRARY_PATH where I suggested (essentially treating it as "host given"), would solve the problem. I am not sure how common users are that individually mess up their LD_LIBRARY_PATH, but I would assume they are rare.
  2. get pvkrun/primusrun into the pressure vessel: I believe we need to "install" primus/pvkrun inside the pressure vessel. If we accept that steam is fully aware of pimus/pvkrun one could think of configurable run options (like e.g. lutris).

We already do this for the ICDs, but not for layers yet.

Is the code that tries to scrape the ICDs of the host system openly available? I didn't see that.

smcv commented 4 years ago

Is the code that tries to scrape the ICDs of the host system openly available?

Not as a git repository yet (there are some infrastructure issues that need to be solved first), but source code is available: https://repo.steampowered.com/steamrt/pool/main/s/steam-runtime-tools/ (take the latest .tar.gz).

In older versions, steam-runtime-tools provided infrastructure and diagnostic tools, and https://repo.steampowered.com/steamrt/pool/main/p/pressure-vessel/ did the container setup. Newer versions have been combined into one project, so pressure-vessel is just a subdirectory in the steam-runtime-tools source release.

smcv commented 4 years ago

get pvkrun/primusrun into the pressure vessel

We don't need the wrapper script (and we can't rely on arbitrary programs from the host working in the container anyway) - but we do need the libraries that it causes the game to load, and their dependencies.

My understanding is that Primus is mainly relevant for the Fermi/GF1xx (GeForce 4xx) generation of GPUs when using the 390.x "legacy" proprietary drivers, because:

Is that accurate?

felixdoerre commented 4 years ago

We don't need the wrapper script

How would a user then start primus/primus_vk in the pressure vessel? Do you want to require a user who wants to run games via primus/primus_vk to also invoke steam via primus/primus_vk. I'd prefer to run steam in non-hybrid mode and use that only for individual games. That's why I would like explicit support for primus/primus_vk inside the pressure vessel. For activating primus/primus_vk from inside the pressure vessel you would need to mount the bumblebee socket inside the container. How is this generally handled? Are the X-Server sockets bind-monted inside? What about pulseaudio?

Your observations regarding the state of primus is accurate in theory, but not fully in practice:

(Wayland also plays into this whole mess.... Primus-Vk works flawlessly on wayland)

felixdoerre commented 4 years ago

I've just opened the pressure-vessel source. I will look at it in more detail, but right now I noticed a particularity: You copy icd json files from /usr/share/vulkan/icd.d. However the vulkan loader also looks in /etc/vulkan/icd.d/, which some distributions use to exclusively store all their icds. I assume that you left out ~/.local/share/vulkan/icd.d deliberately, as you want to exclude changes by the current user explicitly, right?

smcv commented 4 years ago

The fundamental problem here is that Primus is complicated, and putting games in containers or another predictable runtime environment is complicated, so combining the two has a lot of moving parts. The more code paths we have, the more bugs we have; so don't expect this to be something that can be implemented immediately.

Inserting Primus into the traditional LD_LIBRARY_PATH-based runtime is (as you've noticed) a lot quicker than putting it into pressure-vessel, but the traditional LD_LIBRARY_PATH-based runtime tends to be rather fragile (all sorts of OS configurations can break it, and changes can have unexpected consequences), so we need to be careful not to cause regressions.

You copy icd json files from /usr/share/vulkan/icd.d. However the vulkan loader also looks in /etc/vulkan/icd.d/, which some distributions use to exclusively store all their icds

srt_system_info_list_vulkan_icds() should look in all the same places as the reference Vulkan loader. Please open a separate issue if it doesn't. The steam-runtime-system-info diagnostic tool looks in the same locations (it calls the same function), and should list all your Vulkan ICDs.

pressure-vessel writes them into /overrides/share/vulkan/icd.d, and configures the Vulkan loader to look there exclusively.

How would a user then start primus/primus_vk in the pressure vessel?

It can still be in a game's launch options, which would wrap the whole container process tree with primus (at least in the latest development versions, in conjunction with the Steam beta - I'm not so sure about older versions).

Are the X-Server sockets bind-monted inside?

Yes. If Primus uses multiple X servers then we might have to bind-mount the entire /tmp/.X11-unix rather than individual server sockets, which is unfortunate but doable.

felixdoerre commented 4 years ago

pressure-vessel writes them into /overrides/share/vulkan/icd.d, and configures the Vulkan loader to look there exclusively.

Ooh yes, sorry. I looked too quick.

so we need to be careful not to cause regressions.

Ok, so how do we do that? I'd like to focus on the LD_LIBRARY_PATH-based runtime for now and try to tackle pressure vessel when that works.

Just a quick note, in case it interests you: I remembered you saying in the talk that pressure vessel currently does not create a pid namespace as steam gets confused when the pids of its games are wrong. I have another reason what might be tricky: (I run my steam + all games started from steam in a small container [created by a 200 line shellscript] mounting large parts of the host-os read-only and individually mounting X-Server and pulseaudio, and it works surprisingly well. I use a pid-ns to isolate the whole steam tree from the rest of the system) When mounting pulseaudio inside a new pidns where the pulseaudio server is not visible, pulseaudio --check will fail, even though pulseaudio itself is working perfectly fine. This is caused by pulseaudio checking the availability of its server by looking inside the pulseaudio.pid file and not checking the socket. I had at least one game, that invoked pulseaudio --check to see if it could use pulseaudio, and if that wasn't the case, fell back to a broken alsa audio output. Summing it up: the game developers said "it works on Ubuntu, so we can't do anything more for you", and the only possibility that remained was providing a new pulseaudio executable inside the container, that faked the output of --check-

felixdoerre commented 3 years ago

Just as this issue has been dormant for some time now:

Could we please fix the LD_LIBRARY_PATH-based runtime now? Waiting will not make it any more clear how the correct change will look like and will not make the LD_LIBRARY_PATH-based runtime less fragile. If you think that any steps/testing/"simpler version of the change" is needed before that I'd be glad to help there. What would you think are the next steps for adjusting this behavior?

I'd be also happy to help integrating bumblebee, primus and primus-vk into the pressure-vessel system, but I guess this should be discussed in a separate issue. Setting up primus and primus-vk is not that complicated. (primus probably a bit more than primus_vk even). For primus you need to make the primus LD_LIBRARY_PATH-Folder the bumblebee socket and the socket of the ephemerally spawned X-Server available. For primus_vk it's just a vulkan-layer that would need to be copied with its one shared object and its json file. However primus_vk depends (to some extent) on primus, so the primus-complexity is "included" there.

smcv commented 3 years ago

Could we please fix the LD_LIBRARY_PATH-based runtime now?

It's in progress, but there are other, higher-priority things also in progress.

TTimo commented 3 years ago

Next beta client update (> Oct 28) will change how the host's LD_LIBRARY_PATH settings are used to setup the scout runtime LD_LIBRARY_PATH (pinned libraries, then host LD_LIBRARY_PATH, then the rest of the scout runtime library paths).

wferi commented 3 years ago

Primus-vk is currently [...] just a simple Vulkan Layer.

We will need to teach pressure-vessel about Vulkan layers to make this work. At the moment, it imports Vulkan drivers (ICDs) from the host system, but ignores layers.

At least the soldier pressure-vessel uses the --import-vulkan-layers option. Is that a new development (which your comment predates) or something totally unrelated? (The WRAP_GUI didn't start a terminal inside the container for me, so I couldn't check it out in detail yet.)

I'm investigating this because my kids' favourite games depend on Proton 5.13, which requires the soldier runtime and pressure-vessel AFAIUI, which is incompatible with Primus(-VK), which is the only practical way of using the discrete card (short of upgrading to bullseye and keeping the dedicated GPU powered up all the time, right?)

smcv commented 3 years ago

Is that a new development (which your comment predates)

Yes. pressure-vessel now tries to pull in Vulkan layers, although there have been several bugs in that after the initial implementation, which we're continuing to fix (as with everything else in this ecosystem, it's more complicated than it looks).

wferi commented 3 years ago

Yay, it's great to see progress, thanks for the info! It certainly looks complicated, so no need to look for excuses; I botched much simpler things on first try. Second, even. :) I'm happy to help with testing (or whatever) if needed.

fxthomas commented 3 years ago

This wasn't easily searchable, so for other people stumbling on this: nvidia-xrun also sets LD_LIBRARY_PATH and doesn't work with the new runtime. Is there any workaround that we could use to go back to the old behavior temporarily?

smcv commented 3 years ago

@fxthomas Please open a separate issue with full details, including as much context as you can provide for what nvidia-xrun does and why it is necessary.

Getting arbitrary third-party code across a container boundary is not straightforward, so if we let the scope of an issue grow to "it isn't possible to inject arbitrary third-party code into every game process" then it is unlikely to be feasible to resolve it. If the scope of an issue is something narrower, then that might be solvable.

fxthomas commented 3 years ago

Please open a separate issue with full details, including as much context as you can provide for what nvidia-xrun does and why it is necessary.

I will, thanks for the clarification!