Open felixdoerre opened 4 years ago
Aaah, right. Yes, I have only searched this repo (as the bug is here), so I did not find the other issue.
However the situation has worsened, as previously the problem could be sidestepped with STEAM_RUNTIME_PREFER_HOST_LIBRARIES=0
. However now STEAM_RUNTIME_PREFER_HOST_LIBRARIES
has no effect anymore, right?
Its also a bit concerning as the referenced issued has no activity since 2017.
Also the fix seems rather trivial to me: I see two possibilities:
steam_runtime_library_paths
are used, by appending them to LD_LIBRARY_PATH instead of prepending them. This would have to be done here: https://github.com/ValveSoftware/steam-runtime/blob/5fc1299e8aecb569315ae0a625c57942be1d0b78/templates/run.sh#L101 and in steam.sh (I don't know how easy it is to make a change there)run.sh
, here: https://github.com/ValveSoftware/steam-runtime/blob/5fc1299e8aecb569315ae0a625c57942be1d0b78/templates/run.sh#L92 and then
2a. override LD_LIBRARY_PATH instead of prepending in steam.sh and run.sh
2b. just keep LD_LIBRARY_PATH handling in steam.sh and run.sh as-is.option 2a is entirely possible touching this repo only. Would you welcome (and merge) a PR for that?
Would you welcome (and merge) a PR for that?
I agree that having run.sh
override an external LD_LIBRARY_PATH
is unexpected. It's a side-effect of how it works, rather than anything particularly deliberate - but having said that, part of what we want from the Steam Runtime is to impose a stable, predictable set of libraries, and obviously per-user overrides of the LD_LIBRARY_PATH
work against that!
In an ideal world, we would be using the parts of a user-specified LD_LIBRARY_PATH
that are part of a targeted, use-case-specific thing like Primus, but not using any parts of a user-specified LD_LIBRARY_PATH
that are just overriding libraries generically at unpredictable versions. That's probably impossible, though - we can't tell which search path entries are good and which are bad.
If we do this, I think it's best done in a coordinated way between steam.sh
and run.sh
. The effort here is in figuring out the library search order we want, verifying that this is actually what we're getting, and checking for regressions afterwards: the actual mechanics of patching the scripts are the easy part. So, thanks for offering, but I don't think a PR is going to make this happen any quicker.
The other moving parts here are in the pressure-vessel container tool used for the Steam container runtime ("Steam Linux Runtime"), which we want to be able to rely on more in future. If we are looking at how run.sh
deals with an externally-imposed LD_LIBRARY_PATH
, we should make sure pressure-vessel is doing similar things - in particular, for your Primus use-case, presumably you want the container to steal Primus' libGL.so.1
from the host system, instead of the normal, system-wide libGL.so.1
that is used for random desktop apps?
previously the problem could be sidestepped with
STEAM_RUNTIME_PREFER_HOST_LIBRARIES=0
... at the cost of causing many other problems (particularly for users of the Mesa open-source graphics drivers, which typically need versions of OS libraries that are years newer than the ones in the Steam Runtime).
There should be a solution for things like Primus, but STEAM_RUNTIME_PREFER_HOST_LIBRARIES=0
is definitely not it.
The correct (expected) order would be:
- pinned steam-runtime libraries
- external LD_LIBRARY_PATH
- system libraries
- steam-runtime libraries
In steam.sh
, the external LD_LIBRARY_PATH
is saved as SYSTEM_LD_LIBRARY_PATH
. run.sh
doesn't do that right now, but perhaps it should.
I wonder whether this is really the expected order, or whether this would be better:
That would cluster all the Steam-Runtime-managed stuff together, and would be consistent with what happens when you put pvkrun %command%
or env LD_LIBRARY_PATH=... %command%
in the launch options.
The down side of doing this is that if you have something in your LD_LIBRARY_PATH
that is going to break Steam or the game (whether that's because it's older than a backported library we ship, or because we expect a very specific ABI like we do for libcurl), letting it "win" is not what we want; so perhaps the order you suggested is indeed the best.
Now that GLVND is widespread, it's maybe a better and more future-proof way to override the choice of GL driver: instead of substituting an entire libGL.so.1
, it would likely be better to go down a layer and tell GLVND to load an appropriate implementation (either NVIDIA proprietary, Mesa, or a Primus-like hybrid that uses the NVIDIA proprietary driver for the "expensive" operations and then copies the frames into the right places for output).
Thanks for the replies.
in particular, for your Primus use-case, presumably you want the container to steal Primus' libGL.so.1 from the host system, instead of the normal, system-wide libGL.so.1 that is used for random desktop apps?
Yes, but not only that. But probably I don't understand the Pressure Vessels good enough. I haven't tried them yet. How do you "steal" libraries from the host system? Is the hosts filesystem available? Because for EGL or Vulkan you need various configuration files as well, and ideally I would want primus-vk to work as well, also inside the pressure vessels.
instead of substituting an entire libGL.so.1, it would likely be better to go down a layer and tell GLVND to load an appropriate implementation
Yes, that would probably be better. However GLVND was not widespread when Primus was developed and until now no-one has found the time to write it fresh for the glvnd scenario. And I am not really sure how smooth that would work. Primus-vk is currently (apart from a crude workaround for what I would consider a bug in the nvidia driver) just a simple Vulkan Layer. However that required changes in the loader as well, as the "exotic" usecase of a layer instantiating two graphics devices was not really thought through until the end.
The down side of doing this is that if you have something in your LD_LIBRARY_PATH that is going to break Steam or the game (whether that's because it's older than a backported library we ship, or because we expect a very specific ABI like we do for libcurl), letting it "win" is not what we want
I am not really sure that we should really care that much. Having something in the LD_LIBRARY_PATH
is (in my opinion) very close to just installing it system-wide. Most users have root access on their system, so I'd treat a non-empty LD_LIBRARY_PATH
equal to a user that has fiddled with their system libraries. And that's essentially what all applications do (that don't treat LD_LIBRARY_PATH
specially).
Just some small rationale, why I chose the "expected" order the way I did: When the runtime explicitly parses out the system library paths from ldconfig
's output, it wants to know the paths where libraries are "usually" resolved. And I'd argue that this logic is just incomplete, because "usually" the LD_LIBRARY_PATH
is always consulted before the hard-coded paths from ldconfig
.
Primus-vk is currently (apart from a crude workaround for what I would consider a bug in the nvidia driver) just a simple Vulkan Layer.
We will need to teach pressure-vessel about Vulkan layers to make this work. At the moment, it imports Vulkan drivers (ICDs) from the host system, but ignores layers.
But probably I don't understand the Pressure Vessels good enough. I haven't tried them yet. How do you "steal" libraries from the host system? Is the hosts filesystem available? Because for EGL or Vulkan you need various configuration files as well, and ideally I would want primus-vk to work as well, also inside the pressure vessels.
It's a container runtime system a bit like Flatpak, intended to give games a predictable library stack to run on. However, because the predictable library stack that is currently used by all Steam games (Steam Runtime 1 'scout') is ancient, it does not contain a graphics driver that will support anyone's hardware. This means we have to import the graphics drivers (GLX, EGL, Vulkan, VDPAU, VA-API) from the host system, together with any of their dependencies that are newer than what's in the container (notably glibc and libstdc++), and arrange for them to be used in preference to the container's versions.
Parts of the host filesystem are available, but at a different filesystem location (in /run/host
), so we have to rewrite the JSON files describing EGL and Vulkan drivers (ICDs) to point to the driver's location inside the container. We already do this for the ICDs, but not for layers yet.
At some point in the future, when all this is more reliable, we'll start offering a second, newer, container-only library stack (Steam Runtime 2 'soldier'), and newer games will start to be compiled against that instead of scout. However, because the library stack is meant to be a fixed point but the host graphics drivers need to be at least as recent as the user's hardware, we will still need to combine 'soldier' with the host system's graphics stack to get the final environment to be used for the game.
See https://github.com/ValveSoftware/steam-runtime/tree/master/doc and https://archive.fosdem.org/2020/schedule/event/containers_steam/ for more details.
For reference, what you're currently using (and proposing changes for) is: https://github.com/ValveSoftware/steam-runtime/blob/master/doc/possible-designs.md#2018-ld_library_path-scout-runtime
and the container-based setup is: https://github.com/ValveSoftware/steam-runtime/blob/master/doc/possible-designs.md#2018-ld_library_path-scout-runtime--2019-pressure-vessel-scout-platform
Ok, thanks for the reference. Personally I have not need to run steam itself with primus/pvkrun, as I am totally happy with just wrapping the individual games. It's a bit annoying to add pvkrun
to all individual games' command, but that's ok for me.
So I guess we are trying to solve 2 (rather distinct) problems:
LD_LIBRARY_PATH
-runtime: Here adding the LD_LIBRARY_PATH
where I suggested (essentially treating it as "host given"), would solve the problem. I am not sure how common users are that individually mess up their LD_LIBRARY_PATH
, but I would assume they are rare.We already do this for the ICDs, but not for layers yet.
Is the code that tries to scrape the ICDs of the host system openly available? I didn't see that.
Is the code that tries to scrape the ICDs of the host system openly available?
Not as a git repository yet (there are some infrastructure issues that need to be solved first), but source code is available: https://repo.steampowered.com/steamrt/pool/main/s/steam-runtime-tools/ (take the latest .tar.gz
).
In older versions, steam-runtime-tools provided infrastructure and diagnostic tools, and https://repo.steampowered.com/steamrt/pool/main/p/pressure-vessel/ did the container setup. Newer versions have been combined into one project, so pressure-vessel is just a subdirectory in the steam-runtime-tools source release.
get pvkrun/primusrun into the pressure vessel
We don't need the wrapper script (and we can't rely on arbitrary programs from the host working in the container anyway) - but we do need the libraries that it causes the game to load, and their dependencies.
My understanding is that Primus is mainly relevant for the Fermi/GF1xx (GeForce 4xx) generation of GPUs when using the 390.x "legacy" proprietary drivers, because:
Is that accurate?
We don't need the wrapper script
How would a user then start primus/primus_vk in the pressure vessel? Do you want to require a user who wants to run games via primus/primus_vk to also invoke steam via primus/primus_vk. I'd prefer to run steam in non-hybrid mode and use that only for individual games. That's why I would like explicit support for primus/primus_vk inside the pressure vessel. For activating primus/primus_vk from inside the pressure vessel you would need to mount the bumblebee socket inside the container. How is this generally handled? Are the X-Server sockets bind-monted inside? What about pulseaudio?
Your observations regarding the state of primus is accurate in theory, but not fully in practice:
(Wayland also plays into this whole mess.... Primus-Vk works flawlessly on wayland)
I've just opened the pressure-vessel source. I will look at it in more detail, but right now I noticed a particularity:
You copy icd json files from /usr/share/vulkan/icd.d
. However the vulkan loader also looks in /etc/vulkan/icd.d/
, which some distributions use to exclusively store all their icds. I assume that you left out ~/.local/share/vulkan/icd.d
deliberately, as you want to exclude changes by the current user explicitly, right?
The fundamental problem here is that Primus is complicated, and putting games in containers or another predictable runtime environment is complicated, so combining the two has a lot of moving parts. The more code paths we have, the more bugs we have; so don't expect this to be something that can be implemented immediately.
Inserting Primus into the traditional LD_LIBRARY_PATH
-based runtime is (as you've noticed) a lot quicker than putting it into pressure-vessel, but the traditional LD_LIBRARY_PATH
-based runtime tends to be rather fragile (all sorts of OS configurations can break it, and changes can have unexpected consequences), so we need to be careful not to cause regressions.
You copy icd json files from /usr/share/vulkan/icd.d. However the vulkan loader also looks in /etc/vulkan/icd.d/, which some distributions use to exclusively store all their icds
srt_system_info_list_vulkan_icds()
should look in all the same places as the reference Vulkan loader. Please open a separate issue if it doesn't. The steam-runtime-system-info
diagnostic tool looks in the same locations (it calls the same function), and should list all your Vulkan ICDs.
pressure-vessel writes them into /overrides/share/vulkan/icd.d
, and configures the Vulkan loader to look there exclusively.
How would a user then start primus/primus_vk in the pressure vessel?
It can still be in a game's launch options, which would wrap the whole container process tree with primus (at least in the latest development versions, in conjunction with the Steam beta - I'm not so sure about older versions).
Are the X-Server sockets bind-monted inside?
Yes. If Primus uses multiple X servers then we might have to bind-mount the entire /tmp/.X11-unix
rather than individual server sockets, which is unfortunate but doable.
pressure-vessel writes them into
/overrides/share/vulkan/icd.d
, and configures the Vulkan loader to look there exclusively.
Ooh yes, sorry. I looked too quick.
so we need to be careful not to cause regressions.
Ok, so how do we do that? I'd like to focus on the LD_LIBRARY_PATH
-based runtime for now and try to tackle pressure vessel when that works.
Just a quick note, in case it interests you: I remembered you saying in the talk that pressure vessel currently does not create a pid namespace as steam gets confused when the pids of its games are wrong. I have another reason what might be tricky: (I run my steam + all games started from steam in a small container [created by a 200 line shellscript] mounting large parts of the host-os read-only and individually mounting X-Server and pulseaudio, and it works surprisingly well. I use a pid-ns to isolate the whole steam tree from the rest of the system) When mounting pulseaudio inside a new pidns where the pulseaudio server is not visible, pulseaudio --check
will fail, even though pulseaudio itself is working perfectly fine. This is caused by pulseaudio checking the availability of its server by looking inside the pulseaudio.pid file and not checking the socket. I had at least one game, that invoked pulseaudio --check
to see if it could use pulseaudio, and if that wasn't the case, fell back to a broken alsa audio output. Summing it up: the game developers said "it works on Ubuntu, so we can't do anything more for you", and the only possibility that remained was providing a new pulseaudio
executable inside the container, that faked the output of --check
-
Just as this issue has been dormant for some time now:
Could we please fix the LD_LIBRARY_PATH
-based runtime now? Waiting will not make it any more clear how the correct change will look like and will not make the LD_LIBRARY_PATH
-based runtime less fragile. If you think that any steps/testing/"simpler version of the change" is needed before that I'd be glad to help there. What would you think are the next steps for adjusting this behavior?
I'd be also happy to help integrating bumblebee, primus and primus-vk into the pressure-vessel system, but I guess this should be discussed in a separate issue. Setting up primus and primus-vk is not that complicated. (primus probably a bit more than primus_vk even). For primus you need to make the primus LD_LIBRARY_PATH-Folder the bumblebee socket and the socket of the ephemerally spawned X-Server available. For primus_vk it's just a vulkan-layer that would need to be copied with its one shared object and its json file. However primus_vk depends (to some extent) on primus, so the primus-complexity is "included" there.
Could we please fix the LD_LIBRARY_PATH-based runtime now?
It's in progress, but there are other, higher-priority things also in progress.
Next beta client update (> Oct 28) will change how the host's LD_LIBRARY_PATH settings are used to setup the scout runtime LD_LIBRARY_PATH (pinned libraries, then host LD_LIBRARY_PATH, then the rest of the scout runtime library paths).
Primus-vk is currently [...] just a simple Vulkan Layer.
We will need to teach pressure-vessel about Vulkan layers to make this work. At the moment, it imports Vulkan drivers (ICDs) from the host system, but ignores layers.
At least the soldier pressure-vessel uses the --import-vulkan-layers
option. Is that a new development (which your comment predates) or something totally unrelated? (The WRAP_GUI didn't start a terminal inside the container for me, so I couldn't check it out in detail yet.)
I'm investigating this because my kids' favourite games depend on Proton 5.13, which requires the soldier runtime and pressure-vessel AFAIUI, which is incompatible with Primus(-VK), which is the only practical way of using the discrete card (short of upgrading to bullseye and keeping the dedicated GPU powered up all the time, right?)
Is that a new development (which your comment predates)
Yes. pressure-vessel now tries to pull in Vulkan layers, although there have been several bugs in that after the initial implementation, which we're continuing to fix (as with everything else in this ecosystem, it's more complicated than it looks).
Yay, it's great to see progress, thanks for the info! It certainly looks complicated, so no need to look for excuses; I botched much simpler things on first try. Second, even. :) I'm happy to help with testing (or whatever) if needed.
This wasn't easily searchable, so for other people stumbling on this: nvidia-xrun
also sets LD_LIBRARY_PATH
and doesn't work with the new runtime. Is there any workaround that we could use to go back to the old behavior temporarily?
@fxthomas Please open a separate issue with full details, including as much context as you can provide for what nvidia-xrun
does and why it is necessary.
Getting arbitrary third-party code across a container boundary is not straightforward, so if we let the scope of an issue grow to "it isn't possible to inject arbitrary third-party code into every game process" then it is unlikely to be feasible to resolve it. If the scope of an issue is something narrower, then that might be solvable.
Please open a separate issue with full details, including as much context as you can provide for what nvidia-xrun does and why it is necessary.
I will, thanks for the clarification!
Your system information
Please describe your issue in as much detail as possible:
As of commit 7f8e8411236551cdcebd1cfd09de3758c2c8015a an external
LD_LIBRARY_PATH
is not handled correctly anymore. When an externalLD_LIBRARY_PATH
is set, the semantics are that those directories take precedence over the paths hardcoded byld.so
.When running an application with
LD_LIBRARY_PATH
the effective path is:LD_LIBRARY_PATH
With steam-runtime, the effective (observed)
LD_LIBRARY_PATH
is now constructed as follows:The correct (expected) order would be:
Steps for reproducing this issue:
/proc/<pid of steam>/environ
)Concrete impact:
Correct handling of LD_LIBRARY_PATH is important for primus and primus-vk, as the activation of primus happens through overriding the system
libGL.so.1
with the one provided by primus, by settingLD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/primus:${LD_LIBRARY_PATH}
, where the primuslibGL.so.1
is located under/usr/lib/x86_64-linux-gnu/primus/libGL.so.1
effectively overriding the system-provided one (which is then internally loaded by the primuslibGL.so.1
).With the mechanism the steam runtime uses, this breaks activation for primus, as now the LD_LIBRARY_PATH is:
In this configuration the system libraries (number 2) take precedence over the primus libs (number 4). This prevents primus from activating and launching steam (or the games launched by steam) with a dedicated graphics card enabled.
Note that, when activating primus inside the steam runtime (e.g. by setting the game launch command to
pvkrun %command%
) the effective order:LD_LIBRARY_PATH
.