NVIDIA / egl-wayland

The EGLStream-based Wayland external platform
MIT License

egl needs an early out to prevent waking the dGPU unnecessarily #89

Open flukejones opened 9 months ago

flukejones commented 9 months ago

On hybrid laptops from the last two or three years, notably those with NVIDIA RTX 20xx and newer GPUs, the machine tends to have a better/deeper suspend function which puts the dGPU into a very low power state when unused.

Combined with glvnd, this introduces a lag of 1-2 seconds while the dGPU wakes in response to queries, even if it remains unused and the iGPU is used instead. For example, opening the Nautilus file manager is delayed 1-2s while the dGPU wakes. For a lot of apps that use glvnd this ends up being bad UX.

A lot of folks are working around this with __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json.
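As a minimal sketch of that workaround, the wrapper below launches a single program with only the Mesa vendor library visible to glvnd. The function name `run_mesa_only` is made up for illustration, and the JSON path is simply the one quoted above; it may be different on other distributions.

```shell
# Hypothetical helper (not part of glvnd or egl-wayland): run one command
# with glvnd restricted to the Mesa EGL vendor library, so the NVIDIA
# vendor library is never loaded for that process.
run_mesa_only() {
    # Path taken from the workaround quoted above; adjust it if needed.
    env __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json "$@"
}

# Usage:
#   run_mesa_only nautilus
```

Setting the variable per command rather than globally keeps the NVIDIA vendor library available to programs that are meant to run on the dGPU.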

I reported this here some time ago

CosmicFusion commented 8 months ago

yeah, it hurts battery life too

having the gpu wake up and blast its fans every time an app is opened

erik-kz commented 8 months ago

This should be fixed by https://github.com/NVIDIA/egl-wayland/commit/ba6c38ad74cf0ef6ec4d7934f68c17a7a2d460ca

flukejones commented 8 months ago

This should be fixed by ba6c38a

Seems a bit hit and miss, but this is likely due to how some apps (like Firefox, VSCode, Geary, Evolution) handle GPU stuff. These apps will still wake the GPU, but other apps like Nautilus no longer do this.

FiestaLake commented 8 months ago
Gert-dev commented 8 months ago

Seems a bit hit and miss, but this is likely due to how some apps (like Firefox, VSCode, Geary, Evolution) handle GPU stuff. These apps will still wake the GPU, but other apps like Nautilus no longer do this.

Nautilus 45 still opens the GPU with the latest egl-wayland

I see the same behaviour as the first comment with applications such as VSCode (even when using the Wayland backend), but not the last: GTK4 apps that were previously problematic such as Nautilus now no longer start the GPU or have the noticeable delay spinning up - also confirmed by monitoring the dGPU state using watch cat /sys/class/drm/card*/device/power_state.
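For anyone who wants to tell the cards apart while monitoring, here is a small sketch that prints the PCI vendor ID next to each card's power state. The function name `gpu_power_states` and its optional sysfs-root argument are assumptions added for illustration; only the `power_state` path comes from the command above. NVIDIA's PCI vendor ID is `0x10de`.

```shell
# Hypothetical helper: print the PCI vendor ID and power state of every
# DRM card. The optional argument (a sysfs root) exists only so the
# function can be tried against a fake directory tree; it defaults to /sys.
gpu_power_states() {
    root="${1:-/sys}"
    for dev in "$root"/class/drm/card*/device; do
        # The glob also matches connector entries such as card0-eDP-1,
        # which have no power_state file; skip those.
        [ -r "$dev/power_state" ] || continue
        card="$(basename "$(dirname "$dev")")"
        vendor="$(cat "$dev/vendor" 2>/dev/null || echo unknown)"
        printf '%s vendor=%s state=%s\n' "$card" "$vendor" "$(cat "$dev/power_state")"
    done
}

# Usage (poll once per second):
#   while sleep 1; do gpu_power_states; done
```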

Might be worth mentioning for completeness that if the app in question is running in Flatpak, it's not yet fixed likely because the newest release of this library hasn't landed in the base runtimes yet.

FiestaLake commented 8 months ago
  • Nautilus 45 still opens the GPU with the latest egl-wayland

https://youtu.be/gKYoFEvtUJ4

kbrenneman commented 8 months ago

Yeah, anything with Flatpak would need an update to its runtime environment to pick up an updated egl-wayland library.

It might be possible to work around that by using flatpak override --filesystem to map the host's copy of libnvidia-egl-wayland.so.1 through to the container, though at that point it's probably easier to just use the __EGL_VENDOR_LIBRARY_FILENAMES workaround instead.

For other applications, if the app itself (or some other library) tries to call eglQueryDevicesEXT on its own, then it would run into the same problem. Firefox might do that, but I couldn't say for sure -- I think the last time I looked at Firefox's GL code was before Wayland even existed. It would surprise me if something like Geary or Evolution did that, though.

kbrenneman commented 8 months ago

Now that I think about it, if an application calls eglGetDisplay(NULL), or eglGetPlatformDisplay with EGL_PLATFORM_DEVICE_EXT or EGL_PLATFORM_SURFACELESS_MESA then that would also cause the NVIDIA GPU to wake up.

All of those would produce a headless EGLDisplay, without a windowing system associated with it. And without a windowing system, the driver has no way to know which device is driving the desktop.
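One rough way to check whether a given application or library takes one of these paths is to search its source tree for the entry points named above. The sketch below does only a textual search, so it can miss indirect calls and can flag code that never runs; the function name `find_headless_egl_calls` is made up for illustration. `EGL_DEFAULT_DISPLAY` is included because it is the usual spelling of the `NULL` argument.

```shell
# Hypothetical helper: list source lines that mention the EGL entry points
# which can yield a headless EGLDisplay or enumerate devices.
# $1 = root of the source tree to search.
find_headless_egl_calls() {
    grep -rnE \
        -e 'eglQueryDevicesEXT' \
        -e 'EGL_PLATFORM_DEVICE_EXT' \
        -e 'EGL_PLATFORM_SURFACELESS_MESA' \
        -e 'eglGetDisplay[[:space:]]*\([[:space:]]*(NULL|EGL_DEFAULT_DISPLAY)' \
        -- "$1"
}

# Usage:
#   find_headless_egl_calls /path/to/app/source
```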

Gert-dev commented 8 months ago

https://youtu.be/gKYoFEvtUJ4

That's indeed weird - for me it doesn't bring the dGPU out of the D3Cold state. Since I'm assuming Nautilus isn't the experimental Flatpak version, could it be that you have some kind of specific configuration in place that makes the NVIDIA GPU your primary (card0) one? I notice that for me NVIDIA dGPU is card1 and the Intel iGPU card0. Not sure if this has impact anywhere.

For other applications, if the app itself (or some other library) tries to call eglQueryDevicesEXT on its own, then it would run into the same problem. ...

That indeed makes sense, I assume in these cases we'd need to create the relevant issue reports for those projects separately since this is out of egl-wayland's hands?

Firefox and Electron make some sense because IIRC they also handle some iGPU/dGPU 'placement' for things such as WebGL, so it wouldn't surprise me if the underlying code is also querying the available GPUs for that.

I'm also wondering, though, if these specific remaining issues are then also a problem for hybrid GPU setups with an AMD or even Intel dGPU? I have none to test currently, but it might be interesting to mention in upstream reports and make it more testable for developers.

kbrenneman commented 8 months ago

That indeed makes sense, I assume in these cases we'd need to create the relevant issue reports for those projects separately since this is out of egl-wayland's hands?

Most likely, yes. If an app actually does just need to do offscreen rendering, though, then there isn't really a good way to do that without running into this. Either it calls something like eglGetDisplay(NULL) and lets the implementation pick a device (which would result in the NVIDIA driver waking up a GPU), or it would use EGL_EXT_platform_device or EGL_EXT_explicit_device, which would require calling eglQueryDevicesEXT anyway.

I'm also wondering, though, if these specific remaining issues are then also a problem for hybrid GPU setups with an AMD or even Intel dGPU? I have none to test currently, but it might be interesting to mention in upstream reports and make it more testable for developers.

Hard to say. If the driver for the dGPU is Mesa, then it would depend on how Mesa handles device enumeration and selection internally.

kbrenneman commented 8 months ago

I wonder if the GPU offloading configuration proposal for libglvnd could help with this?

Most of the design for that would be about right, but I'll have to think about whether I could tweak that interface to avoid unnecessary internal eglQueryDevicesEXT calls.

FiestaLake commented 8 months ago

https://youtu.be/gKYoFEvtUJ4

That's indeed weird - for me it doesn't bring the dGPU out of the D3Cold state. Since I'm assuming Nautilus isn't the experimental Flatpak version, could it be that you have some kind of specific configuration in place that makes the NVIDIA GPU your primary (card0) one? I notice that for me NVIDIA dGPU is card1 and the Intel iGPU card0. Not sure if this has impact anywhere.

Yes, it's the native nautilus package from Arch. In my case, most of the time the NVIDIA dGPU is card0 and the AMD iGPU is card1, though sometimes the order is reversed. I haven't made any changes.

kbrenneman commented 8 months ago

It just occurred to me that the NVIDIA GBM library has the same problem of calling eglQueryDevicesEXT right away to try to find a matching device, so anything that tries to use EGL_KHR_platform_gbm would run into this as well. I'd be surprised if any application actually used both EGL_KHR_platform_gbm and EGL_KHR_platform_wayland, though.

But, disabling one or both of the wayland and GBM platform libraries would be a way to determine if the application is doing something directly to access an NVIDIA device, or if that's still coming from one of the platform libraries.

The __EGL_EXTERNAL_PLATFORM_CONFIG_DIRS and __EGL_EXTERNAL_PLATFORM_CONFIG_FILENAMES environment variables can control which platform libraries get loaded, like so:

# Disable all platform libraries
__EGL_EXTERNAL_PLATFORM_CONFIG_DIRS=/some/nonexistent/path /path/to/program
# Only load the GBM platform library
__EGL_EXTERNAL_PLATFORM_CONFIG_FILENAMES=/usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json /path/to/program
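
To compare those configurations, it can help to record the dGPU power state before and shortly after launching the program. The following is only a sketch: `wakes_gpu` is a made-up name, the wait time is a guess that may need tuning, and programs that hand off to an already-running instance will not be measured correctly.

```shell
# Hypothetical helper: report the dGPU power state before and after
# starting a command.
# $1 = path to the dGPU's power_state file
# $2 = seconds to wait after launching
# remaining arguments = the command to run
wakes_gpu() {
    state_file="$1"; wait_s="$2"; shift 2
    before="$(cat "$state_file")"
    "$@" >/dev/null 2>&1 &
    pid=$!
    sleep "$wait_s"
    after="$(cat "$state_file")"
    # Stop the program again; ignore the error if it has already exited.
    kill "$pid" 2>/dev/null || true
    printf '%s -> %s\n' "$before" "$after"
}

# Usage, with all platform libraries disabled:
#   wakes_gpu /sys/class/drm/card1/device/power_state 3 \
#       env __EGL_EXTERNAL_PLATFORM_CONFIG_DIRS=/some/nonexistent/path /path/to/program
```

If the state still changes with the platform libraries disabled, the wake-up is presumably coming from the application or another library rather than from egl-wayland or the GBM platform library.
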
erik-kz commented 8 months ago

I'd be surprised if any application actually used both EGL_KHR_platform_gbm and EGL_KHR_platform_wayland, though.

I believe recent versions of WebKit will do this. The web process uses GBM while the GUI process uses Wayland or X11.

marcinx64 commented 8 months ago

Hi, I've noticed that some apps break when applying the ICD JSON file order workaround. Either they don't open at all: (screenshot: qflipper)

Or they are partially broken, with some UI elements not being displayed: (screenshot: egl_wa)

Temporarily removing the workaround makes everything work again (except for waking up the NVIDIA GPU): (screenshot: egl_no_wa)

Is it something related to those apps/flatpak runtime? Or is it also a bug in EGL?

kbrenneman commented 8 months ago

Is it something related to those apps/flatpak runtime? Or is it also a bug in EGL?

That depends -- what are the contents of that egl_vendor.d directory?

marcinx64 commented 8 months ago

Right now it looks like this (those are copies from default directory on host):

ls ~/.local/usr/share/glvnd/egl_vendor.d/
50_mesa.json 60_nvidia.json

Basically, it makes no difference whether I use "__EGL_VENDOR_LIBRARY_FILENAMES" and specify the Mesa ICD JSON file first, or use "__EGL_VENDOR_LIBRARY_DIRS" and point it at another directory with the NVIDIA file renamed (10_nvidia.json -> 60_nvidia.json); the issue is the same.
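
For readers who want to reproduce the second variant, a sketch of how such a directory could be built is shown here. The function name `make_reordered_vendor_dir` is made up, and the file names are the ones mentioned in this comment; other systems may name them differently.

```shell
# Hypothetical helper: copy the glvnd EGL vendor JSON files into a private
# directory, renaming the NVIDIA one so that it sorts after Mesa's.
# $1 = source directory, $2 = destination directory.
make_reordered_vendor_dir() {
    src="$1"; dst="$2"
    mkdir -p "$dst"
    cp "$src/50_mesa.json" "$dst/50_mesa.json"
    cp "$src/10_nvidia.json" "$dst/60_nvidia.json"
}

# Usage:
#   make_reordered_vendor_dir /usr/share/glvnd/egl_vendor.d \
#       "$HOME/.local/usr/share/glvnd/egl_vendor.d"
#   __EGL_VENDOR_LIBRARY_DIRS="$HOME/.local/usr/share/glvnd/egl_vendor.d" /path/to/program
```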

kbrenneman commented 8 months ago

I'd need to know more about what the application is trying to do to be sure, but my best guess is that it's using an offscreen EGLDisplay, but there's something in Mesa that it can't cope with. Calling something like eglGetDisplay(NULL) will generally hand back an EGLDisplay from whatever vendor library is first.

If you use __EGL_VENDOR_LIBRARY_FILENAMES to limit it to only load Mesa, do you get the same problem?

marcinx64 commented 8 months ago

If you use __EGL_VENDOR_LIBRARY_FILENAMES to limit it to only load Mesa, do you get the same problem?

Tried, unfortunately it is the same behaviour as using __EGL_VENDOR_LIBRARY_DIRS or __EGL_VENDOR_LIBRARY_FILENAMES "reversed".

I'd need to know more about what the application is trying to do to be sure

I can help with this if I know what you want to check. Any specific command output? My system is: Fedora Silverblue 39, kernel 6.5.6, NVIDIA driver 535.113.01, egl-wayland 1.1.12.

kbrenneman commented 8 months ago

Tried, unfortunately it is the same behaviour as using __EGL_VENDOR_LIBRARY_DIRS or __EGL_VENDOR_LIBRARY_FILENAMES "reversed".

That's enough to confirm my guess: With Mesa as the first (or only) vendor library, the application ends up using Mesa, and something in Mesa is either failing, missing, or behaving in a way that the application can't cope with. It's probably either a simple app bug or some feature that the app needs which Mesa doesn't have.

Either way, though, that means the problem is outside egl-wayland or the nvidia driver.

jrelvas-ipc commented 8 months ago

Using the search functionality in gnome shell wakes the gpu up. I kid you not.

lmao.webm

The sudden spikes in power consumption I kept experiencing might be explained by this...

kbrenneman commented 8 months ago

Using the search functionality in gnome shell wakes the gpu up. I kid you not.

Is that with the current version of egl-wayland?

It wouldn't surprise me if the search function spawned a new wayland client process, and if that's all it is, then commit ba6c38a should fix it.

jrelvas-ipc commented 8 months ago

egl-wayland package is version 1.1.12-3.fc39. Is this the latest version?

kbrenneman commented 8 months ago

No, 1.1.13 is the one that has the fix for this: https://github.com/NVIDIA/egl-wayland/releases/tag/1.1.13
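
A quick way to check an installed version against 1.1.13 is sketched below. The `version_ge` name is made up for illustration, and the comparison relies on the version ordering of GNU `sort -V`; the `rpm` query in the usage comment applies to RPM-based distributions such as Fedora.

```shell
# Hypothetical helper: succeed (exit 0) if version $1 is greater than or
# equal to version $2, using GNU sort's version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on an RPM-based system:
#   installed="$(rpm -q --qf '%{VERSION}\n' egl-wayland)"
#   if version_ge "$installed" 1.1.13; then echo "has the fix"; else echo "too old"; fi
```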

Gert-dev commented 8 months ago

I can attest to 1.1.13 not fixing GNOME shell (45) search waking up the dGPU for me, but, since GNOME uses search providers (GNOME characters, nautilus, ...), it seems likely that one or more of those providers are contributing to the problem by hitting one of the aforementioned paths (by accident or by underlying code being called indirectly).

jrelvas-ipc commented 7 months ago

Using the search functionality of gnome shell no longer wakes up the GPU for me on egl-wayland-1.1.13-1.fc39

Fix appears to work as advertised. @kbrenneman

jrelvas-ipc commented 6 months ago

I've reported the wake up issue on Flatpak programs to upstream: https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/issues/1683

retrixe commented 6 months ago

I'm also wondering, though, if these specific remaining issues are then also a problem for hybrid GPU setups with an AMD or even Intel dGPU? I have none to test currently, but it might be interesting to mention in upstream reports and make it more testable for developers.

Hard to say. If the driver for the dGPU is Mesa, then it would depend on how Mesa handles device enumeration and selection internally.

For me, nouveau behaves the same as the NVIDIA proprietary driver here (experiencing wakeups with Chromium/-based apps, neofetch, GNOME Settings -> About panel), so it's worth noting it's an issue on that side of the fence as well.

jrelvas-ipc commented 5 months ago

https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/issues/1683#note_1713305231

Freedesktop upstream says that they don't ship egl-wayland separately; the binary provided by the NVIDIA driver package is used, which is currently still at 1.1.12.

This is why flatpak programs continue to be affected by this bug.

jrelvas-ipc commented 5 months ago

@erik-kz Is egl-wayland 1.1.13 going to be included with the next nvidia driver major release? If not, is there any timeline to do so? Asking to see if it's worth the trouble for freedesktop's runtime to package it separately.

erik-kz commented 5 months ago

Is egl-wayland 1.1.13 going to be included with the next nvidia driver major release?

Yes it will

jrelvas-ipc commented 5 months ago

@erik-kz Did some testing with https://github.com/flathub/org.freedesktop.Platform.GL.nvidia/pull/229 and confirmed that the outdated egl-wayland release was the issue - the updated lib in the 550.40.07 beta driver fixes the wake up issue in Flatpak programs!

Gravação de ecrã a partir de 2024-01-24 23-13-08.webm

Hobbyist11 commented 2 months ago

Still encountering this with certain electron software like Foliate (an EPUB reader): opening Foliate itself doesn't turn the dGPU on, but opening an e-book does. egl-wayland 1.1.13, nvidia-dkms 550.67-1, kernel 6.8.2.

retrixe commented 2 months ago

Foliate isn't an electron app, it's a GTK app which uses a WebView for rendering e-books in particular

My assumption is opening an e-book initialises WebKit2GTK, which probes GPUs to use and ends up initialising the NVIDIA GPU

jrelvas-ipc commented 2 months ago

Foliate isn't an electron app, it's a GTK app which uses a WebView for rendering e-books in particular

My assumption is opening an e-book initialises WebKit2GTK, which probes GPUs to use and ends up initialising the NVIDIA GPU

As a side-note, if the program is using Vulkan, even if it's just to get a list of available gpus, that'd wake up the nvidia dgpu, due to a similar issue with Nvidia's Vulkan implementation. I reported it here: https://forums.developer.nvidia.com/t/550-67-nvidia-vulkan-icd-wakes-up-dgpu-on-initialization-and-exit/288095

jrelvas-ipc commented 2 months ago

CC @erik-kz, since that particular bug is similar to this egl one, but it's with vulkan instead.

Hobbyist11 commented 2 months ago

Foliate isn't an electron app, it's a GTK app which uses a WebView for rendering e-books in particular

My assumption is opening an e-book initialises WebKit2GTK, which probes GPUs to use and ends up initialising the NVIDIA GPU

Oh sorry my bad!