NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

GNOME Wayland fails to initialize NVIDIA dGPU if secondary screen is directly attached to it in hybrid setups #161

Open Gert-dev opened 2 years ago

Gert-dev commented 2 years ago

NVIDIA Driver Version 515.43.04

GPU NVIDIA RTX 2070 Super Max-Q (dGPU)
CPU Intel® Core™ i9-10980HK × 16 (iGPU, using modesetting driver)

Describe the bug In a hybrid-graphics setup with an Intel iGPU and an NVIDIA dGPU, attaching a secondary monitor over DisplayPort to a port wired directly to the NVIDIA dGPU causes errors on GNOME 42.1. GNOME then falls back to using the NVIDIA dGPU only as a KMS device (display target), not as a render device, resulting in very low performance on that monitor because rendering happens on the iGPU instead.

mutter 42.1 prints:

gnome-shell[1280]: Secondary GPU initialization failed (Failed to create gbm_surface: Operation not permitted). Falling back to GPU-less mode instead, so the secondary monitor may be slow to update.

This is due to the change in this MR, which added the graceful fallback; before it, mutter simply crashed entirely.

The NVIDIA driver prints:

NVRM osCreateOsDescriptorFromDmaBufPtr: osCreateOsDescriptorFromDmaBufPtr(): Error (86) while trying to import dma_buf!

Additional relevant mutter bug reports:

This issue also exists with the proprietary driver, hence the existing reports there, and it looks like the exact same issue exists with this open driver.

Also, to clarify, a single display works fine with hybrid graphics. This is not quite the same as #98 as that also applies to a situation with just a single display.

To Reproduce

  1. Use a laptop or device with an Intel iGPU and integrated display, and a DisplayPort (over USB-C, in my case) port wired directly to the NVIDIA dGPU.
  2. Log in to your account on GDM and GNOME 42.1.
  3. The screens will flash, and the secondary monitor eventually turns on, but performance is very poor on it.

Alternatively, just plugging in the secondary monitor after being logged in also exposes the issue.

It might also be reproducible with any single-GPU device that has an external (e.g. Thunderbolt) dGPU with a monitor attached to it.

Expected behavior The above errors are not printed, and mutter renders optimally for the secondary monitor without dropping the NVIDIA dGPU. Preferably, the NVIDIA dGPU both renders the content destined for the secondary monitor and outputs it to that monitor directly, without an unnecessary copy back to the iGPU (assuming mutter assigns rendering for the secondary monitor to the NVIDIA dGPU as well, so the copy can be avoided when the dGPU acts as both renderer and KMS device).

Please reproduce the problem, run nvidia-bug-report.sh, and attach the resulting nvidia-bug-report.log.gz.

nvidia-bug-report.log.gz

EDIT: Note that the log above is me on my system with only the NVIDIA dGPU active (and not the Intel iGPU), as I have the iGPU disabled through a special BIOS option due to the above problems. If a dump is desired of the Optimus setup, let me know.

aritger commented 2 years ago

Thanks for the detailed bug report. Knowing that it also reproduces with the proprietary kernel modules helps. Could you also check if this is a recent regression across NVIDIA driver releases?

bayasdev commented 2 years ago

Could you also check if this is a recent regression across NVIDIA driver releases?

Screens attached to the dGPU have never worked for me on hybrid systems under Wayland. My current device has a GTX 1060, which isn't supported by this open driver, but it's affected as well.

According to another NVIDIA developer this issue is being tracked on bug 3644077: https://forums.developer.nvidia.com/t/external-monitor-doesnt-work-on-wayland/214090/4

ghtesting2020 commented 2 years ago

Thanks for the detailed bug report. Knowing that it also reproduces with the proprietary kernel modules helps. Could you also check if this is a recent regression across NVIDIA driver releases?

This is also an issue for KDE Plasma on Wayland. It's one of the showstoppers keeping me stuck on X11 and one of the reasons I was looking at getting an AMD card next time. However, now that NVIDIA is welcoming the open source community and this may get fixed, my plans may change and I can stick with NVIDIA.

Gert-dev commented 2 years ago

Could you also check if this is a recent regression across NVIDIA driver releases?

As mentioned by geminis3 above, it is indeed not a regression; it has never worked with the (proprietary) drivers for me either. I've been testing GNOME Wayland with and without Optimus on every new stable driver release since EGLStreams-based Wayland support was added, and IIRC Optimus on GNOME Wayland didn't work properly at all until GBM support landed a couple of releases ago (November 2021, IIRC), which is when this issue was immediately introduced.

A footnote is that it initially crashed mutter on GNOME entirely when plugging in a second monitor, until they added the graceful degradation in 42.1 (see OP), which simply ignores the NVIDIA dGPU for rendering, and is the state we are in now.

qumaciel commented 2 years ago

Thanks for the detailed bug report. Knowing that it also reproduces with the proprietary kernel modules helps. Could you also check if this is a recent regression across NVIDIA driver releases?

This issue is also present in Sway, and also happens using HDMI, not just DP. Cf. discussion when we tried to devise a general sway setup back then: https://forums.developer.nvidia.com/t/nvidia-495-on-sway-tutorial-questions-arch-based-distros/19221 .

pablodz commented 2 years ago

Same here with an old GTX 1050.

needlesslygrim commented 2 years ago

This also doesn't work for me on GNOME with the NVIDIA proprietary drivers and the following setup:

CPU: AMD Ryzen 5800H
GPU: AMD Ryzen 5800H Vega (iGPU, connected to internal display)
GPU: NVIDIA RTX 3070 Max-Q (dGPU, connected to all display outs)

I have an external monitor connected via HDMI. If I try to log in to GNOME on Wayland, the external display shows nothing; X11 works fine.

thesword53 commented 2 years ago

Maybe it's related to #175.

erik-kz commented 2 years ago

This is a known issue with GNOME (or mutter, more specifically). If Sway implemented multi-GPU support in a similar way, it would also be affected. The root of the problem is that, while the GBM API allows specifying the format of a buffer using DRM format modifiers, this is not sufficient to ensure the buffer can be shared across devices. For instance, there is no way to control whether the buffer is allocated in video memory or system memory. The full and proper solution will probably look something like the "allocation constraints" system described in this XDC talk: https://www.youtube.com/watch?v=HZEClOP5TIk
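To make the gap concrete, here is a minimal, hypothetical C sketch (not taken from mutter or Sway) of cross-device allocation through GBM. The render-node path, resolution, and format are illustrative assumptions; it needs libgbm headers and a DRM device to build and run. Note that the modifier list lets two drivers agree on a tiling layout, but nothing in the call expresses where the memory lives (VRAM vs. system RAM), which is exactly the missing constraint described above.

```c
/* Hedged sketch: GBM allocation with explicit modifiers.
 * Build with: gcc sketch.c -lgbm   (requires a DRM render node) */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <gbm.h>

int main(void)
{
    /* Device path varies per system; an assumption for illustration. */
    int fd = open("/dev/dri/renderD128", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct gbm_device *gbm = gbm_create_device(fd);
    if (!gbm) { fprintf(stderr, "gbm_create_device failed\n"); return 1; }

    /* DRM_FORMAT_MOD_LINEAR (0) is the one modifier most devices can
     * consume; in a real compositor this list would come from querying
     * the *other* GPU's supported modifiers. */
    uint64_t modifiers[] = { 0 /* DRM_FORMAT_MOD_LINEAR */ };

    struct gbm_bo *bo = gbm_bo_create_with_modifiers(
        gbm, 1920, 1080, GBM_FORMAT_XRGB8888, modifiers, 1);
    if (!bo) { fprintf(stderr, "allocation failed\n"); return 1; }

    /* The driver reports which modifier it picked -- but there is no
     * equivalent query or control for memory placement. */
    printf("modifier = 0x%llx\n",
           (unsigned long long)gbm_bo_get_modifier(bo));

    gbm_bo_destroy(bo);
    gbm_device_destroy(gbm);
    return 0;
}
```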

For what it's worth, though, multi-GPU does work with KDE / Kwin because it uses an EGLSurface created from a pitch-linear gbm_surface as the "shared" buffer between devices. In that case, we make the slightly dodgy assumption that it should be allocated in system memory, since this is needed for PRIME render offload to work. Mutter, on the other hand, uses a plain gbm_bo attached to a GL framebuffer object, so that hack doesn't help.

Gert-dev commented 2 years ago

@erik-kz If I understand correctly, it's a bit like the tranches for DMA-BUF, where you can negotiate specifically what you want and is available, but for buffers in GBM?

Is this issue known upstream already in the form of an issue or discussion thread (Mesa, if I understood correctly that GBM is part of it)? I'm willing to create such an issue upstream if it is convenient, if you can point me to the right project, just to get the gears in motion.

jadahl commented 2 years ago

For what it's worth, though, multi-GPU does work with KDE / Kwin because it uses an EGLSurface created from a pitch-linear gbm_surface as the "shared" buffer between devices

@erik-kz If you mean allocate with GBM_BO_USE_LINEAR then mutter does this too when a buffer is supposed to be shareable from the iGPU to the dGPU.

erik-kz commented 2 years ago

If you mean allocate with GBM_BO_USE_LINEAR then mutter does this too when a buffer is supposed to be shareable from the iGPU to the dGPU.

I think part of the problem with mutter is that the default "COPY_MODE_SECONDARY_GPU" tries to bind the linear shared buffer to GL_TEXTURE_2D on the secondary GPU, which our driver does not support. This is indicated by eglQueryDmaBufModifiersEXT setting external_only to true for DRM_FORMAT_MOD_LINEAR.
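As a hedged illustration of that check: under the EGL_EXT_image_dma_buf_import_modifiers extension, the per-modifier external_only flags are reported by eglQueryDmaBufModifiersEXT. The helper name and the fixed array size below are assumptions for the sketch; it needs EGL dev headers and a live EGLDisplay to actually run.

```c
/* Hedged sketch: discover whether a modifier is "external only" on a
 * given GPU, i.e. only bindable as GL_TEXTURE_EXTERNAL_OES and not as
 * GL_TEXTURE_2D (the binding mutter's COPY_MODE_SECONDARY_GPU needs). */
#include <EGL/egl.h>
#include <EGL/eglext.h>

/* Returns EGL_TRUE/EGL_FALSE, or -1 if the modifier (or the extension)
 * is not supported at all. */
static int is_external_only(EGLDisplay dpy, EGLint format, EGLuint64KHR wanted)
{
    PFNEGLQUERYDMABUFMODIFIERSEXTPROC query =
        (PFNEGLQUERYDMABUFMODIFIERSEXTPROC)
            eglGetProcAddress("eglQueryDmaBufModifiersEXT");
    if (!query)
        return -1;

    EGLint n = 0;
    /* First call with max_modifiers = 0 just reports the count. */
    query(dpy, format, 0, NULL, NULL, &n);

    EGLuint64KHR mods[64];
    EGLBoolean external[64];
    if (n > 64) n = 64; /* illustrative cap */
    query(dpy, format, n, mods, external, &n);

    for (EGLint i = 0; i < n; i++)
        if (mods[i] == wanted)
            return external[i]; /* EGL_TRUE: no GL_TEXTURE_2D binding */
    return -1;
}
```

A caller would pass e.g. DRM_FORMAT_XRGB8888 (from drm_fourcc.h) as the format and DRM_FORMAT_MOD_LINEAR as the modifier; per the comment above, the NVIDIA driver would report EGL_TRUE for that combination.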

Looking at the latest Kwin code more closely, the EGLSurface thing I mentioned might not be correct. I think the difference might actually be that Kwin's default works more like mutter's "COPY_MODE_PRIMARY_GPU" where the copy to the shared buffer happens on the primary GPU and then the secondary GPU just imports it and scans it out.

needlesslygrim commented 1 year ago

Is there any progress on this issue?

NoTuxNoBux commented 1 year ago

The error from the OP seems to be slightly different for me now (using the new 525 drivers and mutter 43.1):

Secondary GPU initialization failed (Failed to create gbm_surface: No such file or directory). Falling back to GPU-less mode instead, so the secondary monitor may be slow to update.

I also no longer see the Nvidia driver error.

In case it helps, this is a link to the mutter code that fails and logs the warning (this line creates the GBM error). Flags passed to gbm_surface_create appear to be GBM_BO_USE_SCANOUT | GBM_BO_USE_RENDERING.
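For reference, a hedged reconstruction of the failing call follows. Only the flags are taken from the comment above; the wrapper name, dimensions, and format are illustrative assumptions, and it needs libgbm plus a real gbm_device to execute.

```c
/* Hedged sketch: the kind of gbm_surface_create call that fails here.
 * A gbm_surface is a swapchain of buffers that must be both renderable
 * and scanout-capable -- exactly the combination that the secondary GPU
 * rejects in this thread. */
#include <gbm.h>

struct gbm_surface *create_secondary_surface(struct gbm_device *gbm,
                                             int width, int height)
{
    /* Returns NULL on failure, with errno indicating why (the thread
     * reports EPERM, ENOENT, and ENOSYS across driver versions). */
    return gbm_surface_create(gbm, width, height, GBM_FORMAT_XRGB8888,
                              GBM_BO_USE_SCANOUT | GBM_BO_USE_RENDERING);
}
```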

lisuke commented 1 year ago

OS: Arch Linux x86_64
DE: Hyprland
GPU: Intel CoffeeLake-H GT2 [UHD Graphics 630]
GPU: NVIDIA Quadro RTX 5000 Mobile / Max-Q

[ 1607.885648] NVRM osCreateOsDescriptorFromDmaBufPtr: osCreateOsDescriptorFromDmaBufPtr(): Error (86) while trying to import dma_buf!

w9n commented 1 year ago

[ 1607.885648] NVRM osCreateOsDescriptorFromDmaBufPtr: osCreateOsDescriptorFromDmaBufPtr(): Error (86) while trying to import dma_buf!

Same issue on Sway with an NVIDIA eGPU over Thunderbolt.

CoelacanthusHex commented 1 year ago
Graphics:
  Device-1: NVIDIA GA107BM [GeForce RTX 3050 Mobile] driver: nvidia
    v: 525.89.02
  Device-2: AMD Cezanne [Radeon Vega Series / Radeon Mobile Series]
    driver: amdgpu v: kernel
  Device-3: IMC Networks Integrated Camera type: USB driver: uvcvideo
  Display: wayland server: X.org v: 1.21.1.7 with: Xwayland v: 22.1.7
    compositor: kwin_wayland driver: X: loaded: modesetting unloaded: nvidia
    dri: radeonsi,nouveau gpu: nvidia,amdgpu resolution: 1920x1080
  API: OpenGL v: 4.6 Mesa 22.3.4 renderer: AMD Radeon Graphics (renoir LLVM
    15.0.7 DRM 3.49 6.1.11-zen1-1-zen)

OS: Arch Linux x86_64 DE: KDE Wayland

NVRM osCreateOsDescriptorFromDmaBufPtr: osCreateOsDescriptorFromDmaBufPtr(): Error (86) while trying to import dma_buf!

petersaints commented 1 year ago

I know it's not an ideal solution, but other than using X11 instead of Wayland, you can simply disable the Intel GPU if your laptop allows it. On my laptop I'm able to do that, and it runs well when both the laptop screen and the external monitor run from the NVIDIA GPU. However, this is not ideal: you will use more battery when the laptop is disconnected from external screens, and you always have to reboot to change the setting.

ankurdhama commented 1 year ago

In version 535 the error message is Secondary GPU initialization failed (Failed to create gbm_surface: Function not implemented).

ankurdhama commented 7 months ago

Is there any progress on this issue in 545, or do we have to wait a few more decades?

Mouwrice commented 7 months ago

Is there any progress on this issue in 545, or do we have to wait a few more decades?

@ankurdhama fixes are being made around this issue on the gnome / mutter side of things.

There are multiple issues and MRs open, but this should give you a good entry point: https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3304#note_1910885

But if it depends on NVIDIA, then we are probably better off switching to a respectable company that does not abuse its customers.

vanvugt commented 5 months ago

Yes, a complete fix is coming in mutter!3304. But part of that is a workaround for an NVIDIA driver quirk causing the "Failed to create gbm_surface" error.

Also I suspect "open-gpu-kernel-modules" is the wrong NVIDIA project to track this in.

ankurdhama commented 2 months ago

I had a look at https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3304 and it seems the fix in that MR is to make mutter use "copy mode gpu" instead of the much slower "copy mode cpu". In "copy mode gpu", mutter uses a shader program to copy the buffer from system RAM to the GPU framebuffer. This improves performance a bit, but I think the end goal is "copy mode zero" for sharing the buffer between the iGPU and the NVIDIA dGPU. I am not sure what changes the NVIDIA driver needs so that "copy mode zero" works, or whether it's even possible on an iGPU + dGPU machine. Is it that the NVIDIA driver needs to support dma-buf import to make this happen?

@vanvugt does the above summary look correct?
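To illustrate the zero-copy path being asked about, here is a hedged sketch (the function names and dimensions are illustrative, not mutter's actual code): the producing GPU exports its buffer as a dma-buf file descriptor, and the consuming GPU imports that fd without any intermediate copy. The import side is where the osCreateOsDescriptorFromDmaBufPtr errors quoted in this thread originate; when the import fails, the compositor has no choice but to fall back to one of the copy modes.

```c
/* Hedged sketch of zero-copy buffer sharing between two GPUs via dma-buf.
 * Requires libgbm and a real gbm_device; names here are illustrative. */
#include <gbm.h>
#include <unistd.h>

/* Producing (e.g. iGPU) side: allocate linear so the other device can
 * consume the layout, then export a dma-buf file descriptor. */
int export_shared_buffer(struct gbm_device *igpu, struct gbm_bo **out_bo)
{
    struct gbm_bo *bo = gbm_bo_create(igpu, 1920, 1080, GBM_FORMAT_XRGB8888,
                                      GBM_BO_USE_LINEAR | GBM_BO_USE_SCANOUT);
    if (!bo)
        return -1;
    *out_bo = bo;
    /* The returned fd is a dma-buf handle that can be passed to another
     * device's driver; the caller owns it and must close() it. */
    return gbm_bo_get_fd(bo);
}
```

The consuming (e.g. dGPU) side would then import that fd, for instance as an EGLImage via EGL_LINUX_DMA_BUF_EXT for sampling, or directly as a KMS framebuffer via drmPrimeFDToHandle() plus drmModeAddFB2() for scanout. "Copy mode zero" corresponds to this import succeeding end to end.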