NVIDIA / egl-wayland

The EGLStream-based Wayland external platform
MIT License

Weird crashes on system with dual NVIDIA dGPUs under GNOME 44 Wayland #82

Open KaleidonKep99 opened 1 year ago

KaleidonKep99 commented 1 year ago

Hello. I am having an issue that is closely related to issue #78.

I am using GNOME 44 under Fedora 38, with the latest NVIDIA drivers from RPMFusion. My computer has two GPUs; the first one is a 1660 Super, which is connected to the first PCIe slot and handles all of my screens, while the second one is a 750 Ti, which I mainly use for small CUDA workloads and for encoding on OBS on Windows.

Since I was thinking about moving from Windows to Linux, I decided to give Fedora a try. I installed it, got the NVIDIA drivers installed through RPMFusion, and it restarted fine. I noticed, though, that most of the apps wouldn't start up, instead showing the edges of their windows for a split second before disappearing. I tried switching to X11 and that did fix the issue, but since my main screen runs at a high refresh rate, staying on X11 would mean having the UI locked at 60Hz. I switched back to Wayland, and following the log output from journalctl -f while running one of the applications that crash, I see this error:

...
May 28 15:03:27 [redacted] kernel: [drm:nv_drm_prime_fence_context_create_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00002300] Failed to allocate fence signaling event
...

Firefox seems to give out more info, claiming that more than one GPU from the same vendor was detected via PCI.

...
May 28 15:08:52 [redacted] kernel: [drm:nv_drm_prime_fence_context_create_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00002300] Failed to allocate fence signaling event
May 28 15:08:52 [redacted] firefox.desktop[16966]: Crash Annotation GraphicsCriticalError: |[0][GFX1-]: More than 1 GPU from same vendor detected via PCI, cannot deduce device (t=0.215424) |[1][GFX1-]: Wayland protocol error: [destroyed object]: error 7: failed to import supplied dmabufs: Arguments are inconsistent (for example, a valid context requires buffers not supplied by a
May 28 15:08:52 [redacted] firefox.desktop[16966]:  (t=0.634323) [GFX1-]: Wayland protocol error: [destroyed object]: error 7: failed to import supplied dmabufs: Arguments are inconsistent (for example, a valid context requires buffers not supplied by a
May 28 15:08:52 [redacted] firefox[16966]: Error flushing display: Protocol error
May 28 15:08:52 [redacted] firefox.desktop[17072]: Exiting due to channel error.
...

Checking inxi -Fzx, I see that the Wayland session is rendering on the 750 Ti, the GPU with no displays connected to it.

...
Graphics:
  Device-1: NVIDIA GM107 [GeForce GTX 750 Ti] vendor: Gigabyte driver: nvidia
    v: 530.41.03 arch: Maxwell bus-ID: 23:00.0
  Device-2: NVIDIA TU116 [GeForce GTX 1660 SUPER] vendor: Micro-Star MSI
    driver: nvidia v: 530.41.03 arch: Turing bus-ID: 2d:00.0
...
  Display: wayland server: X.Org v: 22.1.9 with: Xwayland v: 22.1.9
    compositor: gnome-shell driver: X: loaded: N/A
    unloaded: fbdev,modesetting,nvidia,vesa gpu: nvidia,nvidia-nvswitch
    resolution: 1: 1920x1080~60Hz 2: 1920x1080~60Hz 3: 2560x1440~180Hz
    4: 1920x1080~60Hz
  API: OpenGL v: 4.6.0 NVIDIA 530.41.03 renderer: NVIDIA GeForce GTX 750
    Ti/PCIe/SSE2 direct-render: Yes
...

I then proceeded to disable the 750 Ti manually, by doing sudo nvidia-smi drain -p 0000:23:00.0 -m 1, and the output from inxi changed to this:

...
  Display: wayland server: X.Org v: 22.1.9 with: Xwayland v: 22.1.9
    compositor: gnome-shell driver: X: loaded: N/A
    unloaded: fbdev,modesetting,nvidia,vesa gpu: nvidia,nvidia-nvswitch
    resolution: 1: 1920x1080~60Hz 2: 1920x1080~60Hz 3: 2560x1440~180Hz
    4: 1920x1080~60Hz
  API: OpenGL v: N/A renderer: N/A direct-render: N/A
...

Weirdly enough, though, all the applications that kept crashing earlier now work fine. Checking with nvidia-smi, they also seem to be rendering on the right GPU, the one with all the screens connected to it:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1660 S...    Off| 00000000:2D:00.0  On |                  N/A |
| 45%   49C    P0               30W / 125W|   1902MiB /  6144MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2413      G   /usr/bin/gnome-shell                        569MiB |
|    0   N/A  N/A      2960      G   /usr/bin/gnome-software                       4MiB |
|    0   N/A  N/A      3316      G   /usr/libexec/xdg-desktop-portal-gnome         4MiB |
|    0   N/A  N/A      4037      G   /usr/bin/Xwayland                            72MiB |
|    0   N/A  N/A     12703      G   discord-screenaudio                         705MiB |
|    0   N/A  N/A     14059      G   /app/bin/discord-screenaudio                  1MiB |
|    0   N/A  N/A     17193      G   /usr/lib64/firefox/firefox                  126MiB |
+---------------------------------------------------------------------------------------+

My question is, is there a way to force Wayland to use a specific GPU as the main one? Having to disable the 750 Ti means losing my secondary device for CUDA/encoding, which I need for specific workloads.

Full specs of my computer:

AMD Ryzen 5900X @ 5GHz
MSI MPG X570 Gaming Edge WiFi
NVIDIA GeForce GTX 1660 Super
NVIDIA GeForce GTX 750 Ti
NVIDIA driver 3:530.41.03-1.fc38
Fedora 38 Workstation w/ GNOME 44

Installed Wayland packages:

egl-wayland.x86_64                                              1.1.11-3.fc38
libxcb.i686                                                     1.13.1-11.fc38  
libxcb.x86_64                                                   1.13.1-11.fc38
xorg-x11-server-Xwayland.x86_64                                 22.1.9-2.fc38

erik-kz commented 1 year ago

Looking at the choose_primary_gpu_unchecked function in the mutter code-base, it seems that it will use the boot VGA device by default, or an arbitrary device if none of them have that attribute.
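
To see which card the kernel currently marks as boot VGA, something like the following sketch may help. It assumes the attribute mentioned above corresponds to the kernel's boot_vga sysfs file on the PCI device behind each DRM node; the function name list_boot_vga and the overridable root argument are hypothetical and exist only for illustration.

```shell
# Print each DRM primary node with its PCI address and boot_vga flag.
# The sysfs root can be overridden (assumption: only useful for testing);
# on a real system the default /sys/class/drm is the one to use.
list_boot_vga() {
    drm_root="${1:-/sys/class/drm}"
    for card in "$drm_root"/card*; do
        # Skip entries without a boot_vga file (e.g. connector entries).
        [ -f "$card/device/boot_vga" ] || continue
        # The device link points at the PCI device; its name is the address.
        pci_addr=$(basename "$(readlink -f "$card/device")")
        printf '%s %s boot_vga=%s\n' \
            "$(basename "$card")" "$pci_addr" "$(cat "$card/device/boot_vga")"
    done
}
```

Calling list_boot_vga with no arguments prints one line per card; a card reporting boot_vga=1 is the one the firmware treated as the boot display device.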

However, it also looks like you can add a "mutter-device-preferred-primary" udev tag to force it to use a particular device. See https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1562
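
As a rough sketch of what adding that tag could look like, the helper below prints a one-line udev rule for a given DRM node. The helper name mutter_primary_rule, the match on ENV{DEVNAME}, and the rules file name in the usage comment are assumptions made for illustration; only the tag name comes from the merge request linked above.

```shell
# Emit a udev rule that tags one DRM node for mutter.
# Assumption: matching on ENV{DEVNAME} is enough to select the node.
mutter_primary_rule() {
    node="$1"   # e.g. /dev/dri/card1
    printf 'ENV{DEVNAME}=="%s", TAG+="mutter-device-preferred-primary"\n' "$node"
}

# Possible usage (the file name is hypothetical):
#   mutter_primary_rule /dev/dri/card1 | sudo tee /etc/udev/rules.d/61-mutter-primary-gpu.rules
#   sudo udevadm control --reload
#   sudo udevadm trigger --subsystem-match=drm
```

Card numbering can change between boots or after moving cards between slots, so the node passed in should be re-checked each time. The tag is probably only picked up when the session starts, so logging out and back in is likely needed.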

KaleidonKep99 commented 1 year ago

Hi. Thank you for your response. I tried adding a udev tag, but it does not seem to make a difference; the system still tries to do everything on the 750 Ti first. The primary boot VGA device is indeed the 1660 Super, since no displays are attached to the 750 Ti, and I do see the boot screen on the 1660 Super. However, I did notice that the UEFI firmware reports the GOP from the 750 Ti and not from the 1660 Super.

Looking at the lspci output, the 750 Ti is at PCI address 23:00.0, while the 1660 Super is at 2d:00.0. If I understand it correctly, the 750 Ti gets priority when the firmware loads, since it is connected to the chipset, which is the first thing that gets initialized on boot. Could that be the issue?

erik-kz commented 1 year ago

Could we perhaps test that theory by simply swapping the two cards?

KaleidonKep99 commented 1 year ago

That indeed fixes the issue. [image]

Now the issue is GNOME ignoring the mutter primary setting…

I’ll try some stuff in the meantime. Maybe I missed a crucial step while making the udev rule.

KaleidonKep99 commented 1 year ago

I don't know what's wrong. It seems like I'm doing everything properly, yet my setting is ignored. I am now trying to force the rendering onto the 750 Ti, and I moved my screens to it as well, but it still renders on the 1660 Super, which is connected to the PCIe x16 slot of the chipset.

Here's the udev rule being applied at boot. I checked with udevadm and it reports the right values:

P: /devices/pci0000:00/0000:00:03.1/0000:2d:00.0/drm/card1
M: card1
R: 1
U: drm
T: drm_minor
D: c 226:1
N: dri/card1
L: 0
S: dri/by-path/pci-0000:2d:00.0-card
E: DEVPATH=/devices/pci0000:00/0000:00:03.1/0000:2d:00.0/drm/card1
E: DEVNAME=/dev/dri/card1
E: DEVTYPE=drm_minor
E: MAJOR=226
E: MINOR=1
E: SUBSYSTEM=drm
E: USEC_INITIALIZED=8723126
E: ID_PATH=pci-0000:2d:00.0
E: ID_PATH_TAG=pci-0000_2d_00_0
E: NVME_HOST_IFACE=none
E: ID_FOR_SEAT=drm-pci-0000_2d_00_0
E: DEVLINKS=/dev/dri/by-path/pci-0000:2d:00.0-card
E: TAGS=:mutter-device-preferred-primary:uaccess:seat:master-of-seat:
E: CURRENT_TAGS=:mutter-device-preferred-primary:uaccess:seat:master-of-seat:
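
The tag check on the output above can also be scripted. This is a minimal sketch, assuming the tag shows up in the TAGS or CURRENT_TAGS property as it does here; the function name has_mutter_primary_tag is hypothetical.

```shell
# Succeeds if udevadm output on stdin lists the mutter primary tag.
# Accepts both the "E: KEY=value" form shown above and the bare
# "KEY=value" form printed by --query=property.
has_mutter_primary_tag() {
    grep -Eq '^(E: )?(CURRENT_)?TAGS=.*:mutter-device-preferred-primary:'
}

# Possible usage:
#   udevadm info --query=property --name=/dev/dri/card1 | has_mutter_primary_tag && echo tagged
```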

Yet inxi -Fzx still reports the 1660 Super as the main renderer, even with no displays attached to it.

Graphics:
  Device-1: NVIDIA TU116 [GeForce GTX 1660 SUPER] vendor: Micro-Star MSI
    driver: nvidia v: 530.41.03 arch: Turing bus-ID: 23:00.0
  Device-2: NVIDIA GM107 [GeForce GTX 750 Ti] vendor: Gigabyte
    driver: nvidia v: 530.41.03 arch: Maxwell bus-ID: 2d:00.0
...
  Display: wayland server: X.Org v: 22.1.9 with: Xwayland v: 22.1.9
    compositor: gnome-shell driver: gpu: nvidia,nvidia-nvswitch
    resolution: 1920x1080~60Hz
  API: OpenGL v: 4.6.0 NVIDIA 530.41.03 renderer: NVIDIA GeForce GTX 1660
    SUPER/PCIe/SSE2 direct-render: Yes

erik-kz commented 1 year ago

The only other thing I can think of would be to apply the tag to the render node (/dev/dri/renderDXXX) instead of or in addition to the primary node (/dev/dri/card1).
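
To find which render node belongs to a given card, the by-path symlinks under /dev/dri can be resolved. The sketch below assumes the render symlink follows the same naming as the pci-0000:2d:00.0-card link in the udevadm output earlier in this thread; the function name render_node_for_pci and the overridable directory argument are hypothetical.

```shell
# Resolve the render node for a PCI address via /dev/dri/by-path.
# Returns non-zero if no matching symlink exists.
render_node_for_pci() {
    pci_addr="$1"                       # e.g. 0000:2d:00.0
    by_path="${2:-/dev/dri/by-path}"    # overridable, mainly for testing
    link="$by_path/pci-$pci_addr-render"
    [ -e "$link" ] || return 1
    readlink -f "$link"
}

# Possible usage:
#   render_node_for_pci 0000:2d:00.0
```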

If that doesn't work, it might be worth bringing this up with the GNOME devs. They would probably be able to provide more informed guidance.

Oh yeah, I should also mention that the Failed to allocate fence signaling event error message is safe to ignore. Also it should be gone with the latest 535 driver.

KaleidonKep99 commented 1 year ago

I'll try applying it to renderD129 instead then. I'll get back with the results asap.

KaleidonKep99 commented 1 year ago

Nothing, same error:

Jun 01 16:37:08 [redacted] systemd[2148]: Started dbus-:1.2-org.gnome.Nautilus@1.service.
Jun 01 16:37:08 [redacted] nautilus[4838]: Connecting to org.freedesktop.Tracker3.Miner.Files
Jun 01 16:37:09 [redacted] gnome-shell[2329]: meta_window_set_stack_position_no_sync: assertion 'window->stack_position >= 0' failed
Jun 01 16:37:09 [redacted] gnome-shell[2329]: WL: error in client communication (pid 4838)
Jun 01 16:37:09 [redacted] nautilus[4838]: Error flushing display: Protocol error
Jun 01 16:37:09 [redacted] systemd[2148]: Started dbus-:1.2-org.gnome.DiskUtility@1.service.
Jun 01 16:37:09 [redacted] systemd[2148]: dbus-:1.2-org.gnome.Nautilus@1.service: Main process exited, code=exited, status=1/FAILURE
Jun 01 16:37:09 [redacted] systemd[2148]: dbus-:1.2-org.gnome.Nautilus@1.service: Failed with result 'exit-code'.

I'll forward the issue to the GNOME devs.

EDIT: https://gitlab.gnome.org/GNOME/gnome-shell/-/issues/6734